With a few-months-old git version of Mesa & X server, the perf drop in SynMark Fill* tests was also ~20%.
There also seems to be a few percent kernel-side improvement in SynMark TexMem* tests at the same time, but that's visible only with specific Mesa driver (i965 or Iris) and X server versions. Performance in tests other than the memory-bandwidth ones isn't significantly impacted by the Mesa/Xorg version, only by the kernel.
This drop is specific to SkullCanyon; it's not visible on other platforms (KBL GT3e, SKL/BDW GT2, BXT). While 3D benchmarks are impacted most, there also seems to be a marginal perf drop in Media (transcode) tests.
Although this impacts only SkullCanyon, I'm setting severity to major because the perf drop is so large.
drm/i915: Make i915_vma.flags atomic_t for mutex reduction
drm/i915: Make shrink/unshrink be atomic
are meh.
drm/i915: Whitelist COMMON_SLICE_CHICKEN2
is a possibility, but my money is on
drm/i915: Force compilation with intel-iommu for CI validation
A run with intel_iommu=off should test that theory, or intel_iommu=igfx_off and reverting "HAX iommu/intel: Ignore igfx_off".
We run all tests currently with the "intel_iommu=igfx_off" kernel command line option, and while the author-date in the above intel-iommu/igfx_off commits is within range, their drm-tip repo commit dates are actually from Monday this week, not from a week ago?
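For the record, a quick way to confirm which intel_iommu mode a booted kernel actually got is to pull it out of the command line. A minimal sketch, parsing an inlined copy of the cmdline from this issue instead of the real /proc/cmdline:

```shell
# Sketch: extract the intel_iommu= mode from a kernel command line.
# The sample string below is from this issue; on a live system you
# would read /proc/cmdline instead.
cmdline='BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes intel_iommu=igfx_off ro'
mode=$(printf '%s\n' "$cmdline" | tr ' ' '\n' | sed -n 's/^intel_iommu=//p')
echo "intel_iommu mode: ${mode:-unset (kernel default)}"
```

On the test machines above this prints "intel_iommu mode: igfx_off".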
(Why IOMMU perf impact would be SKL GT4e specific?)
There's no such output for the 2019-09-11 "b27acd37b7de" kernel where this regression was noticed.
There is a difference on kernel IOMMU outputs between these commits though...
Before:
[ 0.625811] DMAR: No ATSR found
[ 0.625842] DMAR: dmar1: Using Queued invalidation
[ 0.625948] pci 0000:00:00.0: Adding to iommu group 0
[ 0.625993] pci 0000:00:08.0: Adding to iommu group 1
[ 0.626054] pci 0000:00:14.0: Adding to iommu group 2
[ 0.626064] pci 0000:00:14.2: Adding to iommu group 2
[ 0.626110] pci 0000:00:16.0: Adding to iommu group 3
[ 0.626159] pci 0000:00:1c.0: Adding to iommu group 4
[ 0.626203] pci 0000:00:1c.1: Adding to iommu group 5
[ 0.626249] pci 0000:00:1c.4: Adding to iommu group 6
[ 0.626296] pci 0000:00:1d.0: Adding to iommu group 7
[ 0.626348] pci 0000:00:1f.0: Adding to iommu group 8
[ 0.626359] pci 0000:00:1f.2: Adding to iommu group 8
[ 0.626368] pci 0000:00:1f.3: Adding to iommu group 8
[ 0.626378] pci 0000:00:1f.4: Adding to iommu group 8
[ 0.626422] pci 0000:00:1f.6: Adding to iommu group 9
[ 0.626467] pci 0000:02:00.0: Adding to iommu group 10
[ 0.626518] pci 0000:3c:00.0: Adding to iommu group 11
[ 0.626523] DMAR: Intel(R) Virtualization Technology for Directed I/O
After:
[ 0.625808] DMAR: No ATSR found
[ 0.625837] DMAR: dmar0: Using Queued invalidation
[ 0.625841] DMAR: dmar1: Using Queued invalidation
[ 0.626033] pci 0000:00:00.0: Adding to iommu group 0
[ 0.632522] pci 0000:00:02.0: Adding to iommu group 1
[ 0.632568] pci 0000:00:08.0: Adding to iommu group 2
[ 0.632634] pci 0000:00:14.0: Adding to iommu group 3
[ 0.632644] pci 0000:00:14.2: Adding to iommu group 3
[ 0.632684] pci 0000:00:16.0: Adding to iommu group 4
[ 0.632746] pci 0000:00:1c.0: Adding to iommu group 5
[ 0.632797] pci 0000:00:1c.1: Adding to iommu group 6
[ 0.632854] pci 0000:00:1c.4: Adding to iommu group 7
[ 0.632911] pci 0000:00:1d.0: Adding to iommu group 8
[ 0.632966] pci 0000:00:1f.0: Adding to iommu group 9
[ 0.632977] pci 0000:00:1f.2: Adding to iommu group 9
[ 0.632988] pci 0000:00:1f.3: Adding to iommu group 9
[ 0.632998] pci 0000:00:1f.4: Adding to iommu group 9
[ 0.633039] pci 0000:00:1f.6: Adding to iommu group 10
[ 0.633096] pci 0000:02:00.0: Adding to iommu group 11
[ 0.633146] pci 0000:3c:00.0: Adding to iommu group 12
[ 0.633233] DMAR: Intel(R) Virtualization Technology for Directed I/O
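The telltale difference between the two logs is the GPU at 0000:00:02.0 being added to an IOMMU group in the "After" output. A minimal sketch that checks a captured log for that line (sample lines inlined for illustration; normally you'd pipe dmesg into the grep):

```shell
# Sketch: the "After" log differs from "Before" by the 00:02.0 (igfx)
# device joining an IOMMU group; its presence means DMA remapping is
# active for graphics. Sample log inlined instead of live dmesg.
log='pci 0000:00:00.0: Adding to iommu group 0
pci 0000:00:02.0: Adding to iommu group 1'
if printf '%s\n' "$log" | grep -q '0000:00:02\.0: Adding to iommu group'; then
    echo "igfx behind IOMMU"
else
    echo "igfx not behind IOMMU"
fi
```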
Other IOMMU / DMAR related dmesg output is identical between the commits:
[ 0.002741] ACPI: DMAR 0x000000007A545CD8 0000A8 (v01 INTEL NUC6i7KY 00000001 INTL 00000001)
...
[ 0.125612] Kernel command line: BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes intel_iommu=igfx_off ro
...
[ 0.174960] DMAR: Host address width 39
[ 0.174962] DMAR: DRHD base: 0x000000fed90000 flags: 0x0
[ 0.174967] DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap 1c0000c40660462 ecap 7e3ff0505e
[ 0.174970] DMAR: DRHD base: 0x000000fed91000 flags: 0x1
[ 0.174975] DMAR: dmar1: reg_base_addr fed91000 ver 1:0 cap d2008c40660462 ecap f050da
[ 0.174978] DMAR: RMRR base: 0x0000007a275000 end: 0x0000007a294fff
[ 0.174980] DMAR: RMRR base: 0x0000007b800000 end: 0x0000007fffffff
[ 0.174983] DMAR-IR: IOAPIC id 2 under DRHD base 0xfed91000 IOMMU 1
[ 0.174985] DMAR-IR: HPET id 0 under DRHD base 0xfed91000
[ 0.174987] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[ 0.176506] DMAR-IR: Enabled IRQ remapping in x2apic mode
PS. This regression is large enough that one run of CSDof is enough to see whether a kernel version is impacted:
./synmark2 OglCSDof
> We run all tests currently with "intel_iommu=igfx_off" kernel command line option, and while the author-date in above intel-iommu/igfx_off commits is within range, their drm-tip repo commit dates are actually from Monday this week, not from a week ago?
The commit is in core-for-CI which is a rebasing tree; on Monday it was rebased to v5.3 so that we could drop some patches. So the commit id will be updated fairly often while it remains in that branch.
> (Why IOMMU perf impact would be SKL GT4e specific?)
My guess at this moment would be that eDRAM feels the hit more significantly. Or that we've just got the caching completely wrong on that sku.
> Also, whereas latest drm-tip kernel shows:
> $ sudo grep mmu /sys/kernel/debug/dri/0/i915_capabilities
> iommu: enabled
>
> There's no such output for the 2019-09-11 "b27acd37b7de" kernel where this
> regression was noticed.
That is a new feature added so that we could easily determine which machines in the farm have iommu enabled.
> There is a difference on kernel IOMMU outputs between these commits though...
>
[snip]
> After:
> [ 0.632522] pci 0000:00:02.0: Adding to iommu group 1
So we definitely enabled iommu on igfx in this range.
> PS. This regression is large enough that one run of CSDof is enough to see
> whether kernel version is impacted:
> ./synmark2 OglCSDof
20+% regression is also in line with some kbl (gt3e iirc) media runs I did.
> > (Why IOMMU perf impact would be SKL GT4e specific?)
>
> My guess at this moment would be that eDRAM feels the hit more significantly. Or that we've just got the caching completely wrong on that sku.
We were supposed to have VT-d disabled from BIOS on all our machines, but apparently it had been enabled while SkullCanyon was in other use for a while, i.e. it was the only machine with VT-d enabled.
> 20+% regression is also in line with some kbl (gt3e iirc) media runs I did.
Media, not 3D? (That's more than I saw on SkullCanyon in media test-cases.)
I've now enabled VT-d on a few other machines (BDW GT2, BXT, SKL GT2, KBL GT3e) to get you a bit more perf info. I'll add that info here later this week.
2-3% 8-bit, max FullHD, FFmpeg/MediaSDK GPU transcode/downscale
[1] With June user-space. With latest Mesa, the Fill* & write tests' drop is only 3%, and TexMem512 perf somehow improves by 2%. Latest Mesa is several percent faster than the older one in these tests due to Mesa's slice/subslice balance optimization; no idea how that can reduce the impact of IOMMU.
[2] With June user-space. With latest user-space, drop in these specific tests is half of that, or less. For fullscreen Triangle case, potentially relevant user-space change could be latest X server disabling atomic commits. See: xorg/xserver#888 (closed)
With the June user-space, there are some differences in how much performance drops, but nothing major like with GT3e & GT4e (where the slice/subslice balance issue had a noticeable impact).
BXT J4205
---------
Results similar to other devices (not reported here as this has higher variances than them). Similarly to SKL GT2, user-space version doesn't have significant impact on how much IOMMU regresses performance.
BDW GT2
-------
As expected, no impact (kernel skips IOMMU for BDW).
Summary
-------
* IOMMU can lose up to a third of performance in the worst synthetic case, and 5-15% in real GPU (3D/Media) use-cases.
* It seems that badly balanced slice/subslice utilization can noticeably increase the IOMMU performance impact for some use-cases.
@ysainan I haven't seen any indication that the situation would have improved (and I assume part of it is HW related). If you can provide some, I can do testing.
@ysainan, why do you keep pinging this? Is there some reason why the situation would have changed?
I don't have any newer HW to test whether they would have HW mitigations to lower performance impact of IOMMU.
I would suggest somebody from the kernel team test this with a single use-case; if some actual improvement is visible, I can do testing with the full set of benchmarks and more HW.
It is likely this is almost purely related to TLB (the IOMMU one) thrashing and can be improved dramatically by using transparent huge pages. The i915 patch to re-enable them is quite simple, but I sadly don't have access to any automated performance testing suite. All I can say is I tested locally on a low-power Skylake GT2 part and, for instance, the regression on OglVSTangent goes from ~14% to 2-3%.
There is a possibility things could be further improved on some benchmarks by triggering less DMA unmap (hence TLB flushes) from i915 but at least for the above benchmark that wasn't the case.
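To make the TLB-thrashing intuition concrete, here's a back-of-the-envelope sketch (not from the thread; the 512 MiB working-set size is an arbitrary illustration): a 2 MiB huge page maps 512 times the address space of a 4 KiB page, so the same working set needs far fewer IOTLB entries, and flush-triggered refills drop accordingly.

```shell
# Sketch: mappings needed to cover an assumed 512 MiB GPU working set
# with 4 KiB base pages vs 2 MiB huge pages.
working_set=$(( 512 * 1024 * 1024 ))
echo "4K mappings needed: $(( working_set / 4096 ))"
echo "2M mappings needed: $(( working_set / (2 * 1024 * 1024) ))"
```

131072 entries vs 256: with huge pages the whole working set can plausibly stay resident in the IOTLB, which would explain a large drop in the regression.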
Patch to test with, if someone has the capability (@eero-t ? ;)), looks like this:
diff --git a/drivers/gpu/drm/i915/gem/i915_gemfs.c b/drivers/gpu/drm/i915/gem/i915_gemfs.c
index 5e6e8c91ab38..afbea13b6cd0 100644
--- a/drivers/gpu/drm/i915/gem/i915_gemfs.c
+++ b/drivers/gpu/drm/i915/gem/i915_gemfs.c
@@ -15,6 +15,7 @@ int i915_gemfs_init(struct drm_i915_private *i915)
 {
 	struct file_system_type *type;
 	struct vfsmount *gemfs;
+	char opts[] = "huge=within_size";
 
 	type = get_fs_type("tmpfs");
 	if (!type)
@@ -29,7 +30,7 @@ int i915_gemfs_init(struct drm_i915_private *i915)
 	 * Currently unused due to bandwidth issues (slow reads) on Broadwell+.
 	 */
 
-	gemfs = kern_mount(type);
+	gemfs = vfs_kern_mount(type, SB_KERNMOUNT, type->name, opts);
 	if (IS_ERR(gemfs))
 		return PTR_ERR(gemfs);
P.S. In case THP turns out to be a big win for intel_iommu=on, it would be very handy to have results with and without this patch, with IOMMU off as well. Because I am told there was a performance issue on some platforms which was the reason we turned off THP.
> Patch to test with, if someone has the capability (@eero-t ? ;))
@tursulin If you provide the change as separate patch or commit, I could do some more testing for it.
Currently my test setup has only GEN9 machines and one TGL (GT1) device. I have no discrete GPUs, but I should be able to loan a DG1 for a few days.
I would start by doing testing with full 3D benchmark set on SkullCanyon (SKL GT4e) as that large CSDof (3d compute) case drop on it was especially interesting. Large drops in GPU (onscreen) writes with IOMMU would be next thing to check as it happened for all devices.
Also, I assume DG1 is not very interesting, since IOMMU would be involved much less in any bandwidth-intensive operations given the data set is supposed to be in device local memory. But GEN9 and TGL would definitely be interesting.
Patch built fine. GEN9 & TGL are better as those I can access right away.
All devices I'm testing have VT-d disabled from BIOS, so I guess I need to ask somebody at the office to enable it for the test devices (earlier, devices had "intel_iommu=igfx_off" to avoid the slowdown / instability, but that doesn't seem to be used now that all have VT-d disabled in BIOS).
Is there anything else that needs to be changed / enabled in BIOS for transparent hugepages? What about the kernel command line?
And are these options enough in kernel config?
$ grep -e IOMMU -e HUGE
kconfig/default:# CONFIG_GART_IOMMU is not set
kconfig/default:# CONFIG_CALGARY_IOMMU is not set
kconfig/default:CONFIG_IOMMU_API=y
kconfig/default:CONFIG_IOMMU_SUPPORT=y
kconfig/default:# Generic IOMMU Pagetable Support
kconfig/default:# CONFIG_IOMMU_DEBUGFS is not set
kconfig/default:# CONFIG_IOMMU_DEFAULT_PASSTHROUGH is not set
kconfig/default:CONFIG_IOMMU_IOVA=y
kconfig/default:# CONFIG_AMD_IOMMU is not set
kconfig/default:CONFIG_INTEL_IOMMU=y
kconfig/default:# CONFIG_INTEL_IOMMU_SVM is not set
kconfig/default:CONFIG_INTEL_IOMMU_DEFAULT_ON=y
kconfig/default:CONFIG_INTEL_IOMMU_FLOPPY_WA=y
...
CONFIG_CGROUP_HUGETLB=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=y
CONFIG_HAVE_ARCH_HUGE_VMAP=y
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
CONFIG_TRANSPARENT_HUGE_PAGECACHE=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
What should I see in dmesg when everything is working as expected?
For the hugepages side of things in dmesg you should see "Transparent Hugepage mode 'huge=within_size'" and absence of "Unable to create a private tmpfs mount, hugepage support will be disabled". Both come from i915.
And for the IOMMU side "i915 device info: iommu: enabled". (There's a bunch of other dmesg lines containing either iommu or dmar which appear when thing is active but that one should be sufficient indicator.)
Correct message to check for was: "[drm] VT-d active for gfx access".
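Putting the checks above together, a sketch that scans a captured log for both indicators (sample lines inlined for illustration; on a live system you'd pipe dmesg into the grep instead):

```shell
# Count how many of the two expected indicator lines are present;
# "2" means both gfx IOMMU and huge=within_size are active.
# Sample log inlined instead of live dmesg output.
log="[drm] VT-d active for gfx access
i915: Transparent Hugepage mode 'huge=within_size'"
printf '%s\n' "$log" | grep -c -E 'VT-d active for gfx access|huge=within_size'
```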
Performance drops from enabling "VT-d" & "Intel Virtualization Technology" in BIOS plus "intel_iommu=on" are as follows on SkullCanyon (SKL GT4e) with current drm-tip:
(Impacts are rough estimates because tests were not run in the same order, and each test was run only 3-5 times. I'm listing only tests whose results are known to be reasonably stable, though.)
Summary so far:
* THP (patch) significantly helps most of the tests regressing with GFX IOMMU
* There are still significant perf gaps compared to not using IOMMU
* The hugepages regression [1] might explain some of the remaining gaps, e.g. why THP did not decrease the GLB Egypt/T-Rex IOMMU perf gap at all
[1] This is a known, old hugepages issue. The assumption was that use of hugepages somehow interferes with GPU cache coloring, and as a result, workloads which are bandwidth bound (especially ones that don't quite fit into cache) can suffer from it.
Tomorrow I'll check whether removing "intel_iommu=on" has any impact on SkullCanyon perf, and check how much Unigine demos regress without the patch. I guess I also need to check whether GEN12 (TGL) also suffers from the same hugepages regression...
EDIT: added Valley/Heaven/AztecRuins perf impact, and moved BXT info to next comment with other BXT data.
The THP patch is a large improvement for IOMMU, but there's still a clear gap to non-IOMMU perf
Additional perf regressions with THP are in MemBW GPU texture (7%) and SynMark TexMem* (2-3%) tests
Note: the above changes include 2 days of gfxstack updates. Typically there are 0 perf changes, but one never knows. The results do correspond to the SkullCanyon results though, and as expected, regressions from hugepages are much smaller on Atoms (which lack LLC & eDRAM, and whose other caches are much smaller).