Issues in later 3D benchmarks after first GPU hang [incomplete recovery]
[ 1142.611537] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 1142.611562] i915 0000:00:02.0: heaven_x64[2087] context reset due to GPU hang
[ 1142.660890] i915 0000:00:02.0: GPU HANG: ecode 9:1:84df9ffc, in heaven_x64 [2087]
[ 1142.660893] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 1142.660894] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 1142.660894] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 1142.660895] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[ 1142.660896] GPU crash dump saved to /sys/class/drm/card0/error
Notes:
This issue doesn't show up with the Mesa i965 driver, only with the Iris one
The hang and the issues in the following 3D benchmarks happen on all HW I have: most frequently on GLK, then BXT, then on other, faster HW. No idea whether it also happens on non-GEN9 HW (I don't have any)
No idea whether this is a regression; it seems to have started a few weeks ago when Mesa switched to Iris by default, and I don't have earlier data from running Heaven with Iris on GEN9
I'm filing this against the kernel because I see some odd behavior that happens only after Heaven GPU hang recovery:
Sometimes there are bogus (impossibly high) FPS results in all the following 3D benchmarks
Recoverable GPU hang in the later-run SynMark Batch6 test on BXT, and in the GpuTest Piano test on GLK
Later-run GfxBench Manhattan tests slow down enough to hit test timeouts
Attached are GPU hang states from drm-tip v5.5-rc7:
The SynMark2 Batch6 hang after a Unigine Heaven hang sometimes happens also on GLK, so it's not BXT specific. Because there's no hang with the other Batch* tests (ones with smaller or larger draw batch sizes), I'm pretty sure the trigger is timing related.
The GpuTest Piano hang I haven't seen on BXT, but Piano is a really slow test and the GLK iGPU is slower (12 EUs) than BXT (18 EUs), so it's possible that hangcheck triggers naturally on GLK => let's ignore the Piano hang.
If you need new error states from latest drm-tip at any point, just ask. I get several new error state files every night. :-)
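For reference, here's a minimal sketch of how those nightly error states can be collected automatically. It assumes the standard /sys/class/drm/card0/error node shown in the dmesg output above; the archive directory is a hypothetical choice, and reading the node typically needs root:

#!/usr/bin/env python3
# Sketch: archive the i915 GPU error state after a hang.
# Assumes the standard /sys/class/drm/card0/error node; the archive
# directory below is a hypothetical choice.  Needs root.
import time
from pathlib import Path

ERROR_NODE = Path("/sys/class/drm/card0/error")
ARCHIVE_DIR = Path("/var/tmp/i915-error-states")  # hypothetical location

def archive_error_state() -> bool:
    data = ERROR_NODE.read_text()
    # i915 reports "No error state collected" when no hang is recorded.
    if data.startswith("No error state collected"):
        return False
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    (ARCHIVE_DIR / f"i915_error_state-{stamp}.txt").write_text(data)
    ERROR_NODE.write_text("1")  # writing to the node clears the state
    return True

if __name__ == "__main__":
    print("archived" if archive_error_state() else "no error state")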
So chief suspect is that this is just a super slow batch and being caught by the 640ms preemption timeout.
To confirm that I think we want to pull in some of Tvrtko's infrastructure to track context runtime; from which we can spit out how long this context has been active at the time of reset and with some handwaving approximate that to batch duration.
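One way to sanity-check the 640ms theory from userspace, without kernel changes: drm-tip exposes per-engine sysfs controls, so raising rcs0's preemption timeout and rerunning the benchmark should make the hang disappear if the batch is merely slow. A minimal sketch, assuming the standard /sys/class/drm/card0/engine/rcs0/preempt_timeout_ms control; run as root:

#!/usr/bin/env python3
# Sketch: read and raise the rcs0 preemption timeout via drm-tip's
# per-engine sysfs controls.  Assumes the standard path below; root needed.
from pathlib import Path

TIMEOUT = Path("/sys/class/drm/card0/engine/rcs0/preempt_timeout_ms")

current = int(TIMEOUT.read_text())
print(f"current preempt timeout: {current} ms")  # 640 ms by default

# If hangs stop with a much larger timeout, the batch was just slow, not hung.
TIMEOUT.write_text("5000")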
> So chief suspect is that this is just a super slow batch and being caught by the 640ms preemption timeout.
I haven't seen such an issue in Heaven earlier, and in the Batch6 case it would definitely need some other big kernel bug (if one of the equal-sized Batch6 draw calls were too slow, all the earlier Batch tests, starting from Batch0, should also have triggered it), e.g. in power management (Batch4, Batch5 and Batch6 are where the halved mesh size / doubled draw count causes the Batch tests to change from being more GPU bound to more CPU bound).
> To confirm that I think we want to pull in some of Tvrtko's infrastructure to track context runtime; from which we can spit out how long this context has been active at the time of reset and with some handwaving approximate that to batch duration.
From a trace tracking GPU RC6 values and ftracing SwapBuffer calls, I can see that the hang issue happens after Heaven has been running for about 100s: RC6 goes from 1-2% to zero and buffer swaps stop for a few seconds. After that point, Heaven buffer swaps happen 10x faster than they should (= CPU limit), with RC6 staying constantly at around 60% instead of zero, also on further Heaven runs.
When looking at an Iris driver trace where there's no hang: RC6 is 1-2%, with occasional spikes up to 20% (out of a 1s interval) due to shader compiles. Heaven's slowest frame is ~600ms (on BXT, with those Heaven settings) at that point, but from API call statistics for a whole Heaven run, I can see that Heaven does on average ~1600 draw calls per frame. I.e. when things are working normally, a single Heaven draw/batch being slower than 640ms doesn't seem likely without some other problem.
In Batch6, RC6 stays at 0% when there has been no hang in Heaven. If Heaven has hung before Batch6, the GPU alternates between 0% and 20%, with FPS being higher but capped by CPU speed. During hang/recovery, RC6 & FPS are naturally zero.
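For anyone wanting to reproduce that trace, here's a minimal sketch of the RC6 sampling side, assuming the standard i915 rc6_residency_ms counter; the 1s interval matches the numbers quoted above:

#!/usr/bin/env python3
# Sketch: print the percentage of each ~1s interval the GPU spends in RC6.
# Assumes the standard i915 counter below, a monotonically increasing
# millisecond value.
import time
from pathlib import Path

RC6 = Path("/sys/class/drm/card0/power/rc6_residency_ms")

prev_res, prev_t = int(RC6.read_text()), time.monotonic()
while True:
    time.sleep(1.0)
    cur_res, cur_t = int(RC6.read_text()), time.monotonic()
    # residency delta (ms) over elapsed wall time (ms) -> percentage
    pct = (cur_res - prev_res) / ((cur_t - prev_t) * 1000.0) * 100.0
    print(f"RC6: {pct:.1f}%")
    prev_res, prev_t = cur_res, cur_t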
Chris Wilson changed title from [BXT/GLK] Unigine Heaven GPU hang with Mesa Iris + issues in later 3D benchmarks to [BXT/GLK] Unigine Heaven GPU hang with Mesa Iris, then issues in later 3D benchmarks [incomplete recovery]
Eero Tamminen changed title from [BXT/GLK] Unigine Heaven GPU hang with Mesa Iris, then issues in later 3D benchmarks [incomplete recovery] to [GEN9] Unigine Heaven GPU hang with Mesa Iris, then issues in later 3D benchmarks [incomplete recovery]
I'm seeing the same issue also on SKL GT2, so it's not Atom specific but a more generic issue. Attached is an error state with last evening's Git kernel & Mesa: skl-i5-gt2-i915_error_state.txt => please add at least the SKL label
From the old data, I can see that SKL GT2 did the same thing also with the Jan 27th drm-tip kernel & Mesa, right after Mesa switched to Iris: a Unigine Heaven GPU hang followed by GfxBench Manhattan test timeouts, a SynMark Batch6 GPU hang, and bogus FPS values in other tests.
I've now enabled Iris for a slightly larger set of tests, with some variation in which Git versions of the driver stack (kernel, Mesa, Xorg, Weston) are tested together, which tests are run, and in which order.
And I'm seeing GPU hangs in these same tests on BXT, GLK, SKL GT2 and CML GT2. The only device where I'm not seeing them is the SKL "SkullCanyon" GT4e (which has just "Atomic update failure on pipe A" errors during the SynMark FillPixel & FillTexSingle tests).
=> Somebody please add SKL & CML to the platform labels!
In addition to earlier GPU hangs that happened on:
Unigine Heaven
GpuTest Piano, and
SynMark Batch6 tests
With the added larger drm-tip v5.6 + Mesa Iris test coverage, I'm now seeing GPU hangs also in the following tests:
All of our (very low FPS) internal memory bandwidth tests
Nowadays these GPU hangs can happen without Heaven hanging first, though, and GfxBench Manhattan tests no longer always fail with timeouts after Heaven has hung.
Last week, the Heaven GPU hang and the bogus (way too high) FPS values in the (windowed X) tests following it showed up also on SKL GT4e (which runs Weston instead of X, and a slightly different set of tests than the other devices).
=> How often this issue happens seems to correspond inversely to how much GPU capability the device has (possibly relative to the CPU side, not necessarily in absolute terms). I.e. it's some kind of timing bug on the kernel side.
=> The bogus FPS issue now affects all devices I have, which ruins their performance trends. Data can't be scaled reasonably when it includes insanely high bogus values at random intervals.
Because I have only GEN9 devices, I can't say whether this also affects other GENs, but I would think it very likely does, as it affects all my devices.
This issue has been present at least since January (although I noticed it only a few months ago); when is somebody going to have time to look at it? I can provide test-cases, and the SW & HW setup on which to debug it.
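Until the kernel side is fixed, the trend data can at least be protected with an outlier filter. A sketch below; the 10x-median cutoff is my assumption based on the impossibly high magnitudes described above, not anything from the i915 side:

#!/usr/bin/env python3
# Sketch: drop impossibly high FPS results from a benchmark series before
# computing performance trends.  The 10x-median cutoff is an assumption
# based on the bogus magnitudes described in this report.
from statistics import median

def drop_bogus(fps_series, factor=10.0):
    """Return the series with results above factor * median removed."""
    if not fps_series:
        return []
    cutoff = factor * median(fps_series)
    return [fps for fps in fps_series if fps <= cutoff]

# Example: one bogus post-hang result among normal ~60 FPS runs.
print(drop_bogus([58.2, 61.0, 59.7, 1450.0, 60.3]))  # drops the 1450.0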
Today's SkullCanyon issue looks a bit worse (it hadn't been running for a while, so I don't know how long the extra wakeup issues have been happening):
After moving Heaven to be run last, the first 3D test to trigger GPU hangs on SKL GT2 now seems to be GpuTest Tessellation:
[ 3388.626440] Iteration 3/3: GpuTest /test=tess_x64 /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 3423.118838] i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
[ 3423.118863] i915 0000:00:02.0: [drm] GpuTest[4687] context reset due to GPU hang
[ 3423.126104] i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:85df1ecf, in GpuTest [4687]
[ 3423.126104] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 3423.126105] Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/intel/issues/new.
[ 3423.126105] Please see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
[ 3423.126105] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 3423.126106] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[ 3423.126106] GPU crash dump saved to /sys/class/drm/card0/error
GPU hang / recovery results in bogus FPS for all the following 3D tests, i.e. it's incomplete recovery, same as with Heaven. However, this should be easier to debug than the very complex Heaven test, because it's a simple synthetic test.
See the attached error info from two different drm-tip kernel versions (one from early this month, one from yesterday). Both hangs happen with the same (nearly month-old) Git versions of Mesa and the X server:
On GLK, where hangs with incomplete recovery happen almost daily, it's still being triggered by GfxBench CarChase. Both CarChase and Heaven use tessellation, so there's some possibility that the kernel's incomplete GPU recovery with Iris is related to some difference in how the Mesa i965 & Iris drivers handle tessellation.
Eero Tamminen changed title from [GEN9] Unigine Heaven GPU hang with Mesa Iris, then issues in later 3D benchmarks [incomplete recovery] to [GEN9] GPU hang in tessellating 3D benchmark with Mesa Iris, then issues in later 3D benchmarks [incomplete recovery]
With GpuTest running before GfxBench, incomplete recovery already happens with the GPU hang in GpuTest Piano (which runs before the tessellation test and doesn't have any tessellation shaders). I.e. this wasn't related to tessellation shaders after all.
And it's run like this:
./GpuTest /test=pixmark_piano /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
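A sketch of how such a run can be flagged automatically, by counting the "GPU HANG" dmesg markers shown in the logs above before and after the run; reading the kernel log may need root depending on kernel.dmesg_restrict:

#!/usr/bin/env python3
# Sketch: run one GpuTest iteration and flag it if the kernel logged a
# GPU hang during the run (counts the "GPU HANG" marker seen in dmesg).
import subprocess

CMD = ["./GpuTest", "/test=pixmark_piano", "/width=1366", "/height=768",
       "/msaa=1", "/no_scorebox", "/benchmark",
       "/benchmark_duration_ms=35000"]

def hang_count() -> int:
    out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    return out.count("GPU HANG")

before = hang_count()
subprocess.run(CMD, check=True)
if hang_count() > before:
    print("GPU hang during run: all results from here on are suspect")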
EDIT: moving the GpuTest Tessellation test before the Piano one still doesn't trigger a GPU hang with it. I.e. the Tessellation test triggers the GPU hang only on SKL GT2, and Piano only on GLK. It seems to be just timing dependent?
Eero Tamminen changed title from [GEN9] GPU hang in tessellating 3D benchmark with Mesa Iris, then issues in later 3D benchmarks [incomplete recovery] to [GEN9] issues in later 3D benchmarks after first GPU hang [incomplete recovery]
Drm-tip has been so broken since the weekend that I don't have much data for the last few days:
Suspend Oopses on all machines
X fails to start (especially on Intel HadesCanyon, i.e. AMDGPU)
And before that there were many days when Mesa was crashing X.
Anyway, since moving Heaven to be run as the last 3D test (in June), the situation has been as follows:
CML-H GT2: only one hang in July
BXT J4205 with ClearLinux + Weston: last hangs in July
BXT J4205 with Ubuntu 20.04 + X/compiz: last hangs in July
SKL-i5 GT2 (6600K): still hangs every other week on average
SKL-i7 GT4e (6770HQ): still hangs weekly on average
GLK J4005: when Heaven is run, still hangs every day, otherwise every 2nd-4th day
=> GLK is very much broken, SKL somewhat broken; BXT & CML appear to work nowadays (but that could be just good luck with timings)
As already stated (at least in the GLK case), incomplete recovery doesn't require the complex & slow Heaven benchmark; it already happens with the GpuTest v0.70 Piano benchmark:
[ 566.353886] Iteration 1/3: GpuTest /test=pixmark_piano /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 569.936547] i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
[ 569.936575] i915 0000:00:02.0: [drm] GpuTest[2518] context reset due to GPU hang
[ 569.949914] i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:85dffdfb, in GpuTest [2518]
[ 569.949918] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 569.949919] Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/intel/issues/new.
[ 569.949920] Please see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
[ 569.949921] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 569.949922] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[ 569.949923] GPU crash dump saved to /sys/class/drm/card0/error
[ 606.956243] Iteration 2/3: GpuTest /test=pixmark_piano /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 609.104474] i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
[ 609.104504] i915 0000:00:02.0: [drm] GpuTest[2585] context reset due to GPU hang
[ 609.118963] i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:85dffdfb, in GpuTest [2585]
[ 647.559004] Iteration 3/3: GpuTest /test=pixmark_piano /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
After the GPU hangs, the rest of the 3D benchmark results are all completely bogus. From the Piano results, I can see that the first GPU hang is enough to mess up i915, same as with Heaven.
On machines other than GLK (which, with its measly 12 EUs, is slower than the others), GPU hangs seem nowadays to require running Heaven. Attached are Heaven hang error states for:
Note that my testing is done with Git versions of Mesa and the X server. This matters because this is an old bug: e.g. Ubuntu 20.10 has a new enough version of Mesa, but even the latest X server release doesn't enable DMA-buf / modifier (i.e. end-to-end render buffer compression) support by default, although it's been enabled in X server Git for a few years already.
One more thing that could affect this is the powersave/ondemand governor. I don't see this issue (on BXT) with ClearLinux user-space, which uses the performance governor instead of the Ubuntu default powersave one.
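To check that theory on the Ubuntu setups, one can temporarily switch all CPUs to the performance governor; a sketch using the standard cpufreq sysfs interface, run as root:

#!/usr/bin/env python3
# Sketch: switch every CPU to the performance governor to test whether the
# hangs correlate with powersave/ondemand.  Standard cpufreq sysfs; root.
from pathlib import Path

for gov in sorted(Path("/sys/devices/system/cpu").glob(
        "cpu[0-9]*/cpufreq/scaling_governor")):
    cpu = gov.parent.parent.name
    print(f"{cpu}: was {gov.read_text().strip()}")
    gov.write_text("performance")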
I thought that this no longer happens on all GEN9 HW (just occasionally on SKL GT2 / GT4e, and every time on GLK & BXT), but a few days ago there was the first CML-H Heaven hang in half a year (with drm-tip v5.11.0-rc3): i915_error_state-heaven-cml-h.txt
PS. There isn't any indication that this is GEN9 specific, I just don't have any other HW for these tests.
For the last few days I've been running TGL, and I've seen a somewhat similar kind of issue on it too after GPU hangs (#3170 (moved) and #3171 (closed)). I.e. this doesn't seem to be GEN9 specific, but could be a generic issue.
The odd thing is that several tests after the hang can go fine, but a few later tests can have totally bogus FPS values (>10x what they should be).
After a TGL Manhattan 3.1 hang, a few tests following it will have completely bogus FPS: the GfxBench Tessellation & T-Rex benchmarks, and often also the (GLES/EGL/X) GLB Egypt benchmark run directly after them. The issue happens both with the GL/GLX/X and GLES/EGL/Wayland versions of GfxBench.
The rest of the GLB tests don't seem to be impacted by the issue on TGL, and neither are the (GL/GLX/X) GpuTest tests (or e.g. the Shoc compute tests). Unigine and SynMark tests are impacted by it, but only the GL/GLX/X version of SynMark, not the GLES/EGL/Wayland one.
Of the SynMark tests, the following ones don't seem to be impacted by the bogus FPS (incomplete recovery) issue on TGL:
Batch1 (but the other GPU bound Batch[023] tests are)
Batch[567] (CPU bound)
CSCloth
GSCloth (CPU bound?)
DrvState (CPU bound?)
DrvShComp (CPU bound)
DeferredAA (but Deferred is)
FillTexSingle (but FillPixel is)
GeomPoint
PSPhong (but other PS* tests are)
TexMem128 (but TexMem512 is)
Because the above list doesn't show any logic, and it seems random when the rest of the tests exhibit the bogus FPS issue, I think its triggering is just timing dependent.
@ickle, any idea why/how a GPU hang could start triggering such a thing, and why it would not happen to all processes? Could it be e.g. some kind of race condition when a context is initialized?
Note: sometimes a similar issue is triggered for SynMark tests without any GfxBench tests being run first, but the point at which it happens seems somewhat random, so I haven't been able to pinpoint (from perf data) which SynMark tests are triggering it.
Eero Tamminen changed title from [GEN9] issues in later 3D benchmarks after first GPU hang [incomplete recovery] to Issues in later 3D benchmarks after first GPU hang [incomplete recovery]
Most of the GPU hangs in SynMark DrvState on TGL-H (GT1) are not properly recovered: either the following tests also fail, or some of them report completely bogus FPS.
Attached is one of the latter (Mesa from a few days ago, DrvState run under Xwayland, drm-tip kernel v5.11): i915_error_state-tgl-drvstate.txt
It may have gone away on TGL (but not on BXT/GLK) with the latest drm-tip kernel; let's see.
I think this issue was fixed in drm-tip around the 20th of July.
I haven't seen this issue on platforms other than GLK for a while, and with the drm-tip of that day, on GLK:
The number of GPU hangs increased sharply: now about every run of Unigine Heaven (3D), GpuTest Piano (3D) or Shoc (CL) causes a GPU hang message in dmesg
Runs with a large number of these benchmarks complete faster (they no longer hit the timeouts I've set)
Benchmarks stopped reporting ridiculously high FPS values (10-100x larger than is realistic)
Note: getting more GPU hangs is not an issue, as those affect only individual tests. The last one was a huge problem though, as hang recovery issues completely ruin performance tracking for everything.