Issues in later 3D benchmarks after first GPU hang [incomplete recovery]
[ 1142.611537] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 1142.611562] i915 0000:00:02.0: heaven_x64[2087] context reset due to GPU hang
[ 1142.660890] i915 0000:00:02.0: GPU HANG: ecode 9:1:84df9ffc, in heaven_x64 [2087]
[ 1142.660893] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 1142.660894] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 1142.660894] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 1142.660895] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[ 1142.660896] GPU crash dump saved to /sys/class/drm/card0/error
Notes:
This issue doesn't show up with the Mesa i965 driver, only with the Iris one
The hang and the issues in the following 3D benchmarks happen on all HW I have: most frequently on GLK, then BXT, then on other, faster HW. No idea whether it also happens on non-GEN9 HW (I don't have any)
No idea whether this is a regression; it seems to have started a few weeks ago when Mesa switched to Iris by default, and I don't have earlier data from running Heaven with Iris on GEN9
I'm filing this against the kernel because I see some odd behavior that happens only after Heaven GPU hang recovery:
Sometimes there are bogus (impossibly high) FPS results in all the following 3D benchmarks
Recoverable GPU hang in the later-run SynMark Batch6 test on BXT, and in the GpuTest Piano test on GLK
Later-run GfxBench Manhattan tests slow down enough to hit test timeouts
Attached are GPU hang states from drm-tip v5.5-rc7:
The SynMark2 Batch6 hang after a Unigine Heaven hang sometimes happens also on GLK, so it's not BXT specific. Because there's no hang with the other Batch* tests (ones with smaller or larger draw batch sizes), I'm pretty sure the trigger is timing related.
The GpuTest Piano hang I haven't seen on BXT, but Piano is a really slow test and the GLK iGPU is slower (12 EUs) than BXT (18 EUs), so it's possible that hangcheck triggers naturally on GLK => let's ignore the Piano hang.
If you need new error states from latest drm-tip at any point, just ask. I get several new error state files every night. :-)
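For reference, here's a minimal sketch of how those nightly error states can be collected automatically. It assumes the standard /sys/class/drm/card0/error node shown in the dmesg output above; the archive directory is a hypothetical choice, and reading the node typically needs root:

#!/usr/bin/env python3
# Sketch: archive the i915 GPU error state after a hang.
# Assumes the standard /sys/class/drm/card0/error node; the archive
# directory below is a hypothetical choice.  Needs root.
import time
from pathlib import Path

ERROR_NODE = Path("/sys/class/drm/card0/error")
ARCHIVE_DIR = Path("/var/tmp/i915-error-states")  # hypothetical location

def archive_error_state() -> bool:
    data = ERROR_NODE.read_text()
    # i915 reports "No error state collected" when no hang is recorded.
    if data.startswith("No error state collected"):
        return False
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    (ARCHIVE_DIR / f"i915_error_state-{stamp}.txt").write_text(data)
    ERROR_NODE.write_text("1")  # writing to the node clears the state
    return True

if __name__ == "__main__":
    print("archived" if archive_error_state() else "no error state")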
So chief suspect is that this is just a super slow batch and being caught by the 640ms preemption timeout.
To confirm that I think we want to pull in some of Tvrtko's infrastructure to track context runtime; from which we can spit out how long this context has been active at the time of reset and with some handwaving approximate that to batch duration.
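One way to sanity-check the 640ms theory from userspace, without kernel changes: drm-tip exposes per-engine sysfs controls, so raising rcs0's preemption timeout and rerunning the benchmark should make the hang disappear if the batch is merely slow. A minimal sketch, assuming the standard /sys/class/drm/card0/engine/rcs0/preempt_timeout_ms control; run as root:

#!/usr/bin/env python3
# Sketch: read and raise the rcs0 preemption timeout via drm-tip's
# per-engine sysfs controls.  Assumes the standard path below; root needed.
from pathlib import Path

TIMEOUT = Path("/sys/class/drm/card0/engine/rcs0/preempt_timeout_ms")

current = int(TIMEOUT.read_text())
print(f"current preempt timeout: {current} ms")  # 640 ms by default

# If hangs stop with a much larger timeout, the batch was just slow, not hung.
TIMEOUT.write_text("5000")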
> So chief suspect is that this is just a super slow batch and being caught by the 640ms preemption timeout.
I haven't seen such an issue in Heaven earlier, and in the Batch6 case it would definitely need some other big kernel bug (if one of the equal-sized Batch6 draw calls were too slow, all the earlier Batch tests, starting from Batch0, should also have triggered it), e.g. in power management (Batch4, Batch5 and Batch6 are where the halved mesh size / doubled draw count causes the Batch tests to change from being more GPU bound to more CPU bound).
> To confirm that I think we want to pull in some of Tvrtko's infrastructure to track context runtime; from which we can spit out how long this context has been active at the time of reset and with some handwaving approximate that to batch duration.
From a trace tracking GPU RC6 values and ftracing SwapBuffer calls, I can see that the hang issue happens after Heaven has been running for about 100s: RC6 goes from 1-2% to zero and buffer swaps stop for a few seconds. After that point, Heaven buffer swaps happen 10x faster than they should (= CPU limit), with RC6 staying constantly at around 60% instead of zero, also on further Heaven runs.
When looking at an Iris driver trace where there's no hang: RC6 is 1-2%, with occasional spikes up to 20% (out of a 1s interval) due to shader compiles. Heaven's slowest frame is ~600ms (on BXT, with those Heaven settings) at that point, but from API call statistics for a whole Heaven run, I can see that Heaven does on average ~1600 draw calls per frame. I.e. when things are working normally, a single Heaven draw/batch being slower than 640ms doesn't seem likely without some other problem.
In Batch6, RC6 stays at 0% when there has been no hang in Heaven. If Heaven has hung before Batch6, the GPU alternates between 0% and 20%, with FPS being higher but capped by CPU speed. During hang/recovery, RC6 & FPS are naturally zero.
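For anyone wanting to reproduce that trace, here's a minimal sketch of the RC6 sampling side, assuming the standard i915 rc6_residency_ms counter; the 1s interval matches the numbers quoted above:

#!/usr/bin/env python3
# Sketch: print the percentage of each ~1s interval the GPU spends in RC6.
# Assumes the standard i915 counter below, a monotonically increasing
# millisecond value.
import time
from pathlib import Path

RC6 = Path("/sys/class/drm/card0/power/rc6_residency_ms")

prev_res, prev_t = int(RC6.read_text()), time.monotonic()
while True:
    time.sleep(1.0)
    cur_res, cur_t = int(RC6.read_text()), time.monotonic()
    # residency delta (ms) over elapsed wall time (ms) -> percentage
    pct = (cur_res - prev_res) / ((cur_t - prev_t) * 1000.0) * 100.0
    print(f"RC6: {pct:.1f}%")
    prev_res, prev_t = cur_res, cur_t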
Chris Wilson changed title from [BXT/GLK] Unigine Heaven GPU hang with Mesa Iris + issues in later 3D benchmarks to [BXT/GLK] Unigine Heaven GPU hang with Mesa Iris, then issues in later 3D benchmarks [incomplete recovery]
Eero Tamminen changed title from [BXT/GLK] Unigine Heaven GPU hang with Mesa Iris, then issues in later 3D benchmarks [incomplete recovery] to [GEN9] Unigine Heaven GPU hang with Mesa Iris, then issues in later 3D benchmarks [incomplete recovery]
I'm seeing the same issue also on SKL GT2, so it's not Atom specific but a more generic issue. Attached is an error state with last evening's Git kernel & Mesa: skl-i5-gt2-i915_error_state.txt => please add at least the SKL label
From the old data, I can see that SKL GT2 did the same thing also with the Jan 27th drm-tip kernel & Mesa, right after Mesa switched to Iris: a Unigine Heaven GPU hang followed by GfxBench Manhattan test timeouts, a SynMark Batch6 GPU hang, and bogus FPS values in other tests.
I've now enabled Iris for a slightly larger set of tests, with some variation in which Git versions of the driver stack (kernel, Mesa, Xorg, Weston) are tested together, which tests are run, and in which order.
And I'm seeing GPU hangs in these same tests on BXT, GLK, SKL GT2 and CML GT2. The only device where I'm not seeing them is the SKL "SkullCanyon" GT4e (which has just "Atomic update failure on pipe A" errors during the SynMark FillPixel & FillTexSingle tests).
=> Somebody please add SKL & CML to the platform labels!
In addition to earlier GPU hangs that happened on:
Unigine Heaven
GpuTest Piano, and
SynMark Batch6 tests
With the added larger drm-tip v5.6 + Mesa Iris test coverage, I'm now seeing GPU hangs also in the following tests:
All of our (very low FPS) internal memory bandwidth tests
Nowadays these GPU hangs can happen without Heaven hanging first, though, and GfxBench Manhattan tests no longer always fail with timeouts after Heaven has hung.
Last week, the Heaven GPU hang and the bogus (way too high) FPS values in the (windowed X) tests following it showed up also on SKL GT4e (which runs Weston instead of X, and a slightly different set of tests than the other devices).
=> How often this issue happens seems to correspond inversely to how much GPU capability the device has (possibly relative to the CPU side, not necessarily in absolute terms). I.e. it's some kind of timing bug on the kernel side.
=> The bogus FPS issue now affects all devices I have, which ruins their performance trends. Data can't be scaled reasonably when it includes insanely high bogus values at random intervals.
Because I have only GEN9 devices, I can't say whether this also affects other GENs, but I would think it very likely does, as it affects all my devices.
This issue has been present at least since January (although I noticed it only a few months ago); when is somebody going to have time to look at it? I can provide test-cases, and the SW & HW setup on which to debug it.
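Until the kernel side is fixed, the trend data can at least be protected with an outlier filter. A sketch below; the 10x-median cutoff is my assumption based on the impossibly high magnitudes described above, not anything from the i915 side:

#!/usr/bin/env python3
# Sketch: drop impossibly high FPS results from a benchmark series before
# computing performance trends.  The 10x-median cutoff is an assumption
# based on the bogus magnitudes described in this report.
from statistics import median

def drop_bogus(fps_series, factor=10.0):
    """Return the series with results above factor * median removed."""
    if not fps_series:
        return []
    cutoff = factor * median(fps_series)
    return [fps for fps in fps_series if fps <= cutoff]

# Example: one bogus post-hang result among normal ~60 FPS runs.
print(drop_bogus([58.2, 61.0, 59.7, 1450.0, 60.3]))  # drops the 1450.0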
Today's SkullCanyon issue looks a bit worse (it hadn't been running for a while, so I don't know how long the extra wakeup issues have been happening):
After moving Heaven to be run last, the first 3D test to trigger GPU hangs on SKL GT2 now seems to be GpuTest Tessellation:
[ 3388.626440] Iteration 3/3: GpuTest /test=tess_x64 /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 3423.118838] i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
[ 3423.118863] i915 0000:00:02.0: [drm] GpuTest[4687] context reset due to GPU hang
[ 3423.126104] i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:85df1ecf, in GpuTest [4687]
[ 3423.126104] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 3423.126105] Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/intel/issues/new.
[ 3423.126105] Please see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
[ 3423.126105] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 3423.126106] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[ 3423.126106] GPU crash dump saved to /sys/class/drm/card0/error
GPU hang / recovery results in bogus FPS for all the following 3D tests, i.e. it's incomplete recovery, same as with Heaven. However, this should be easier to debug than the very complex Heaven test, because it's a simple synthetic test.
See the attached error info from two different drm-tip kernel versions (one from early this month, one from yesterday). Both hangs happen with the same (nearly month-old) Git versions of Mesa and the X server:
On GLK, where hangs with incomplete recovery happen almost daily, it's still being triggered by GfxBench CarChase. Both CarChase and Heaven use tessellation, so there's some possibility that the kernel's incomplete GPU recovery with Iris is related to some difference in how the Mesa i965 & Iris drivers handle tessellation.
Eero Tamminen changed title from [GEN9] Unigine Heaven GPU hang with Mesa Iris, then issues in later 3D benchmarks [incomplete recovery] to [GEN9] GPU hang in tessellating 3D benchmark with Mesa Iris, then issues in later 3D benchmarks [incomplete recovery]
With GpuTest running before GfxBench, incomplete recovery already happens with the GPU hang in GpuTest Piano (which runs before the tessellation test and doesn't have any tessellation shaders). I.e. this wasn't related to tessellation shaders after all.
And it's run like this:
./GpuTest /test=pixmark_piano /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
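A sketch of how such a run can be flagged automatically, by counting the "GPU HANG" dmesg markers shown in the logs above before and after the run; reading the kernel log may need root depending on kernel.dmesg_restrict:

#!/usr/bin/env python3
# Sketch: run one GpuTest iteration and flag it if the kernel logged a
# GPU hang during the run (counts the "GPU HANG" marker seen in dmesg).
import subprocess

CMD = ["./GpuTest", "/test=pixmark_piano", "/width=1366", "/height=768",
       "/msaa=1", "/no_scorebox", "/benchmark",
       "/benchmark_duration_ms=35000"]

def hang_count() -> int:
    out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    return out.count("GPU HANG")

before = hang_count()
subprocess.run(CMD, check=True)
if hang_count() > before:
    print("GPU hang during run: all results from here on are suspect")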
EDIT: moving the GpuTest Tessellation test before the Piano one still doesn't trigger a GPU hang with it. I.e. the Tessellation test triggers the GPU hang only on SKL GT2, and Piano only on GLK. It seems to be just timing dependent?
Eero Tamminen changed title from [GEN9] GPU hang in tessellating 3D benchmark with Mesa Iris, then issues in later 3D benchmarks [incomplete recovery] to [GEN9] issues in later 3D benchmarks after first GPU hang [incomplete recovery]
Drm-tip has been so broken since the weekend that I don't have much data for the last few days:
Suspend Oopses on all machines
X fails to start (especially on Intel HadesCanyon, i.e. AMDGPU)
And before that there were many days when Mesa was crashing X.
Anyway, since moving Heaven to be run as the last 3D test (in June), the situation has been as follows:
CML-H GT2: only one hang in July
BXT J4205 with ClearLinux + Weston: last hangs in July
BXT J4205 with Ubuntu 20.04 + X/compiz: last hangs in July
SKL-i5 GT2 (6600K): still hangs every other week on average
SKL-i7 GT4e (6770HQ): still hangs weekly on average
GLK J4005: when Heaven is run, still hangs every day, otherwise every 2nd-4th day
=> GLK is very much broken, SKL somewhat broken; BXT & CML appear to work nowadays (but that could be just good luck with timings)
As already stated (at least in the GLK case), incomplete recovery doesn't require the complex & slow Heaven benchmark; it already happens with the GpuTest v0.70 Piano benchmark:
[ 566.353886] Iteration 1/3: GpuTest /test=pixmark_piano /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 569.936547] i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
[ 569.936575] i915 0000:00:02.0: [drm] GpuTest[2518] context reset due to GPU hang
[ 569.949914] i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:85dffdfb, in GpuTest [2518]
[ 569.949918] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 569.949919] Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/intel/issues/new.
[ 569.949920] Please see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
[ 569.949921] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 569.949922] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[ 569.949923] GPU crash dump saved to /sys/class/drm/card0/error
[ 606.956243] Iteration 2/3: GpuTest /test=pixmark_piano /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 609.104474] i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
[ 609.104504] i915 0000:00:02.0: [drm] GpuTest[2585] context reset due to GPU hang
[ 609.118963] i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:85dffdfb, in GpuTest [2585]
[ 647.559004] Iteration 3/3: GpuTest /test=pixmark_piano /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
After the GPU hangs, the rest of the 3D benchmark results are all completely bogus. From the Piano results, I can see that the first GPU hang is enough to mess up i915, same as with Heaven.
On machines other than GLK (which, with its measly 12 EUs, is slower than the others), GPU hangs seem nowadays to require running Heaven. Attached are Heaven hang error states for:
Note that my testing is done with Git versions of Mesa and the X server. This matters because this is an old bug: e.g. Ubuntu 20.10 has a new enough version of Mesa, but even the latest X server release doesn't enable DMA-buf / modifier (i.e. end-to-end render buffer compression) support by default, although it's been enabled in X server Git for a few years already.
One more thing that could affect this is the powersave/ondemand governor. I don't see this issue (on BXT) with ClearLinux user-space, which uses the performance governor instead of the Ubuntu default powersave one.
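To check that theory on the Ubuntu setups, one can temporarily switch all CPUs to the performance governor; a sketch using the standard cpufreq sysfs interface, run as root:

#!/usr/bin/env python3
# Sketch: switch every CPU to the performance governor to test whether the
# hangs correlate with powersave/ondemand.  Standard cpufreq sysfs; root.
from pathlib import Path

for gov in sorted(Path("/sys/devices/system/cpu").glob(
        "cpu[0-9]*/cpufreq/scaling_governor")):
    cpu = gov.parent.parent.name
    print(f"{cpu}: was {gov.read_text().strip()}")
    gov.write_text("performance")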
I thought that this no longer happens on all GEN9 HW (just occasionally on SKL GT2 / GT4e, and every time on GLK & BXT), but a few days ago there was the first CML-H Heaven hang in half a year (with drm-tip v5.11.0-rc3): i915_error_state-heaven-cml-h.txt
PS. There isn't any indication that this is GEN9 specific, I just don't have any other HW for these tests.
For the last few days I've been running TGL, and I've seen a somewhat similar kind of issue on it too after GPU hangs (#3170 (moved) and #3171 (closed)). I.e. this doesn't seem to be GEN9 specific, but could be a generic issue.
The odd thing is that several tests after the hang can go fine, but a few later tests can have totally bogus FPS values (>10x what they should be).
After a TGL Manhattan 3.1 hang, a few tests following it will have completely bogus FPS: the GfxBench Tessellation & T-Rex benchmarks, and often also the (GLES/EGL/X) GLB Egypt benchmark run directly after them. The issue happens both with the GL/GLX/X and GLES/EGL/Wayland versions of GfxBench.
The rest of the GLB tests don't seem to be impacted by the issue on TGL, and neither are the (GL/GLX/X) GpuTest tests (or e.g. the Shoc compute tests). Unigine and SynMark tests are impacted by it, but only the GL/GLX/X version of SynMark, not the GLES/EGL/Wayland one.
Of the SynMark tests, the following ones don't seem to be impacted by the bogus FPS (incomplete recovery) issue on TGL:
Batch1 (but the other GPU bound Batch[023] tests are)
Batch[567] (CPU bound)
CSCloth
GSCloth (CPU bound?)
DrvState (CPU bound?)
DrvShComp (CPU bound)
DeferredAA (but Deferred is)
FillTexSingle (but FillPixel is)
GeomPoint
PSPhong (but other PS* tests are)
TexMem128 (but TexMem512 is)
Because the above list doesn't show any logic, and it seems random when the rest of the tests exhibit the bogus FPS issue, I think its triggering is just timing dependent.
@ickle, any idea why/how a GPU hang could start triggering such a thing, and why it would not happen to all processes? Could it be e.g. some kind of race condition when a context is initialized?
Note: sometimes a similar issue is triggered for SynMark tests without any GfxBench tests being run first, but the point at which it happens seems somewhat random, so I haven't been able to pinpoint (from perf data) which SynMark tests are triggering it.
Eero Tamminen changed title from [GEN9] issues in later 3D benchmarks after first GPU hang [incomplete recovery] to Issues in later 3D benchmarks after first GPU hang [incomplete recovery]
Most of the GPU hangs in SynMark DrvState on TGL-H (GT1) are not properly recovered: either the following tests also fail, or some of them report completely bogus FPS.
Attached is one of the latter (Mesa from a few days ago, DrvState run under Xwayland, drm-tip kernel v5.11): i915_error_state-tgl-drvstate.txt
It may have gone away on TGL (but not on BXT/GLK) with the latest drm-tip kernel; let's see.
I think this issue was fixed in drm-tip around the 20th of July.
I haven't seen this issue on platforms other than GLK for a while, and with the drm-tip of that day, on GLK:
The number of GPU hangs increased sharply: now about every run of Unigine Heaven (3D), GpuTest Piano (3D) or Shoc (CL) causes a GPU hang message in dmesg
Runs with a large number of these benchmarks complete faster (they no longer hit the timeouts I've set)
Benchmarks stopped reporting ridiculously high FPS values (10-100x larger than is realistic)
Note: getting more GPU hangs is not an issue, as those affect only individual tests. The last one was a huge problem though, as hang recovery issues completely ruin performance tracking for everything.