Created attachment 145683
TerrainFlyTess ICL error state (2019-10-07 drm-tip)
Setup:
HW: ICL-U D1
OS: Ubuntu 18.04 with Unity desktop (compiz)
SW: git versions of drm-tip 5.4-rc2 kernel, X server & Mesa
Desktop uses i965, benchmarks use Iris
Use-case:
* Run SynMark TerrainFlyTess with Iris:
MESA_LOADER_DRIVER_OVERRIDE=iris ./synmark2 OglTerrainFlyTess
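A minimal repro loop (my sketch, not part of SynMark; it assumes synmark2 is in the current directory and that the kernel log is readable):
-----------------------------------------------------------------------
#!/bin/sh
# Clear the kernel log so only new hangs show up.
sudo dmesg -C
for i in 1 2 3; do
    echo "Iteration $i/3: synmark2 OglTerrainFlyTess"
    MESA_LOADER_DRIVER_OVERRIDE=iris ./synmark2 OglTerrainFlyTess
done
# e.g. "GPU HANG: ecode 11:1:0x00000000, hang on rcs0"
dmesg | grep "GPU HANG"
-----------------------------------------------------------------------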
Expected outcome:
* Like on GEN9, no GPU hangs
Actual outcome:
* Recoverable GPU hangs, see attachment
* Reproducibility: always
Notes:
* This test-case tests CPU<->GPU synchronization by generating the terrain data on the fly in 4 CPU threads with AVX, for GPU tessellation & rendering
* No idea whether these hangs are a regression; they were already happening >2 weeks ago on drm-tip v5.3, when I first did ICL testing
Kenneth from the Mesa team already looked at the error state and commented:
"This error state makes no sense, ACTHD points at the very start of the batch and IPEHR is 0x18800101 which never appears in the error dump at all. Sounds like a kernel bug to me."
The capture shows the GPU just as it is fetching the first bytes of the batch. Either it took a page fault, or we have a novel means of dying. Note that the GPU did not send the completion event for the context switch in the previous 6s, so I'm erring on the side of novel death throes.
The machine was on loan from Jani and I need to give it back now, so unfortunately I cannot provide that. The test-case & fault should be very easy to reproduce though (pretty much the same as the HDRBloom case).
Chris, mail me directly if you don't have SynMark, or would like a pre-built, latest-git 3D user-space stack.
Unfortunately neither drm.debug nor CONFIG_LOCKDEP=y shows anything:
-----------------------------------------------------------------------
[ 151.523178] [drm:intel_combo_phy_init [i915]] Combo PHY A already enabled, won't reprogram it.
[ 151.523211] [drm:intel_combo_phy_init [i915]] Combo PHY B already enabled, won't reprogram it.
[ 153.874368] Iteration 1/3: synmark2 OglTerrainFlyTess
[ 180.033099] Iteration 2/3: synmark2 OglTerrainFlyTess
[ 187.993335] i915 0000:00:02.0: GPU HANG: ecode 11:1:0x00000000, hang on rcs0
[ 187.993338] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 187.993339] Please file a new bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 187.993340] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 187.993341] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[ 187.993342] GPU crash dump saved to /sys/class/drm/card0/error
[ 187.993406] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 208.151967] Iteration 3/3: synmark2 OglTerrainFlyTess
-----------------------------------------------------------------------
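(For completeness, the crash dump mentioned in the log can be copied aside before the next hang overwrites it; this is the standard i915 sysfs interface, though treat the clearing step as my assumption:)
-----------------------------------------------------------------------
# Save the GPU crash dump referenced in the log above.
cat /sys/class/drm/card0/error > error.log
# Writing to the file clears the captured state.
echo > /sys/class/drm/card0/error
-----------------------------------------------------------------------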
No idea whether the worse reproducibility (1/3 instead of 1/1) is due to using the latest git kernel, or to the debug options. Attached is the new error state.
With the latest drm-tip & Mesa versions, this no longer hangs on every run, maybe only on every fifth run.
Since the CSDof and HDRBloom GPU hangs with Iris continue to happen on every run of the test, and HDRBloom is easy to reproduce on other platforms too, not just ICL, I would concentrate on that (bug 111385).
After this test fails, the screen shows the last working frame from the test, but it's still possible to run 3D & Media test-cases through ssh; they just fail.
However, at some point after that, the machine freezes; the network goes down and the machine no longer reacts to any input other than the SysRq keys (see the note below).
So far, these bad recovery failures have happened only on SkullCanyon, not on KBL (I don't test ICL anymore).
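(Generic note, not specific to this machine: the SysRq keys only work if they were enabled beforehand, e.g.:)
-----------------------------------------------------------------------
# Enable all SysRq functions (many distros restrict them by default).
sudo sysctl kernel.sysrq=1
# On a freeze: Alt+SysRq+s to sync disks, Alt+SysRq+b to reboot.
-----------------------------------------------------------------------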
There were some odd recovery failures on KBL GT3e recently. Tests finished, but an "xrandr" call after them hung indefinitely. It seems that something about the GPU hangs, or about what they cause for Mesa/Iris, has messed up the X server.
(On another, almost identical CometLake machine, only the TerrainFlyTess test failed with the same SW stack build, i.e. there is some randomness in whether i915 recovery succeeds.)