Created attachment 145683
TerrainFlyTess ICL error state (2019-10-07 drm-tip)
Setup:
HW: ICL-U D1
OS: Ubuntu 18.04 with Unity desktop (compiz)
SW: git versions of drm-tip 5.4-rc2 kernel, X server & Mesa
Desktop uses i965, benchmarks use Iris
Use-case:
* Run SynMark TerrainFlyTess with Iris:
MESA_LOADER_DRIVER_OVERRIDE=iris ./synmark2 OglTerrainFlyTess
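A minimal repro loop (my sketch, not part of SynMark; it assumes synmark2 is in the current directory and that the kernel log is readable):
-----------------------------------------------------------------------
#!/bin/sh
# Clear the kernel log so only new hangs show up.
sudo dmesg -C
for i in 1 2 3; do
    echo "Iteration $i/3: synmark2 OglTerrainFlyTess"
    MESA_LOADER_DRIVER_OVERRIDE=iris ./synmark2 OglTerrainFlyTess
done
# e.g. "GPU HANG: ecode 11:1:0x00000000, hang on rcs0"
dmesg | grep "GPU HANG"
-----------------------------------------------------------------------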
Expected outcome:
* Like on GEN9, no GPU hangs
Actual outcome:
* Recoverable GPU hangs, see attachment
* Reproducibility: always
Notes:
* This test-case tests CPU<->GPU synchronization by generating the terrain data on the fly in 4 CPU threads with AVX, for GPU tessellation & rendering
* No idea whether these hangs are a regression; they were already happening >2 weeks ago on drm-tip v5.3, when I first did ICL testing
Kenneth from the Mesa team already looked at the error state and commented:
"This error state makes no sense, ACTHD points at the very start of the batch and IPEHR is 0x18800101 which never appears in the error dump at all. Sounds like a kernel bug to me."
The capture shows the GPU just as it is fetching the first bytes of the batch. Either it took a page fault, or we have a novel means of dying. Note that the GPU did not send the completion event for the context switch in the previous 6s, so I'm erring on the side of novel death throes.
The machine was on loan from Jani and I need to give it back now, so unfortunately I cannot provide that. The test-case & fault should be very easy to reproduce though (pretty much the same as the HDRBloom case).
Chris, mail me directly if you don't have SynMark, or would like a pre-built, latest-git 3D user-space stack.
Unfortunately neither drm.debug nor CONFIG_LOCKDEP=y shows anything:
-----------------------------------------------------------------------
[ 151.523178] [drm:intel_combo_phy_init [i915]] Combo PHY A already enabled, won't reprogram it.
[ 151.523211] [drm:intel_combo_phy_init [i915]] Combo PHY B already enabled, won't reprogram it.
[ 153.874368] Iteration 1/3: synmark2 OglTerrainFlyTess
[ 180.033099] Iteration 2/3: synmark2 OglTerrainFlyTess
[ 187.993335] i915 0000:00:02.0: GPU HANG: ecode 11:1:0x00000000, hang on rcs0
[ 187.993338] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 187.993339] Please file a new bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 187.993340] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 187.993341] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[ 187.993342] GPU crash dump saved to /sys/class/drm/card0/error
[ 187.993406] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 208.151967] Iteration 3/3: synmark2 OglTerrainFlyTess
-----------------------------------------------------------------------
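(For completeness, the crash dump mentioned in the log can be copied aside before the next hang overwrites it; this is the standard i915 sysfs interface, though treat the clearing step as my assumption:)
-----------------------------------------------------------------------
# Save the GPU crash dump referenced in the log above.
cat /sys/class/drm/card0/error > error.log
# Writing to the file clears the captured state.
echo > /sys/class/drm/card0/error
-----------------------------------------------------------------------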
No idea whether the worse reproducibility (1/3 instead of 1/1) is due to using the latest git kernel, or to the debug options. Attached is the new error state.
With the latest drm-tip & Mesa versions, this no longer hangs on every run, maybe only on every fifth run.
Since the CSDof and HDRBloom GPU hangs with Iris continue to happen on every run of the test, and HDRBloom is easy to reproduce on other platforms too, not just ICL, I would concentrate on that (bug 111385).
After this test fails, the screen shows the last working frame from the test, but it's still possible to run 3D & Media test-cases through ssh; they just fail.
However, at some point after that, the machine freezes; the network goes down and the machine no longer reacts to any input other than the SysRq keys (see the note below).
So far, these bad recovery failures have happened only on SkullCanyon, not on KBL (I don't test ICL anymore).
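(Generic note, not specific to this machine: the SysRq keys only work if they were enabled beforehand, e.g.:)
-----------------------------------------------------------------------
# Enable all SysRq functions (many distros restrict them by default).
sudo sysctl kernel.sysrq=1
# On a freeze: Alt+SysRq+s to sync disks, Alt+SysRq+b to reboot.
-----------------------------------------------------------------------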
There were some odd recovery failures on KBL GT3e recently. Tests finished, but an "xrandr" call after them hung indefinitely. It seems that something about the GPU hangs, or about what they cause for Mesa/Iris, has messed up the X server.
(On another, almost identical CometLake machine, only the TerrainFlyTess test failed with the same SW stack build, i.e. there is some randomness in whether i915 recovery succeeds.)