GT0: GUC: Engine reset failed on 0:0 (rcs0) because 0x00000000
This GPU crash happened after ~42h of uptime on a ThinkPad T14 Gen 4 (Intel). Backstory: the GPU previously hung several times with "GuC firmware i915/adlp_guc_70.bin version 70.13.1" (after between 1 and 3 days of running), but in those cases the GPU didn't crash and was just stuck (blocked task in intel_pipe_update_end); the rest of the system kept running fine. I downgraded to "GuC firmware i915/adlp_guc_70.bin version 70.5.1" and the system ran fine for over 60 days (then the GPU locked up and the whole system hung).
I have now rebooted with 70.13.1 and was finally able to capture a crash dump on kernel 6.7.4.
I was not doing anything special; xscreensaver was running a GL demo, so OpenGL was in active use.
The machine is a ThinkPad T14 Gen 4 PF4NDRMJ (Intel) running a stock Void Linux x86_64 kernel (equivalent to vanilla kernel.org):
% uname -a
Linux hera 6.7.4_1 #1 SMP PREEMPT_DYNAMIC Wed Feb 7 19:24:35 UTC 2024 x86_64 GNU/Linux
% lspci -vnn -d :*:0300
00:02.0 VGA compatible controller [0300]: Intel Corporation Raptor Lake-P [Iris Xe Graphics] [8086:a7a1] (rev 04) (prog-if 00 [VGA controller])
Subsystem: Lenovo Device [17aa:230e]
Flags: bus master, fast devsel, latency 0, IRQ 141, IOMMU group 0
Memory at 603c000000 (64-bit, non-prefetchable) [size=16M]
Memory at 4000000000 (64-bit, prefetchable) [size=256M]
I/O ports at 2000 [size=64]
Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit-
Capabilities: [d0] Power Management version 2
Capabilities: [100] Process Address Space ID (PASID)
Capabilities: [200] Address Translation Service (ATS)
Capabilities: [300] Page Request Interface (PRI)
Capabilities: [320] Single Root I/O Virtualization (SR-IOV)
Kernel driver in use: i915
Kernel modules: i915
Relevant part of dmesg from the crash:
[151055.398471] i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: Engine reset failed on 0:0 (rcs0) because 0x00000000
[151055.435102] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:84dffffb, in gibson:gdrv0 [18384]
[151055.435106] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[151055.435112] Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/intel/issues/new.
[151055.435112] Please see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
[151055.435112] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[151055.435113] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[151055.435113] GPU crash dump saved to /sys/class/drm/card0/error
[151055.435248] i915 0000:00:02.0: [drm] GT0: Resetting chip for GuC failed to reset engine mask=0x1
[151055.537453] i915 0000:00:02.0: [drm] *ERROR* GT0: rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[151055.538165] i915 0000:00:02.0: [drm] *ERROR* GT0: rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[151055.538306] i915 0000:00:02.0: [drm] gibson:gdrv0[18384] context reset due to GPU hang
[151055.538366] i915 0000:00:02.0: [drm] GT0: GuC firmware i915/adlp_guc_70.bin version 70.13.1
[151055.538369] i915 0000:00:02.0: [drm] GT0: HuC firmware i915/tgl_huc.bin version 7.9.3
[151055.556102] i915 0000:00:02.0: [drm] GT0: HuC: authenticated for all workloads
[151055.556521] i915 0000:00:02.0: [drm] GT0: GUC: submission enabled
[151055.556522] i915 0000:00:02.0: [drm] GT0: GUC: SLPC enabled
[151069.225354] i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: Engine reset failed on 0:0 (rcs0) because 0x00000000
[151069.267220] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:84dffffb, in gibson:gdrv0 [18384]
[151069.267304] i915 0000:00:02.0: [drm] GT0: Resetting chip for GuC failed to reset engine mask=0x1
[151069.370417] i915 0000:00:02.0: [drm] *ERROR* GT0: rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[151069.371161] i915 0000:00:02.0: [drm] *ERROR* GT0: rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[151069.371317] i915 0000:00:02.0: [drm] gibson:gdrv0[18384] context reset due to GPU hang
[151069.371417] i915 0000:00:02.0: [drm] GT0: GuC firmware i915/adlp_guc_70.bin version 70.13.1
[151069.371425] i915 0000:00:02.0: [drm] GT0: HuC firmware i915/tgl_huc.bin version 7.9.3
[151069.389993] i915 0000:00:02.0: [drm] GT0: HuC: authenticated for all workloads
[151069.391004] i915 0000:00:02.0: [drm] GT0: GUC: submission enabled
[151069.391016] i915 0000:00:02.0: [drm] GT0: GUC: SLPC enabled
[151078.588765] i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: Engine reset failed on 0:0 (rcs0) because 0x00000000
[151078.625076] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:84dffffb, in gibson:gdrv0 [18384]
[151078.625335] i915 0000:00:02.0: [drm] GT0: Resetting chip for GuC failed to reset engine mask=0x1
[151078.727973] i915 0000:00:02.0: [drm] *ERROR* GT0: rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[151078.728701] i915 0000:00:02.0: [drm] *ERROR* GT0: rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[151078.728893] i915 0000:00:02.0: [drm] gibson:gdrv0[18384] context reset due to GPU hang
[151078.729013] i915 0000:00:02.0: [drm] GT0: GuC firmware i915/adlp_guc_70.bin version 70.13.1
[151078.729026] i915 0000:00:02.0: [drm] GT0: HuC firmware i915/tgl_huc.bin version 7.9.3
[151078.747763] i915 0000:00:02.0: [drm] GT0: HuC: authenticated for all workloads
[151078.748189] i915 0000:00:02.0: [drm] GT0: GUC: submission enabled
[151078.748193] i915 0000:00:02.0: [drm] GT0: GUC: SLPC enabled
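For anyone triaging this, here is a minimal Python sketch that pulls the "Engine reset failed" timestamps out of a dmesg excerpt and reports the uptime at the first failure and the gaps between repeats. The function names and the regular expression are my own illustration, not part of any i915 tooling, and it assumes the default dmesg timestamp format (seconds since boot in square brackets).

```python
import re

# Matches the kernel timestamp and the GuC engine-reset failure message.
# Assumes the default "[seconds.micros]" dmesg prefix.
RESET_FAILED = re.compile(r"^\[\s*(\d+\.\d+)\].*GUC: Engine reset failed on")


def reset_failure_times(lines):
    """Return the timestamps (seconds since boot) of each failed engine reset."""
    times = []
    for line in lines:
        match = RESET_FAILED.match(line)
        if match:
            times.append(float(match.group(1)))
    return times


def summarize(times):
    """Return (uptime in hours at the first failure, gaps in seconds between failures)."""
    if not times:
        return None, []
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return times[0] / 3600.0, gaps


# The three failure lines from the excerpt above, plus one unrelated line.
sample = [
    "[151055.398471] i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: Engine reset failed on 0:0 (rcs0) because 0x00000000",
    "[151055.556521] i915 0000:00:02.0: [drm] GT0: GUC: submission enabled",
    "[151069.225354] i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: Engine reset failed on 0:0 (rcs0) because 0x00000000",
    "[151078.588765] i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: Engine reset failed on 0:0 (rcs0) because 0x00000000",
]

if __name__ == "__main__":
    hours, gaps = summarize(reset_failure_times(sample))
    print(f"first failure after {hours:.2f} h of uptime")
    print("seconds between failures:", [f"{gap:.1f}" for gap in gaps])
```

On the excerpt above this gives a first failure at about 41.96 h of uptime, with the hang recurring roughly 13.8 s and 9.4 s after the previous one, so the same context (gibson:gdrv0) hangs again within seconds of each reset.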
The full dmesg is attached: dmesg.gz
The GPU error log is attached: error.gz
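In case it helps others reproduce the capture step, a minimal Python sketch of how the error state can be saved as a compressed file. The source path is the one printed in the dmesg output above; the function name and the default output name are hypothetical, and I am assuming the script runs with enough privileges to read the sysfs file (typically root). As far as I know the error state is not kept across a reboot, so it should be copied before restarting.

```python
import gzip
import shutil


def save_error_state(src="/sys/class/drm/card0/error", dst="error.gz"):
    """Copy the i915 error state from src into a gzip-compressed file at dst."""
    # Stream the copy so a large dump is never held in memory all at once.
    with open(src, "rb") as source, gzip.open(dst, "wb") as target:
        shutil.copyfileobj(source, target)
    return dst


if __name__ == "__main__":
    print("saved", save_error_state())
```

The paths are parameters so the same function works for card1 on multi-GPU machines or for a copy of the dump taken earlier.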