GPU Hang on Lenovo ThinkPad T15 Gen 2i with Iris Xe Graphics
Machine details:
Model: Lenovo ThinkPad T15 Gen 2i
CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
RAM: 40GB (8GB onboard and 32GB in a single DIMM)
OS: Arch Linux, kernel 6.0.9-arch1-1 (also occurs on kernel 6.0.6-arch1-1)
DE: KDE Plasma on X11
Issue: After extended use, the GPU hangs, killing X in odd ways. The error messages in dmesg are as follows:
[124271.446623] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:849f7c04, in CanvasRenderer [2450]
[124271.447807] i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0
[124271.550040] i915 0000:00:02.0: [drm] CanvasRenderer[2450] context reset due to GPU hang
[124271.550100] i915 0000:00:02.0: [drm] GuC firmware i915/tgl_guc_70.1.1.bin version 70.1
[124271.550102] i915 0000:00:02.0: [drm] HuC firmware i915/tgl_huc_7.9.3.bin version 7.9
[124271.557066] i915 0000:00:02.0: [drm] HuC authenticated
[124271.557531] i915 0000:00:02.0: [drm] GuC submission enabled
[124271.557532] i915 0000:00:02.0: [drm] GuC SLPC enabled
I initially suspected a problem in KDE, because the machine had been stable since purchase (almost a year ago) and I had not yet seen the dmesg error. I was also seeing the OOM killer fire as GPU-related processes ate RAM. That made kwin_x11 the suspect, but it was ruled out when the following errors appeared in the journal:
Nov 15 11:29:54 mallenovo kwin_x11[1021]: kwin_core: XCB error: 152 (BadDamage), sequence: 58189, resource id: 17407832, major code: 143 (DAMAGE), minor code: 3 (Subtract)
Nov 15 11:30:09 mallenovo kwin_x11[1021]: kwin_core: XCB error: 152 (BadDamage), sequence: 1058, resource id: 17408701, major code: 143 (DAMAGE), minor code: 3 (Subtract)
Nov 15 11:31:29 mallenovo kded5[1020]: Service ":1.476" unregistered
Nov 15 11:31:57 mallenovo kwin_x11[1021]: kwin_core: XCB error: 152 (BadDamage), sequence: 43279, resource id: 17413666, major code: 143 (DAMAGE), minor code: 3 (Subtract)
Nov 15 11:34:13 mallenovo kwin_x11[1021]: kwin_core: XCB error: 152 (BadDamage), sequence: 59254, resource id: 17415424, major code: 143 (DAMAGE), minor code: 3 (Subtract)
Nov 15 11:34:54 mallenovo kwin_x11[1021]: The X11 connection broke: I/O error (code 1)
Nov 15 11:34:54 mallenovo kwin_x11[1021]: XIO: fatal IO error 22 (Invalid argument) on X server ":0"
Since the X server itself was dying, kwin was not to blame. I then set up external logging for dmesg and captured the GPU HANG shown above.
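For anyone who wants to pick hang entries out of a saved kernel log, here is a minimal Python sketch. It is only an illustration, not the logging setup used for the capture above. The function name find_gpu_hangs is hypothetical, and the regular expression is an assumption modelled on the single "GPU HANG" line quoted earlier in this post, so the format may differ on other kernel versions.

```python
import re

# Pattern modelled on the "GPU HANG" dmesg line quoted in this post.
# Assumption: other kernel versions may word or order the fields differently.
HANG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+i915\s+(?P<dev>\S+):\s+\[drm\]\s+GPU HANG:\s+"
    r"ecode\s+(?P<ecode>[0-9a-fA-F:]+),\s+in\s+(?P<proc>.+?)\s+\[(?P<pid>\d+)\]"
)


def find_gpu_hangs(log_text):
    """Return one dict per i915 "GPU HANG" entry found in log_text."""
    hangs = []
    for match in HANG_RE.finditer(log_text):
        hangs.append({
            "timestamp": float(match.group("ts")),  # seconds since boot
            "device": match.group("dev"),           # PCI address of the GPU
            "ecode": match.group("ecode"),          # error code as printed
            "process": match.group("proc"),         # process named in the hang
            "pid": int(match.group("pid")),
        })
    return hangs
```

Passing the text of a saved dmesg capture to find_gpu_hangs returns a list of dictionaries, which makes it easy to see whether the same process and ecode show up across several hangs.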
Please let me know what other information is required.