[2020.08.07-4] i915 GPU hang report on 5.8.0-1-MANJARO kernel
This is another case of a GPU hang on the same PC (same hardware and Linux distro). It is part of an ongoing 1.5-month run of PC freezes and GPU hangs.
Package updates installed since the previous report, #2312 (closed):
grep --text -iE 'installed|upgraded' '/var/log/pacman.log' | tail -n 50
...
<no updates since the previous report>
Further report: #2314 (closed)
My PC has had more than 100 PC freezes and GPU hangs during the last 6 weeks, on every kernel 'family' (4.19, 5.4, 5.7, 5.8-rc) available in the distro. 4.19 looks more stable and is usually (but far from always) able to reset the GPU and continue working without a reboot. The newer the kernel, the sooner the GPU hangs without a successful software reset (which the 4.19 kernel can do), or the PC freezes.
A PC freeze or GPU hang usually happens while semi-transparency, fade in/out, or blur effects are in action.
I have a feeling that a rapid series of GPU hangs leads to a PC freeze. If only one or two GPU hangs happen at once, the PC may or may not freeze.
Steps to reproduce the issue in this ticket
This is the next PC boot after the #2312 (closed) issue. Again just the same: after the OS loaded I saw the usual GUI logon screen with the password field. Immediately afterwards the cursor stopped blinking and the picture froze. I did not touch the keyboard or mouse. End of that case.
In the next couple of boots (no error data collected) I was able to enter the password, but a few seconds after the desktop icons were shown the PC froze. The only way to bring the PC back was a hard reset. Then I tried the 4.19 kernel. The PC also froze after the desktop was shown, but a little later than with 5.8.0. With 4.19 this happened twice. How did I rescue the PC so that a user session could load? I booted 4.19, switched to tty2, removed the two kernel parameters that had been added for debugging (for this error report), uncommented all the lines in the /etc/X11/xorg.conf.d/20-intel.conf file, and after that the user GUI session started successfully.
journalctl excerpt:
Aug 07 16:21:04.154015 kernel: audit: type=1131 audit(1596817264.147:57): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 07 16:21:05.688518 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
Aug 07 16:21:05.689366 kernel: i915 0000:00:02.0: [drm] sddm-greeter[679] context reset due to GPU hang
Aug 07 16:21:05.689947 kernel: i915 0000:00:02.0: [drm:__i915_request_reset.cold [i915]] context sddm-greeter[679]: guilty 1, banned
Aug 07 16:21:05.690518 kernel: i915 0000:00:02.0: [drm:__i915_request_reset.cold [i915]] client sddm-greeter[679]: gained 4 ban score, now 4
Aug 07 16:21:05.696840 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:84df9ffc, in sddm-greeter [679]
Aug 07 16:21:05.697087 kernel: GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Aug 07 16:21:05.697105 kernel: Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/intel/issues/new.
Aug 07 16:21:05.697127 kernel: Please see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
Aug 07 16:21:05.697143 kernel: drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Aug 07 16:21:05.697157 kernel: The GPU crash dump is required to analyze GPU hangs, so please always attach it.
Aug 07 16:21:05.697176 kernel: GPU crash dump saved to /sys/class/drm/card0/error
Aug 07 16:21:05.697191 kernel: ------------[ cut here ]------------
Aug 07 16:21:05.697205 kernel: WARNING: CPU: 2 PID: 0 at kernel/sched/core.c:4488 default_wake_function+0x16/0x30
Aug 07 16:21:05.697219 kernel: Modules linked in: hid_logitech_hidpp mousedev joydev input_leds hid_logitech_dj snd_usb_audio snd_usbmidi_lib snd_hwdep snd_rawmidi hid_generic snd_seq_device mc usbhid snd_pcm snd_timer snd soundcore x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel rfkill kvm squashfs irqbypass ee1004 i915 crct10dif_pclmul iTCO_wdt intel_pmc_bxt iTCO_vendor_support loop intel_wmi_thunderbolt crc32_pclmul intel_rapl_msr ghash_clmulni_intel aesni_intel nls_iso8859_1 crypto_simd i2c_algo_bit cryptd glue_helper nls_cp437 rapl vfat intel_cstate r8169 fat drm_kms_helper intel_uncore realtek i2c_i801 pcspkr i2c_smbus libphy cec rc_core intel_gtt intel_pch_thermal syscopyarea intel_xhci_usb_role_switch sysfillrect processor_thermal_device sysimgblt intel_rapl_common roles fb_sys_fops intel_soc_dts_iosf wmi int3403_thermal int340x_thermal_zone i2c_hid bmc150_accel_i2c bmc150_accel_core hid industrialio_triggered_buffer kfifo_buf evdev industrialio mac_hid int3400_thermal acpi_thermal_rel drm
Aug 07 16:21:05.704052 kernel: sg crypto_user agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 xhci_pci xhci_pci_renesas crc32c_intel xhci_hcd
Aug 07 16:21:05.704108 kernel: CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.8.0-1-MANJARO #1
Aug 07 16:21:05.704146 kernel: Hardware name: Default string Default string/Default string, BIOS 5.12 11/10/2018
Aug 07 16:21:05.704168 kernel: RIP: 0010:default_wake_function+0x16/0x30
Aug 07 16:21:05.704189 kernel: Code: e8 6f de 3d 00 eb 99 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f7 c2 fe ff ff ff 75 09 48 8b 7f 08 e9 0a f9 ff ff <0f> 0b 48 8b 7f 08 e9 ff f8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00
Aug 07 16:21:05.704206 kernel: RSP: 0018:ffffa45580178d80 EFLAGS: 00010086
Aug 07 16:21:05.704228 kernel: RAX: ffffffff9aee4c40 RBX: ffffa45580403d30 RCX: ffffa45580178d98
Aug 07 16:21:05.704249 kernel: RDX: 00000000fffffffb RSI: 0000000000000003 RDI: ffffa45580403d30
Aug 07 16:21:05.704266 kernel: RBP: ffff92455d97a568 R08: 0000000000000001 R09: 0000000000000000
Aug 07 16:21:05.704283 kernel: R10: ffff92455d1cbc00 R11: 0000000000001001 R12: 0000000000000046
Aug 07 16:21:05.704299 kernel: R13: ffff92455d97a560 R14: ffffa45580178d98 R15: ffff924566868940
Aug 07 16:21:05.704321 kernel: FS: 0000000000000000(0000) GS:ffff924581b00000(0000) knlGS:0000000000000000
Aug 07 16:21:05.704338 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 07 16:21:05.704353 kernel: CR2: 00007f58919fa2d0 CR3: 00000001ffe0a001 CR4: 00000000003606e0
Aug 07 16:21:05.704368 kernel: Call Trace:
Aug 07 16:21:05.704384 kernel: <IRQ>
Aug 07 16:21:05.704400 kernel: autoremove_wake_function+0xe/0x30
Aug 07 16:21:05.704415 kernel: __i915_sw_fence_complete+0x156/0x1b0 [i915]
Aug 07 16:21:05.704431 kernel: dma_i915_sw_fence_wake_timer+0x2c/0x50 [i915]
Aug 07 16:21:05.704446 kernel: signal_irq_work+0x23e/0x350 [i915]
Aug 07 16:21:05.704486 kernel: irq_work_single+0x2c/0x40
Aug 07 16:21:05.704501 kernel: irq_work_run_list+0x2d/0x40
Aug 07 16:21:05.704520 kernel: irq_work_run+0x26/0x40
Aug 07 16:21:05.704536 kernel: __sysvec_irq_work+0x2d/0xf0
Aug 07 16:21:05.704551 kernel: sysvec_irq_work+0x41/0xe0
Aug 07 16:21:05.704566 kernel: asm_sysvec_irq_work+0x12/0x20
Aug 07 16:21:05.704581 kernel: RIP: 0010:__do_softirq+0x93/0x352
Aug 07 16:21:05.704597 kernel: Code: c7 44 24 28 0a 00 00 00 44 89 74 24 04 48 c7 c7 ef 9f 1d 9c e8 8e 19 bf ff 65 66 c7 05 b4 ba 42 64 00 00 fb 66 0f 1f 44 00 00 <48> c7 44 24 08 c0 50 40 9c b8 ff ff ff ff 0f bc 44 24 04 83 c0 01
Aug 07 16:21:05.704613 kernel: RSP: 0018:ffffa45580178f90 EFLAGS: 00000292
Aug 07 16:21:05.704633 kernel: RAX: 0000000000000002 RBX: ffff92457deb9f00 RCX: 000000000000001f
Aug 07 16:21:05.704650 kernel: RDX: 0000000000000000 RSI: ffffffff9c1d9fef RDI: ffffffff9c172046
Aug 07 16:21:05.704668 kernel: RBP: ffffa455800efd60 R08: 000000039effd0d5 R09: 0000000000000000
Aug 07 16:21:05.704688 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff92457c40c200
Aug 07 16:21:05.704704 kernel: R13: ffffffff9af0c920 R14: 0000000000000001 R15: ffffa45580179000
Aug 07 16:21:05.704720 kernel: ? handle_fasteoi_irq+0x210/0x210
Aug 07 16:21:05.704740 kernel: ? handle_irq_event+0x78/0xb0
Aug 07 16:21:05.704756 kernel: ? handle_fasteoi_irq+0x210/0x210
Aug 07 16:21:05.704771 kernel: asm_call_on_stack+0x12/0x20
Aug 07 16:21:05.704786 kernel: </IRQ>
Aug 07 16:21:05.704802 kernel: do_softirq_own_stack+0x5f/0x80
Aug 07 16:21:05.704817 kernel: irq_exit_rcu+0xcb/0x120
Aug 07 16:21:05.704832 kernel: common_interrupt+0xd1/0x200
Aug 07 16:21:05.704847 kernel: asm_common_interrupt+0x1e/0x40
Aug 07 16:21:05.704868 kernel: RIP: 0010:cpuidle_enter_state+0xb6/0x420
Aug 07 16:21:05.704887 kernel: Code: 80 76 a2 64 e8 5b 3d 8e ff 49 89 c7 0f 1f 44 00 00 31 ff e8 8c 4b 8e ff 80 7c 24 0f 00 0f 85 06 02 00 00 fb 66 0f 1f 44 00 00 <45> 85 e4 0f 88 e9 01 00 00 49 63 d4 4c 2b 7c 24 10 48 8d 04 52 48
Aug 07 16:21:05.704904 kernel: RSP: 0018:ffffa455800efe78 EFLAGS: 00000246
Aug 07 16:21:05.704919 kernel: RAX: ffff924581b00000 RBX: ffff924581b36800 RCX: 000000000000001f
Aug 07 16:21:05.704937 kernel: RDX: 0000000000000000 RSI: ffffffff9c16a0b2 RDI: ffffffff9c149f8f
Aug 07 16:21:05.704957 kernel: RBP: ffffffff9c4c9bc0 R08: 000000039effc406 R09: 0000000000000020
Aug 07 16:21:05.704974 kernel: R10: 000000000002caa6 R11: 00000000000031ea R12: 0000000000000008
Aug 07 16:21:05.704990 kernel: R13: ffff924581b36800 R14: 0000000000000008 R15: 000000039effc406
Aug 07 16:21:05.705009 kernel: ? cpuidle_enter_state+0xa4/0x420
Aug 07 16:21:05.705028 kernel: cpuidle_enter+0x29/0x40
Aug 07 16:21:05.705044 kernel: do_idle+0x1fb/0x2c0
Aug 07 16:21:05.705060 kernel: cpu_startup_entry+0x19/0x20
Aug 07 16:21:05.705075 kernel: start_secondary+0x178/0x1c0
Aug 07 16:21:05.705094 kernel: secondary_startup_64+0xb6/0xc0
Aug 07 16:21:05.705118 kernel: ---[ end trace 82986c21e6f5be58 ]---
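To pull only the hang-related lines out of a long journal like the excerpt above, a filter along the following lines may help. This is just a sketch: the function names i915_hang_lines and i915_hang_count are made up for illustration, and the patterns are taken from the messages in this excerpt, so other kernel versions may word them differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any tool): keep only the i915 reset/hang
# lines from journal text supplied on stdin.
i915_hang_lines() {
    grep -E 'i915 .*(GPU HANG|context reset due to GPU hang|Resetting [a-z0-9]+ for)'
}

# Hypothetical helper: count "GPU HANG: ecode" events in journal text on stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the exit
# status clean while the printed count stays correct.
i915_hang_count() {
    grep -c 'GPU HANG: ecode' || true
}
```

Possible usage, with the same journalctl options as in the collection list of this report: journalctl -b -o short-precise --no-hostname --dmesg | i915_hang_lines. Counting the events per boot this way could help check the guess that a rapid series of hangs precedes a full freeze.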
How often do the steps listed above trigger the issue
The frequency of PC freezes or GPU hangs is close to the highest possible. It can happen on the logon screen without any user activity or during a GUI session, in the 1st, 5th, or 40th minute. The average is about 1-2 minutes. It is not tied to one specific action; it happens unexpectedly during many kinds of typical user activity, such as:
-) on the logon screen (without any user action, not even touching the mouse; seen about 5-6 times),
-) moving desktop icons,
-) opening the start menu,
-) opening a context menu,
-) moving the cursor in a text editor with the keyboard navigation keys,
-) browsing the system settings window,
-) typing text in a GUI terminal emulator,
-) installing updates in a GUI app or a GUI terminal emulator,
-) opening or browsing in the Opera web browser: viewing a list of gitlab commits, watching youtube videos (not fullscreen, and without touching the keyboard or mouse for at least the last 1-2 minutes), extremely fast freeze/crash while browsing maps.google.com, maps.ya.ru,
etc.
Platform (CPU): Intel Core i5-8250U
System architecture (uname -m): x86_64
Kernel version (uname -r): 5.8.0-1-MANJARO
Linux distribution: Manjaro Linux (desktop environment: KDE)
Machine or motherboard model: Hystou Fanless Mini PC P03B-i5-8250U
Display connector: factory-made cable with connectors: HDMI (connected to PC) - DVI-D (connected to monitor)
Also:
KDE System Settings has the default Compositor settings.
The GPU settings file /etc/X11/xorg.conf.d/20-intel.conf is empty.
Error data was gathered after switching to tty2, using collect_GPU_crash_data.zip, which collects:
# Collect main data
sudo cp /sys/class/drm/card0/error ...
sudo dmesg
journalctl -b -o short-precise --no-hostname --dmesg
journalctl -b -o short-precise --no-hostname
cat /proc/cmdline
# Collect supplementary data
xrandr --verbose
sudo dmidecode -t bios -t system -t baseboard -t chassis -t processor
mhwd -l -d
cp /etc/X11/xorg.conf.d/20-intel.conf ...
sudo lspci -vvv -G
sudo lspci -vvv -G -H1
sudo lspci -vvv -G -H2
lscpu
lsmod
modinfo i915
modinfo drm
modinfo drm_kms_helper
modinfo intel_gtt
modinfo i2c_algo_bit
sudo systool -v -m i915
sudo systool -v -m drm
sudo systool -v -m drm_kms_helper
sudo systool -v -m intel_gtt
sudo systool -v -m i2c_algo_bit
uname -m
uname -r
inxi -CIGMxxx --no-host
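For anyone who wants to gather similar data, the commands listed above could be wrapped in a small helper like the one below. This is a minimal sketch and not the content of collect_GPU_crash_data.zip: the function name collect, the OUT_DIR variable, and the output file names are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical output directory; override by setting OUT_DIR before calling.
OUT_DIR="${OUT_DIR:-./gpu_crash_data}"

# Hypothetical helper: run a command and save its stdout and stderr to
# "$OUT_DIR/<label>.txt". A failing command is reported on stderr but does
# not abort the rest of the collection.
collect() {
    local label="$1"
    shift
    mkdir -p "$OUT_DIR"
    if ! "$@" > "$OUT_DIR/$label.txt" 2>&1; then
        echo "warning: '$*' failed, see $OUT_DIR/$label.txt" >&2
    fi
}

# Example calls, using commands from the list above:
#   collect cmdline cat /proc/cmdline
#   collect journal_dmesg journalctl -b -o short-precise --no-hostname --dmesg
#   collect uname_r uname -r
```

Keeping one output file per command makes it easy to see which tool was missing or failed on a given machine, since the error text ends up in the file for that command.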
The /sys/class/drm/card0/error file alone: 0_content_of__sys_class_drm_card0_error.zip
All gathered data (including the error file above) is in the archive: