[2020.08.07-6] i915 GPU hang report on 5.8.0-1-MANJARO kernel
This is another case of a GPU hang on the same PC (same hardware and Linux distribution). It is part of an ongoing 1.5-month run of PC freezes and GPU hangs.
Package updates since the previous report, #2314 (closed):
grep --text -iE 'installed|upgraded' '/var/log/pacman.log' | tail -n 50
...
<got no updates since the prev. ticket>
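To make the "no updates since the previous ticket" check repeatable, the grep above could be narrowed to entries on or after a given date. This is only a sketch: the function name `pacman_changes_since` is hypothetical, and it assumes every log line starts with a bracketed timestamp that begins with `YYYY-MM-DD` and that package changes are tagged `[ALPM] installed` or `[ALPM] upgraded`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print install/upgrade entries dated on or after a cutoff.
# Assumption: each line starts with "[YYYY-MM-DD..." (ISO-style date first).
pacman_changes_since() {
    # $1: cutoff date such as 2020-08-01; $2: path to the pacman log
    awk -v since="$1" '
        /\[ALPM\] (installed|upgraded)/ {
            stamp = substr($1, 2, 10)   # the 10 date characters after the leading "["
            if (stamp >= since) print   # plain string comparison works for ISO dates
        }' "$2"
}
```

For example, `pacman_changes_since 2020-08-01 /var/log/pacman.log` would list only the changes made since the start of August; empty output means nothing changed.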
Further ticket: #2326 (closed)
My PC has had more than 100 PC freezes and GPU hangs during the last 6 weeks, on every kernel 'family' (4.19, 5.4, 5.7, 5.8-rc) available in the distro. 4.19 appears more stable: it usually (but far from always) manages to reset the GPU and keep working without a reboot. The newer the kernel, the sooner the GPU hangs without a successful software reset (which the 4.19 kernel can do), or the PC freezes.
A PC freeze or GPU hang usually happens while semi-transparency, fade-in/out, or blur effects are in action.
I have a feeling that a rapid series of GPU hangs leads the PC to freeze. If only one or two GPU hangs happen 'at once', then the PC may or may not freeze.
Steps to reproduce the issue in this ticket
I was typing a description of the previous hang issue, #2314 (closed), in the Opera web browser. I switched the active window to Krusader and opened a dumped journalctl file (~70 MB) in Krusader's Viewer tool (F3 hot key). I started a phrase search for GPU or GPU. It found several matches close to each other; I pressed F3 again and... got a GPU hang with 3 pictures alternating on screen. I took photos and then pressed the hot key to collect error data.
journalctl excerpt:
Aug 07 20:39:02.817278 kernel: audit: type=1130 audit(1596832742.810:100): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=systemd-tmpfiles-clean comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 07 20:39:02.817359 kernel: audit: type=1131 audit(1596832742.810:101): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=systemd-tmpfiles-clean comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 07 20:39:02.901354 kernel: Asynchronous wait on fence 0000:00:02.0:kwin_x11[805]:3828 timed out (hint:intel_atomic_commit_ready [i915])
Aug 07 20:39:02.901507 kernel: ------------[ cut here ]------------
Aug 07 20:39:02.901605 kernel: WARNING: CPU: 0 PID: 0 at kernel/sched/core.c:4488 default_wake_function+0x16/0x30
Aug 07 20:39:02.901681 kernel: Modules linked in: snd_seq_dummy snd_hrtimer snd_seq fuse hid_logitech_hidpp joydev mousedev input_leds hid_logitech_dj snd_usb_audio snd_usbmidi_lib snd_hwdep snd_rawmidi snd_seq_device mc snd_pcm snd_timer snd soundcore hid_generic usbhid rfkill i915 x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel squashfs kvm irqbypass loop crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt intel_pmc_bxt iTCO_vendor_support ee1004 aesni_intel i2c_algo_bit intel_rapl_msr intel_wmi_thunderbolt nls_iso8859_1 crypto_simd nls_cp437 cryptd vfat glue_helper fat drm_kms_helper rapl intel_cstate cec r8169 intel_uncore rc_core i2c_i801 realtek intel_gtt pcspkr i2c_smbus syscopyarea libphy sysfillrect intel_pch_thermal processor_thermal_device sysimgblt intel_xhci_usb_role_switch intel_rapl_common roles fb_sys_fops intel_soc_dts_iosf wmi int3403_thermal int340x_thermal_zone bmc150_accel_i2c bmc150_accel_core industrialio_triggered_buffer kfifo_buf i2c_hid hid industrialio evdev mac_hid
Aug 07 20:39:02.905220 kernel: int3400_thermal acpi_thermal_rel drm sg crypto_user agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 xhci_pci xhci_pci_renesas xhci_hcd crc32c_intel
Aug 07 20:39:02.905337 kernel: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.8.0-1-MANJARO #1
Aug 07 20:39:02.905430 kernel: Hardware name: Default string Default string/Default string, BIOS 5.12 11/10/2018
Aug 07 20:39:02.905528 kernel: RIP: 0010:default_wake_function+0x16/0x30
Aug 07 20:39:02.905601 kernel: Code: e8 6f de 3d 00 eb 99 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f7 c2 fe ff ff ff 75 09 48 8b 7f 08 e9 0a f9 ff ff <0f> 0b 48 8b 7f 08 e9 ff f8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00
Aug 07 20:39:02.905673 kernel: RSP: 0018:ffffa8a800003e58 EFLAGS: 00010082
Aug 07 20:39:02.905742 kernel: RAX: ffffffff90ae4c40 RBX: ffffa8a800f9fd30 RCX: ffffa8a800003e70
Aug 07 20:39:02.905819 kernel: RDX: 00000000ffffff92 RSI: 0000000000000003 RDI: ffffa8a800f9fd30
Aug 07 20:39:02.905887 kernel: RBP: ffff99f383a47568 R08: 0000000000016473 R09: 0000000000000001
Aug 07 20:39:02.905963 kernel: R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000046
Aug 07 20:39:02.906039 kernel: R13: ffff99f383a47560 R14: ffffa8a800003e70 R15: ffff99f3d124d228
Aug 07 20:39:02.906109 kernel: FS: 0000000000000000(0000) GS:ffff99f401a00000(0000) knlGS:0000000000000000
Aug 07 20:39:02.906197 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 07 20:39:02.906292 kernel: CR2: 000055a7213fe048 CR3: 000000023b80a003 CR4: 00000000003606f0
Aug 07 20:39:02.906371 kernel: Call Trace:
Aug 07 20:39:02.906454 kernel: <IRQ>
Aug 07 20:39:02.906517 kernel: autoremove_wake_function+0xe/0x30
Aug 07 20:39:02.906593 kernel: __i915_sw_fence_complete+0x156/0x1b0 [i915]
Aug 07 20:39:02.906655 kernel: ? i915_sw_fence_complete+0x20/0x20 [i915]
Aug 07 20:39:02.906731 kernel: ? i915_sw_fence_complete+0x20/0x20 [i915]
Aug 07 20:39:02.906798 kernel: call_timer_fn+0x2d/0x160
Aug 07 20:39:02.906878 kernel: ? i915_sw_fence_complete+0x20/0x20 [i915]
Aug 07 20:39:02.906940 kernel: __run_timers+0x130/0x290
Aug 07 20:39:02.907017 kernel: run_timer_softirq+0x2b/0x50
Aug 07 20:39:02.907093 kernel: __do_softirq+0x10f/0x352
Aug 07 20:39:02.907162 kernel: asm_call_on_stack+0x12/0x20
Aug 07 20:39:02.907255 kernel: </IRQ>
Aug 07 20:39:02.907329 kernel: do_softirq_own_stack+0x5f/0x80
Aug 07 20:39:02.907399 kernel: irq_exit_rcu+0xcb/0x120
Aug 07 20:39:02.907468 kernel: sysvec_apic_timer_interrupt+0x46/0xe0
Aug 07 20:39:02.907537 kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
Aug 07 20:39:02.907797 kernel: RIP: 0010:cpuidle_enter_state+0xb6/0x420
Aug 07 20:39:02.907879 kernel: Code: 80 76 e2 6e e8 5b 3d 8e ff 49 89 c7 0f 1f 44 00 00 31 ff e8 8c 4b 8e ff 80 7c 24 0f 00 0f 85 06 02 00 00 fb 66 0f 1f 44 00 00 <45> 85 e4 0f 88 e9 01 00 00 49 63 d4 4c 2b 7c 24 10 48 8d 04 52 48
Aug 07 20:39:02.907951 kernel: RSP: 0018:ffffffff92003e40 EFLAGS: 00000246
Aug 07 20:39:02.908028 kernel: RAX: ffff99f401a00000 RBX: ffff99f401a36800 RCX: 000000000000001f
Aug 07 20:39:02.908097 kernel: RDX: 0000000000000000 RSI: ffffffff91d6a0b2 RDI: ffffffff91d49f8f
Aug 07 20:39:02.908166 kernel: RBP: ffffffff920c9bc0 R08: 000000d230ae7294 R09: 0000000000000018
Aug 07 20:39:02.908264 kernel: R10: 00000000000019ab R11: 0000000000000b40 R12: 0000000000000008
Aug 07 20:39:02.908347 kernel: R13: ffff99f401a36800 R14: 0000000000000008 R15: 000000d230ae7294
Aug 07 20:39:02.908420 kernel: cpuidle_enter+0x29/0x40
Aug 07 20:39:02.908489 kernel: do_idle+0x1fb/0x2c0
Aug 07 20:39:02.908558 kernel: cpu_startup_entry+0x19/0x20
Aug 07 20:39:02.908635 kernel: start_kernel+0x843/0x868
Aug 07 20:39:02.908722 kernel: secondary_startup_64+0xb6/0xc0
Aug 07 20:39:02.908792 kernel: ---[ end trace 86c32d5df596db55 ]---
Aug 07 20:39:02.908869 kernel: [drm:drm_atomic_state_default_clear [drm]] Clearing atomic state 00000000b1614ff6
Aug 07 20:39:02.908956 kernel: [drm:__drm_atomic_state_free [drm]] Freeing atomic state 00000000b1614ff6
Aug 07 20:39:05.488494 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
Aug 07 20:39:05.489657 kernel: i915 0000:00:02.0: [drm] kwin_x11[805] context reset due to GPU hang
Aug 07 20:39:05.490295 kernel: i915 0000:00:02.0: [drm:__i915_request_reset.cold [i915]] context kwin_x11[805]: guilty 1, banned
Aug 07 20:39:05.490926 kernel: i915 0000:00:02.0: [drm:__i915_request_reset.cold [i915]] client kwin_x11[805]: gained 3 ban score, now 3
Aug 07 20:39:05.497207 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:85d7fffb, in kwin_x11 [805]
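With a ~70 MB journal dump, the few lines that matter (fence timeout, engine reset, context ban, hang record) are easy to lose. The sketch below pulls them out and counts hang records. The function names are hypothetical, and the patterns are taken only from the message texts visible in the excerpt above, so other kernel versions may word these messages differently.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; patterns mirror the messages seen in the excerpt above.

# Read a journal on stdin and keep only hang-related i915 lines.
filter_gpu_hang_lines() {
    grep -E 'GPU HANG|context reset due to GPU hang|Resetting [a-z0-9]+ for|Asynchronous wait on fence'
}

# Print the number of "GPU HANG: ecode" records in a saved journal dump.
count_gpu_hangs() {
    grep -c 'GPU HANG: ecode' "$1"
}
```

A possible use is `journalctl -b -o short-precise --no-hostname --dmesg | filter_gpu_hang_lines`, or `count_gpu_hangs` on a dumped journal file to compare hang frequency between kernels.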
How often do the steps listed above trigger the issue
PC freezes for an unknown reason (a series of sequential GPU hangs is suspected) and GPU hangs logged in the systemd journal occur about as often as possible. It can happen on the logon screen without any user activity or during GUI session actions: in the 1st, 5th, or 40th minute. The average is about 1-2 minutes. It is not tied to one specific action; it is a general, unexpected failure that has happened during many types of typical user activity, such as:
-) on the logon screen (without any user action, not even a mouse touch; seen about 5-6 times);
-) moving desktop icons;
-) opening the start menu;
-) opening a context menu;
-) moving the cursor in a text editor via keyboard navigation keys;
-) browsing the system settings window;
-) typing text in a terminal emulator (GUI);
-) installing updates in a GUI app or GUI terminal emulator;
-) selecting text line by line in a text editor or canceling a selection in the Opera browser;
-) opening or browsing in the Opera web browser: viewing a list of GitLab commits, filling in the description of an issue ticket on gitlab.freedesktop.org, watching YouTube videos (not fullscreen, and without touching the keyboard or mouse for at least the last 1-2 minutes), an extremely fast freeze/crash while browsing maps.google.com, maps.ya.ru, etc.
Platform (CPU): Intel Core i5-8250U
System architecture (uname -m): x86_64
Kernel version (uname -r): 5.8.0-1-MANJARO
Linux distribution: Manjaro Linux (desktop environment: KDE)
Machine or motherboard model: Hystou Fanless Mini PC P03B-i5-8250U
Display connector: factory-made cable with HDMI (connected to PC) and DVI-D (connected to monitor) connectors
Also:
KDE System Settings has default Compositor settings.
The GPU settings file /etc/X11/xorg.conf.d/20-intel.conf is empty.
Error data were gathered in the current hung GUI user session (without switching to the tty2 text console) with the script collect_GPU_crash_data.zip, which collects:
# Collect main data
sudo cp /sys/class/drm/card0/error ...
sudo dmesg
journalctl -b -o short-precise --no-hostname --dmesg
journalctl -b -o short-precise --no-hostname
cat /proc/cmdline
# Collect supplementary data
xrandr --verbose
sudo dmidecode -t bios -t system -t baseboard -t chassis -t processor
mhwd -l -d
cp /etc/X11/xorg.conf.d/20-intel.conf ...
sudo lspci -vvv -G
sudo lspci -vvv -G -H1
sudo lspci -vvv -G -H2
lscpu
lsmod
modinfo i915
modinfo drm
modinfo drm_kms_helper
modinfo intel_gtt
modinfo i2c_algo_bit
sudo systool -v -m i915
sudo systool -v -m drm
sudo systool -v -m drm_kms_helper
sudo systool -v -m intel_gtt
sudo systool -v -m i2c_algo_bit
uname -m
uname -r
tty
inxi -CIGMxxx --no-host
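The attached script itself is not reproduced here. As a rough illustration of how such a collector might be structured, the sketch below runs each command and saves its output to a separate file, so that one failing command does not stop the rest. The names `run_and_save` and `collect_gpu_data`, the file naming, and the chosen subset of commands are all assumptions for illustration, not the contents of collect_GPU_crash_data.zip.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the attached script.

# Run one command and store its stdout and stderr in "<out_dir>/<name>.txt".
# A failing command is noted in its file instead of aborting the collection.
run_and_save() {
    local out_dir="$1" name="$2"
    shift 2
    mkdir -p "$out_dir"
    if ! "$@" > "$out_dir/$name.txt" 2>&1; then
        echo "command failed: $*" >> "$out_dir/$name.txt"
    fi
}

# Illustrative driver covering a few of the commands listed above.
collect_gpu_data() {
    local out_dir="$1"
    run_and_save "$out_dir" uname_r uname -r
    run_and_save "$out_dir" cmdline cat /proc/cmdline
    run_and_save "$out_dir" lsmod lsmod
    run_and_save "$out_dir" journal_dmesg journalctl -b -o short-precise --no-hostname --dmesg
}
```

Calling `collect_gpu_data "$HOME/gpu_crash_$(date +%Y.%m.%d_%H.%M.%S)"` from a hot key would then leave one text file per command in a timestamped directory, ready to be archived and attached.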
/sys/class/drm/card0/error file alone:
0_content_of__sys_class_drm_card0_error.zip
All gathered data (including the error file above) are in the archive:
2020.08.07_-20.40.23_collected_data_of_GPU_crash-_GPU_hang.zip
The same script gathered the data again on the next boot, while the GPU had not hung yet:
2020.08.07_-20.41.42_collected_data_of_GPU_crash-_the_next_boot__GPU_not_hanged_yet.zip