[2020.08.07] i915 GPU hang report on 5.8.0-1-MANJARO kernel
This is another GPU hang on the same PC (same hardware and Linux distribution). It is part of an ongoing, 1.5-month-long series of PC freezes and GPU hangs.
Since the previous report, #2305 (closed), the following packages have been updated:
grep --text -iE 'installed|upgraded' '/var/log/pacman.log' | tail -n 50
...
[2020-08-07T00:57:20+0000] [ALPM] upgraded pamac-common (9.5.6-2 -> 9.5.6-3)
[2020-08-07T00:57:20+0000] [ALPM] upgraded pamac-cli (9.5.6-2 -> 9.5.6-3)
[2020-08-07T00:57:20+0000] [ALPM] upgraded pamac-gtk (9.5.6-2 -> 9.5.6-3)
[2020-08-07T00:57:20+0000] [ALPM] upgraded pamac-snap-plugin (9.5.6-2 -> 9.5.6-3)
[2020-08-07T00:57:20+0000] [ALPM] upgraded pamac-tray-appindicator (9.5.6-2 -> 9.5.6-3)
The next issue: #2307 (closed)
My PC has experienced more than 100 PC freezes and GPU hangs during the last 6 weeks, on every kernel 'family' (4.19, 5.4, 5.7, 5.8-rc) available in the distro. 4.19 seems more stable and is usually (but far from always) able to reset the GPU and continue working without a PC reboot. The newer the kernel version, the sooner the GPU hangs without a successful software reset (which the 4.19 kernel can do), or the PC freezes.
A PC freeze or GPU hang usually happens while semi-transparency, fade in/out, or blur effects are in action.
I have a feeling that a rapid series of GPU hangs leads the PC to freeze. If only one or two GPU hangs happen 'at once', then the PC may or may not freeze.
Steps to reproduce the issue in this ticket
After the OS loaded, I logged in to a user session and started moving desktop icons to random positions, one icon at a time or in selected groups, about 40 meaningless movements in total. After that I pressed Ctrl+Alt+Del. The semi-transparent menu had just started to appear when the screen froze for several seconds, and then the picture began to alternate between 3 images on the monitor: a fully black background --(after 0.6-0.8 sec)--> 1/4 of the screen filled with relatively small colored rectangles --(another 0.6-0.8 sec delay)--> a full screen of those rectangles (see the out-of-focus photo below). These 3 images kept alternating in an infinite cycle. By pressing my hotkey I was able to run the script that collects the error data. The monitor stayed in that cycle until the black screen of the software-initiated PC reboot (the reboot is the last line of the script).
journalctl excerpt:
Aug 07 04:42:09.005041 kernel: ------------[ cut here ]------------
Aug 07 04:42:09.005161 kernel: WARNING: CPU: 1 PID: 0 at kernel/sched/core.c:4488 default_wake_function+0x16/0x30
Aug 07 04:42:09.005251 kernel: Modules linked in: snd_seq_dummy snd_hrtimer snd_seq fuse hid_logitech_hidpp input_leds joydev mousedev hid_logitech_dj hid_generic usbhid intel_xhci_usb_role_switch roles i915 x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_usb_audio iTCO_wdt intel_pmc_bxt ee1004 rfkill iTCO_vendor_support squashfs snd_usbmidi_lib snd_hwdep intel_wmi_thunderbolt intel_rapl_msr kvm snd_rawmidi snd_seq_device mc snd_pcm snd_timer snd soundcore loop irqbypass crct10dif_pclmul i2c_algo_bit crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd nls_iso8859_1 cryptd glue_helper rapl nls_cp437 drm_kms_helper vfat r8169 intel_cstate fat i2c_i801 realtek cec processor_thermal_device intel_uncore rc_core intel_gtt syscopyarea sysfillrect sysimgblt pcspkr intel_rapl_common i2c_smbus libphy fb_sys_fops intel_soc_dts_iosf intel_pch_thermal wmi int3403_thermal int340x_thermal_zone bmc150_accel_i2c bmc150_accel_core industrialio_triggered_buffer i2c_hid hid kfifo_buf industrialio evdev
Aug 07 04:42:09.005459 kernel: int3400_thermal mac_hid acpi_thermal_rel drm sg crypto_user agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 xhci_pci xhci_pci_renesas crc32c_intel xhci_hcd
Aug 07 04:42:09.005550 kernel: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.8.0-1-MANJARO #1
Aug 07 04:42:09.005643 kernel: Hardware name: Default string Default string/Default string, BIOS 5.12 11/10/2018
Aug 07 04:42:09.005720 kernel: RIP: 0010:default_wake_function+0x16/0x30
Aug 07 04:42:09.005812 kernel: Code: e8 6f de 3d 00 eb 99 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f7 c2 fe ff ff ff 75 09 48 8b 7f 08 e9 0a f9 ff ff <0f> 0b 48 8b 7f 08 e9 ff f8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00
Aug 07 04:42:09.005898 kernel: RSP: 0018:ffffae1d0013ce58 EFLAGS: 00010082
Aug 07 04:42:09.005991 kernel: RAX: ffffffffb5ee4c40 RBX: ffffae1d00f83d30 RCX: ffffae1d0013ce70
Aug 07 04:42:09.006083 kernel: RDX: 00000000ffffff92 RSI: 0000000000000003 RDI: ffffae1d00f83d30
Aug 07 04:42:09.006167 kernel: RBP: ffff976b1b502568 R08: 000000000001516f R09: 0000000000000001
Aug 07 04:42:09.006243 kernel: R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000046
Aug 07 04:42:09.006324 kernel: R13: ffff976b1b502560 R14: ffffae1d0013ce70 R15: ffff976aede80ea8
Aug 07 04:42:09.006399 kernel: FS: 0000000000000000(0000) GS:ffff976b41a80000(0000) knlGS:0000000000000000
Aug 07 04:42:09.006489 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 07 04:42:09.006592 kernel: CR2: 00007fa7fc122918 CR3: 000000030320a004 CR4: 00000000003606e0
Aug 07 04:42:09.006664 kernel: Call Trace:
Aug 07 04:42:09.006756 kernel: <IRQ>
Aug 07 04:42:09.006837 kernel: autoremove_wake_function+0xe/0x30
Aug 07 04:42:09.006934 kernel: __i915_sw_fence_complete+0x156/0x1b0 [i915]
Aug 07 04:42:09.007018 kernel: ? i915_sw_fence_complete+0x20/0x20 [i915]
Aug 07 04:42:09.007109 kernel: ? i915_sw_fence_complete+0x20/0x20 [i915]
Aug 07 04:42:09.007187 kernel: call_timer_fn+0x2d/0x160
Aug 07 04:42:09.007298 kernel: ? i915_sw_fence_complete+0x20/0x20 [i915]
Aug 07 04:42:09.007362 kernel: __run_timers+0x130/0x290
Aug 07 04:42:09.007463 kernel: run_timer_softirq+0x2b/0x50
Aug 07 04:42:09.007546 kernel: __do_softirq+0x10f/0x352
Aug 07 04:42:09.007627 kernel: asm_call_on_stack+0x12/0x20
Aug 07 04:42:09.007704 kernel: </IRQ>
Aug 07 04:42:09.007799 kernel: do_softirq_own_stack+0x5f/0x80
Aug 07 04:42:09.007876 kernel: irq_exit_rcu+0xcb/0x120
Aug 07 04:42:09.007959 kernel: sysvec_apic_timer_interrupt+0x46/0xe0
Aug 07 04:42:09.008055 kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
Aug 07 04:42:09.008139 kernel: RIP: 0010:cpuidle_enter_state+0xb6/0x420
Aug 07 04:42:09.008215 kernel: Code: 80 76 a2 49 e8 5b 3d 8e ff 49 89 c7 0f 1f 44 00 00 31 ff e8 8c 4b 8e ff 80 7c 24 0f 00 0f 85 06 02 00 00 fb 66 0f 1f 44 00 00 <45> 85 e4 0f 88 e9 01 00 00 49 63 d4 4c 2b 7c 24 10 48 8d 04 52 48
Aug 07 04:42:09.008320 kernel: RSP: 0018:ffffae1d000e7e78 EFLAGS: 00000246
Aug 07 04:42:09.008404 kernel: RAX: ffff976b41a80000 RBX: ffff976b41ab6800 RCX: 000000000000001f
Aug 07 04:42:09.008490 kernel: RDX: 0000000000000000 RSI: ffffffffb716a0b2 RDI: ffffffffb7149f8f
Aug 07 04:42:09.008581 kernel: RBP: ffffffffb74c9bc0 R08: 00000030dc13270f R09: 0000000000000006
Aug 07 04:42:09.008665 kernel: R10: 00000000000026ce R11: 00000000000026cc R12: 0000000000000008
Aug 07 04:42:09.008747 kernel: R13: ffff976b41ab6800 R14: 0000000000000008 R15: 00000030dc13270f
Aug 07 04:42:09.008837 kernel: ? cpuidle_enter_state+0xa4/0x420
Aug 07 04:42:09.008914 kernel: cpuidle_enter+0x29/0x40
Aug 07 04:42:09.008996 kernel: do_idle+0x1fb/0x2c0
Aug 07 04:42:09.009205 kernel: cpu_startup_entry+0x19/0x20
Aug 07 04:42:09.009299 kernel: start_secondary+0x178/0x1c0
Aug 07 04:42:09.009376 kernel: secondary_startup_64+0xb6/0xc0
Aug 07 04:42:09.009451 kernel: ---[ end trace d6bf14257c2faa58 ]---
Aug 07 04:42:09.023981 kernel: [drm:drm_atomic_state_default_clear [drm]] Clearing atomic state 00000000cc4097cb
Aug 07 04:42:09.024183 kernel: [drm:__drm_atomic_state_free [drm]] Freeing atomic state 00000000cc4097cb
Aug 07 04:42:10.525146 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
Aug 07 04:42:10.526185 kernel: i915 0000:00:02.0: [drm] kwin_x11[1380] context reset due to GPU hang
Aug 07 04:42:10.526836 kernel: i915 0000:00:02.0: [drm:__i915_request_reset.cold [i915]] context kwin_x11[1380]: guilty 1, banned
Aug 07 04:42:10.527551 kernel: i915 0000:00:02.0: [drm:__i915_request_reset.cold [i915]] client kwin_x11[1380]: gained 3 ban score, now 3
Aug 07 04:42:10.533131 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:85dffffb, in kwin_x11 [1380]
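To pull only the reset and hang messages out of a longer journal, a filter along these lines may help. It is a minimal sketch: the function name `gpu_hang_lines` is hypothetical, and the patterns are simply copied from the messages in the excerpt above, so other kernel versions may word them differently.

```shell
# Hypothetical helper: keep only i915 reset/hang lines from journal text on stdin.
# The three patterns come from the excerpt above and are not an exhaustive list.
gpu_hang_lines() {
    grep -E 'GPU HANG|context reset due to GPU hang|Resetting [a-z0-9]+ for '
}

# Example use on the kernel messages of the current boot:
#   journalctl -b -o short-precise --no-hostname --dmesg | gpu_hang_lines
```

Counting the output lines (for example with `grep -c .`) gives a rough number of hang events per boot.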
How often do the steps listed above trigger the issue
The frequency of PC freezes and GPU hangs is close to the highest possible. It can happen on the logon screen without any user activity or during GUI session actions: in the first, 5th, or 40th minute. The average is about 1-2 minutes. It is not tied to one specific action; it happens unexpectedly during many types of typical user activity, such as:
-) on the logon screen (without any user action, not even touching the mouse; seen about 2-3 times),
-) moving desktop icons,
-) opening the start menu,
-) opening a context menu,
-) moving the cursor in a text editor via keyboard navigation keys,
-) browsing the system settings window,
-) typing text in a terminal emulator (GUI),
-) installing updates in a GUI app or GUI terminal emulator,
-) opening or browsing in the Opera web browser: viewing a list of GitLab commits, watching YouTube videos (not fullscreen, and without touching the keyboard or mouse for at least the last 1-2 minutes); extremely fast freeze/crash while browsing maps.google.com, maps.ya.ru,
etc.
Platform (CPU): Intel Core i5-8250U
System architecture (uname -m): x86_64
Kernel version (uname -r): 5.8.0-1-MANJARO
Linux distribution: Manjaro Linux (desktop environment: KDE)
Machine or motherboard model: Hystou Fanless Mini PC P03B-i5-8250U
Display connector: factory-made cable with connectors: HDMI (connected to PC) - DVI-D (connected to monitor)
Also:
KDE System Settings has the default Compositor settings.
The GPU settings file /etc/X11/xorg.conf.d/20-intel.conf is empty.
The error data was gathered from within the hung GUI session (without switching to tty2) by pressing a custom KDE global hotkey that runs the script collect_GPU_crash_data.zip, which collects:
# Collect main data
sudo cp /sys/class/drm/card0/error ...
sudo dmesg
journalctl -b -o short-precise --no-hostname --dmesg
journalctl -b -o short-precise --no-hostname
cat /proc/cmdline
# Collect supplementary data
xrandr --verbose
sudo dmidecode -t bios -t system -t baseboard -t chassis -t processor
mhwd -l -d
cp /etc/X11/xorg.conf.d/20-intel.conf ...
sudo lspci -vvv -G
sudo lspci -vvv -G -H1
sudo lspci -vvv -G -H2
lscpu
lsmod
modinfo i915
modinfo drm
modinfo drm_kms_helper
modinfo intel_gtt
modinfo i2c_algo_bit
sudo systool -v -m i915
sudo systool -v -m drm
sudo systool -v -m drm_kms_helper
sudo systool -v -m intel_gtt
sudo systool -v -m i2c_algo_bit
uname -m
uname -r
inxi -CIGMxxx --no-host
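The collection steps above could be wrapped roughly as follows. This is only a minimal sketch and not the contents of the attached script: the `collect` helper, the `out_dir` variable and the `OUT_DIR` override are hypothetical names, and only a few commands that need no root access are shown.

```shell
# Hypothetical sketch, not the contents of collect_GPU_crash_data.zip.
# Assumed output location: $OUT_DIR if set, otherwise a fresh temporary directory.
out_dir="${OUT_DIR:-$(mktemp -d)}"

# Run a command and store its stdout and stderr in "$out_dir/<name>.txt".
# A failing or missing command is tolerated so that it does not stop the run.
collect() {
    name="$1"
    shift
    "$@" > "$out_dir/$name.txt" 2>&1 || true
}

collect uname_m uname -m
collect uname_r uname -r
collect proc_cmdline cat /proc/cmdline
```

Writing one file per command keeps a failed step (for example a tool that is not installed) visible in its own output file while the remaining steps still run, which matters when the script is started from a session that is already hanging.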
/sys/class/drm/card0/error file alone:
0_content_of__sys_class_drm_card0_error.zip
All gathered data (including the error file above) is in the archive:
2020.08.07_-04.42.54_collected_data_of_GPU_crash-_GPU_hanged.zip
There is also data collected by the same script in a 'clean GPU work state' (GPU not yet hung):
2020.08.07_-04.45.35_collected_data_of_GPU_crash-_a_further_boot_while_GPU_hang_not_happen.zip
Also, during this boot I got core dumps in the /var/lib/systemd/coredump/ path:
core.kglobalaccel5.1000.4bae0d81c08a455e8836c3bcfc705a93.1273.1596775185000000000000.zst
core.kwin_x11.1000.4bae0d81c08a455e8836c3bcfc705a93.820.1596775185000000000000.zst