[2020.08.05] i915 GPU hang report on 5.8.0-1-MANJARO kernel
This is another case of a GPU hang on the same PC (same hardware and Linux distro). I have been fighting this problem for about 1.5 months now.
Since the previous report, #2187 (comment 586608), these package updates have been installed:
grep --text -iE 'installed|upgraded' '/var/log/pacman.log' | tail -n 50
...
[2020-08-02T11:05:04+0000] [ALPM] upgraded imagemagick (7.0.10.24-2 -> 7.0.10.25-1)
[2020-08-02T11:05:04+0000] [ALPM] upgraded lib32-libx11 (1.6.9-1 -> 1.6.10-1)
[2020-08-02T11:05:04+0000] [ALPM] upgraded re2 (1:20200706-1 -> 1:20200801-1)
[2020-08-03T10:05:32+0000] [ALPM] upgraded python-urwid (2.1.0-2 -> 2.1.1-1)
[2020-08-03T14:21:40+0000] [ALPM] upgraded linux58 (5.8rc7.d0731.g7dc6fd0-1 -> 5.8.0-1)
[2020-08-03T16:32:21+0000] [ALPM] installed peg (0.1.18-2)
[2020-08-03T16:32:21+0000] [ALPM] installed intel-gpu-tools (1.25-2)
[2020-08-04T13:26:15+0000] [ALPM] upgraded libmfx (20.2.0-1 -> 20.2.1-1)
[2020-08-04T13:26:15+0000] [ALPM] upgraded libx11 (1.6.10-1 -> 1.6.10-2)
[2020-08-04T13:26:16+0000] [ALPM] upgraded linux-firmware (20200721.r1678.2b823fc-1 -> 20200803.r1680.9bc3789-1)
[2020-08-04T13:26:17+0000] [ALPM] upgraded python-setuptools (1:49.2.0-1 -> 1:49.2.1-1)
[2020-08-05T03:06:39+0000] [ALPM] upgraded libcap (2.37-1 -> 2.38-1)
[2020-08-05T03:06:39+0000] [ALPM] upgraded lib32-libcap (2.37-1 -> 2.38-1)
[2020-08-05T03:06:39+0000] [ALPM] upgraded libgusb (0.3.4-1 -> 0.3.5-1)
[2020-08-05T03:06:39+0000] [ALPM] upgraded libvpx (1.8.2-2 -> 1.9.0-1)
[2020-08-05T03:06:39+0000] [ALPM] upgraded mpg123 (1.26.1-1 -> 1.26.2-1)
[2020-08-05T03:06:39+0000] [ALPM] upgraded vulkan-icd-loader (1.2.147-1 -> 1.2.148-1)
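To narrow the list above down to the packages most likely to matter for a GPU hang, the log can be filtered by package name. The sketch below is only an illustration: the function name `filter_pacman_changes` is hypothetical, and the choice of the kernel and firmware packages (`linux58`, `linux-firmware`, both taken from the log above) as the interesting ones is an assumption.

```shell
# Print pacman.log entries for installed/upgraded packages whose name matches
# an extended regular expression. Reads the log on stdin.
# NOTE: the function name is hypothetical; it is not part of pacman.
filter_pacman_changes() {
    pattern=$1
    # --text: treat the log as text even if it contains stray binary bytes.
    # "|| true": finding no match is not treated as an error.
    grep --text -E "\[ALPM\] (installed|upgraded) (${pattern}) " || true
}

# Possible use (kernel and firmware packages only):
#   filter_pacman_changes 'linux[0-9]*|linux-firmware' < /var/log/pacman.log
```

With the log above, this would leave only the `linux58` and `linux-firmware` upgrade lines.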
The next issue is: #2305 (closed)
My PC has experienced more than 100 PC freezes and GPU hangs during the last 6 weeks, on every kernel series (4.19, 5.4, 5.7, 5.8-rc) available in the distro. 4.19 seems more stable: it is able to reset the GPU and continue working. The newer the kernel, the sooner the GPU hangs without a successful software reset (which the 4.19 kernel can do), or the PC freezes.
A PC freeze or GPU hang usually happens while semi-transparency, fade in/out, or blur effects are in action.
I have a feeling that a rapid series of GPU hangs leads to a PC freeze. If only one or two GPU hangs happen at once, the PC may or may not freeze.
Steps to reproduce the issue in this ticket
After logging in to the user session, I get the usual Plasma system notification about which network is connected (it appears on every boot). Before it disappears by timeout, I start randomly pressing the Ctrl+Alt+Del, Esc, and Meta keys to load the GPU more than a static picture would, by adding the transparency and rendering layers of the KDE Logout and Start menus. After about 2 seconds the picture froze for several seconds, and then I got small scattered colored rectangles on a black background instead of the usual desktop with icons, taskbar, etc.
journalctl excerpt:
Aug 05 05:16:38.944596 kernel: ------------[ cut here ]------------
Aug 05 05:16:38.944700 kernel: WARNING: CPU: 0 PID: 0 at kernel/sched/core.c:4488 default_wake_function+0x16/0x30
Aug 05 05:16:38.944779 kernel: Modules linked in: snd_seq_dummy snd_hrtimer snd_seq fuse hid_logitech_hidpp input_leds mousedev joydev snd_usb_audio hid_logitech_dj snd_usbmidi_lib snd_hwdep snd_rawmidi snd_seq_device mc snd_pcm snd_timer snd soundcore hid_generic usbhid i915 rfkill squashfs x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm loop iTCO_wdt intel_pmc_bxt ee1004 iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul i2c_algo_bit ghash_clmulni_intel nls_iso8859_1 aesni_intel drm_kms_helper intel_rapl_msr intel_wmi_thunderbolt nls_cp437 vfat crypto_simd fat cryptd glue_helper rapl cec intel_cstate rc_core intel_uncore r8169 realtek intel_gtt i2c_i801 pcspkr syscopyarea sysfillrect i2c_smbus processor_thermal_device sysimgblt intel_xhci_usb_role_switch intel_rapl_common libphy fb_sys_fops roles intel_soc_dts_iosf intel_pch_thermal wmi bmc150_accel_i2c bmc150_accel_core int3403_thermal int340x_thermal_zone industrialio_triggered_buffer kfifo_buf i2c_hid hid int3400_thermal
Aug 05 05:16:38.951998 kernel: industrialio evdev acpi_thermal_rel mac_hid drm sg crypto_user agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 xhci_pci xhci_pci_renesas xhci_hcd crc32c_intel
Aug 05 05:16:38.952129 kernel: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.8.0-1-MANJARO #1
Aug 05 05:16:38.952222 kernel: Hardware name: Default string Default string/Default string, BIOS 5.12 11/10/2018
Aug 05 05:16:38.952308 kernel: RIP: 0010:default_wake_function+0x16/0x30
Aug 05 05:16:38.952401 kernel: Code: e8 6f de 3d 00 eb 99 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f7 c2 fe ff ff ff 75 09 48 8b 7f 08 e9 0a f9 ff ff <0f> 0b 48 8b 7f 08 e9 ff f8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00
Aug 05 05:16:38.952478 kernel: RSP: 0018:ffff921b40003e58 EFLAGS: 00010082
Aug 05 05:16:38.952560 kernel: RAX: ffffffffafee4c40 RBX: ffff921b40b17d30 RCX: ffff921b40003e70
Aug 05 05:16:38.952654 kernel: RDX: 00000000ffffff92 RSI: 0000000000000003 RDI: ffff921b40b17d30
Aug 05 05:16:38.952727 kernel: RBP: ffff8f62eb67ed68 R08: 0000000000000bb8 R09: 0000000000000001
Aug 05 05:16:38.952809 kernel: R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000046
Aug 05 05:16:38.952882 kernel: R13: ffff8f62eb67ed60 R14: ffff921b40003e70 R15: ffff8f62dc4ea228
Aug 05 05:16:38.952963 kernel: FS: 0000000000000000(0000) GS:ffff8f6301a00000(0000) knlGS:0000000000000000
Aug 05 05:16:38.953037 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 05 05:16:38.953119 kernel: CR2: 00007f34eaeb6b90 CR3: 00000004fbc0a001 CR4: 00000000003606f0
Aug 05 05:16:38.953201 kernel: Call Trace:
Aug 05 05:16:38.953282 kernel: <IRQ>
Aug 05 05:16:38.953362 kernel: autoremove_wake_function+0xe/0x30
Aug 05 05:16:38.953435 kernel: __i915_sw_fence_complete+0x156/0x1b0 [i915]
Aug 05 05:16:38.953554 kernel: ? i915_sw_fence_complete+0x20/0x20 [i915]
Aug 05 05:16:38.953650 kernel: ? i915_sw_fence_complete+0x20/0x20 [i915]
Aug 05 05:16:38.953717 kernel: call_timer_fn+0x2d/0x160
Aug 05 05:16:38.953799 kernel: ? i915_sw_fence_complete+0x20/0x20 [i915]
Aug 05 05:16:38.953857 kernel: __run_timers+0x130/0x290
Aug 05 05:16:38.953928 kernel: run_timer_softirq+0x2b/0x50
Aug 05 05:16:38.953999 kernel: __do_softirq+0x10f/0x352
Aug 05 05:16:38.954070 kernel: asm_call_on_stack+0x12/0x20
Aug 05 05:16:38.954156 kernel: </IRQ>
Aug 05 05:16:38.954248 kernel: do_softirq_own_stack+0x5f/0x80
Aug 05 05:16:38.954340 kernel: irq_exit_rcu+0xcb/0x120
Aug 05 05:16:38.954416 kernel: sysvec_apic_timer_interrupt+0x46/0xe0
Aug 05 05:16:38.954489 kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
Aug 05 05:16:38.954569 kernel: RIP: 0010:cpuidle_enter_state+0xb6/0x420
Aug 05 05:16:38.954648 kernel: Code: 80 76 a2 4f e8 5b 3d 8e ff 49 89 c7 0f 1f 44 00 00 31 ff e8 8c 4b 8e ff 80 7c 24 0f 00 0f 85 06 02 00 00 fb 66 0f 1f 44 00 00 <45> 85 e4 0f 88 e9 01 00 00 49 63 d4 4c 2b 7c 24 10 48 8d 04 52 48
Aug 05 05:16:38.954722 kernel: RSP: 0018:ffffffffb1403e40 EFLAGS: 00000246
Aug 05 05:16:38.954808 kernel: RAX: ffff8f6301a00000 RBX: ffff8f6301a36800 RCX: 000000000000001f
Aug 05 05:16:38.954880 kernel: RDX: 0000000000000000 RSI: ffffffffb116a0b2 RDI: ffffffffb1149f8f
Aug 05 05:16:38.954953 kernel: RBP: ffffffffb14c9bc0 R08: 000000063e56e6ad R09: 0000000000000020
Aug 05 05:16:38.955043 kernel: R10: 000000000001b927 R11: 000000000005b174 R12: 0000000000000008
Aug 05 05:16:38.955128 kernel: R13: ffff8f6301a36800 R14: 0000000000000008 R15: 000000063e56e6ad
Aug 05 05:16:38.955203 kernel: cpuidle_enter+0x29/0x40
Aug 05 05:16:38.955274 kernel: do_idle+0x1fb/0x2c0
Aug 05 05:16:38.955346 kernel: cpu_startup_entry+0x19/0x20
Aug 05 05:16:38.955417 kernel: start_kernel+0x843/0x868
Aug 05 05:16:38.955488 kernel: secondary_startup_64+0xb6/0xc0
Aug 05 05:16:38.955553 kernel: ---[ end trace e6805eef7cbbdab8 ]---
Aug 05 05:16:38.955625 kernel: [drm:drm_atomic_state_default_clear [drm]] Clearing atomic state 0000000027bf8d91
Aug 05 05:16:38.955711 kernel: [drm:__drm_atomic_state_free [drm]] Freeing atomic state 0000000027bf8d91
Aug 05 05:16:41.531436 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
Aug 05 05:16:41.532317 kernel: i915 0000:00:02.0: [drm] kwin_x11[786] context reset due to GPU hang
Aug 05 05:16:41.532930 kernel: i915 0000:00:02.0: [drm:__i915_request_reset.cold [i915]] context kwin_x11[786]: guilty 1, banned
Aug 05 05:16:41.533563 kernel: i915 0000:00:02.0: [drm:__i915_request_reset.cold [i915]] client kwin_x11[786]: gained 4 ban score, now 4
Aug 05 05:16:41.550189 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:85dffffb, in kwin_x11 [786]
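The lines that actually report the hang are the last few in the excerpt above. A minimal sketch for pulling only those lines out of a long kernel log follows. The patterns are taken from the messages in this excerpt and may need adjusting for other kernel versions; the function name `filter_gpu_hang_lines` is hypothetical.

```shell
# Keep only kernel log lines that report an i915 GPU hang or engine reset.
# Reads log text on stdin, writes matching lines to stdout.
# NOTE: the patterns are assumptions based on the 5.8 messages quoted above.
filter_gpu_hang_lines() {
    grep -E 'GPU HANG|Resetting [a-z0-9]+ for|context reset due to GPU hang' || true
}

# Possible use on a live system (current boot, kernel messages only):
#   journalctl -b -k -o short-precise --no-hostname | filter_gpu_hang_lines
```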
How often do the steps listed above trigger the issue
The PC freezes or GPU hangs occur almost as often as possible. They can happen on the login screen without any user activity, or during a GUI session in the 1st, 5th, or 10th minute; the average is about 1-2 minutes. They are not tied to one specific action: they are unexpected and have happened during many kinds of typical user activity, such as:
-) on the login screen (without any user action, not even touching the mouse; seen about 2-3 times),
-) moving desktop icons,
-) opening the start menu,
-) opening a context menu,
-) moving the cursor in a text editor with the keyboard navigation keys,
-) browsing the System Settings window,
-) typing text in a GUI terminal emulator,
-) installing updates in a GUI app or a GUI terminal emulator,
-) opening or browsing in the Opera web browser: viewing a list of GitLab commits, watching YouTube videos (not fullscreen, and without touching the keyboard or mouse for at least the last 1-2 minutes); extremely fast freeze/crash while browsing maps.google.com, maps.ya.ru,
etc.
Platform (CPU): Intel Core i5-8250U
System architecture (uname -m): x86_64
Kernel version (uname -r): 5.8.0-1-MANJARO
Linux distribution: Manjaro Linux (desktop environment: KDE)
Machine or motherboard model: Hystou Fanless Mini PC P03B-i5-8250U
Display connector: factory-made cable with connectors: HDMI (connected to PC) - DVI-D (connected to monitor)
Also:
KDE System Settings has the default Compositor settings.
The GPU settings file /etc/X11/xorg.conf.d/20-intel.conf is empty.
Error data were gathered from within the hung GUI session (without switching to tty2) by pressing a custom KDE global hotkey, which runs the script collect_GPU_crash_data.zip. The script collects:
# Main data
sudo cp /sys/class/drm/card0/error ...
sudo dmesg
journalctl -b -o short-precise --no-hostname -k
journalctl -b -o short-precise --no-hostname
cat /proc/cmdline
# Supplementary data
xrandr --verbose
sudo dmidecode -t bios -t system -t baseboard -t chassis -t processor
mhwd -l -d
cp /etc/X11/xorg.conf.d/20-intel.conf ...
sudo lspci -vvv -G
sudo lspci -vvv -G -H1
sudo lspci -vvv -G -H2
lscpu
lsmod
modinfo i915
modinfo drm
sudo systool -v -m i915
sudo systool -v -m drm
inxi -CIGMxxx --no-host
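The real script is in the attached archive and is not reproduced here. As a rough sketch of the approach only, each command in the list above could be wrapped in a small helper that stores its output in a per-command file. The helper name `collect_cmd`, the output directory layout, and the file naming are all assumptions for illustration, not taken from the attached script.

```shell
# Run a command and store its combined stdout/stderr in "$outdir/$name.txt".
# A failing command does not abort the collection; its exit status is
# appended to the file instead.
# NOTE: hypothetical helper, not the code from collect_GPU_crash_data.zip.
collect_cmd() {
    outdir=$1
    name=$2
    shift 2
    mkdir -p "$outdir"
    if "$@" >"$outdir/$name.txt" 2>&1; then
        :
    else
        echo "exit status: $?" >>"$outdir/$name.txt"
    fi
}

# Possible use, with commands from the list above:
#   out="collected_data_$(date +%Y.%m.%d_-%H.%M.%S)"
#   collect_cmd "$out" cmdline        cat /proc/cmdline
#   collect_cmd "$out" journal_kernel journalctl -b -o short-precise --no-hostname -k
#   collect_cmd "$out" lsmod          lsmod
```

Recording the exit status rather than stopping matters in a hung session, where some tools may fail but the remaining data are still worth saving.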
The /sys/class/drm/card0/error file alone:
0_content_of__sys_class_drm_card0_error.zip
All gathered data (including the error file above) are in the archive:
2020.08.05_-05.16.46_collected_data_of_GPU_crash-_GPU_hang.zip
There are also data collected by the same script on the next boot, in a clean GPU state (before any GPU hang had occurred):
2020.08.05_-05.17.27_collected_data_of_GPU_crash-_next_boot_while_no_GPU_hang_happen.zip