[2020.08.13] i915 GPU hang report on 5.8.1-2-MANJARO kernel
This is an ongoing, two-month-long run of PC freezes and GPU hangs, now more than 200 cases. There has not been a single day without a GPU hang or a PC freeze.
A PC freeze or GPU hang usually happens while semi-transparency, fade in/out, or blur effects are in action. I have a feeling that a rapid series of GPU hangs leads the PC to freeze. If only one or two GPU hangs happen 'at once', then the PC may or may not freeze.
I have posted more than 30 reports of the GPU hang issue; they are daily reports by now. The website's captcha engine can no longer tell whether I am a human or a bot and makes me complete its tasks. Switching to the 4.19 kernel lowers the frequency of PC freezes, but the PC is still almost unusable. Is there any chance of starting to investigate the cause of the problem? Can the investigation be planned, or a refusal to investigate be posted?
Since the previous report #2334 (closed), these package updates were installed:
grep --text -iE 'installed|upgraded|removed' '/var/log/pacman.log' | tail -n 100
...
[2020-08-12T21:26:40+0000] [ALPM] upgraded pamac-common (9.5.7-3 -> 9.5.7-4)
[2020-08-12T21:26:40+0000] [ALPM] upgraded pamac-cli (9.5.7-3 -> 9.5.7-4)
[2020-08-12T21:26:40+0000] [ALPM] upgraded pamac-gtk (9.5.7-3 -> 9.5.7-4)
[2020-08-12T21:26:40+0000] [ALPM] upgraded pamac-snap-plugin (9.5.7-3 -> 9.5.7-4)
[2020-08-12T21:26:40+0000] [ALPM] upgraded pamac-tray-appindicator (9.5.7-3 -> 9.5.7-4)
Further ticket: #2342 (closed)
How the issue in this ticket happened
In the Opera web browser on cs-online.club: the page chooses a server and starts loading resources. I may have pressed a keyboard key to lower the volume. The picture froze for several seconds. The taskbar clock froze at 02:45:43 (HH:MM:SS format). Then the picture fully unfroze (the taskbar worked and windows opened). I was able to run (via a hotkey) the script that collects error data.
journalctl -b -o short-precise --no-hostname --dmesg
excerpt:
Aug 13 02:45:54.793180 kernel: ------------[ cut here ]------------
Aug 13 02:45:54.793313 kernel: WARNING: CPU: 1 PID: 0 at kernel/sched/core.c:4488 default_wake_function+0x16/0x30
Aug 13 02:45:54.793412 kernel: Modules linked in: snd_seq_dummy snd_hrtimer snd_seq fuse hid_logitech_hidpp joydev mousedev input_leds hid_logitech_dj hid_generic usbhid intel_xhci_usb_role_switch roles snd_usb_audio snd_usbmidi_lib snd_hwdep snd_rawmidi snd_seq_device x86_pkg_temp_thermal intel_powerclamp coretemp mc rfkill kvm_intel snd_pcm squashfs snd_timer kvm i915 snd ee1004 iTCO_wdt intel_pmc_bxt iTCO_vendor_support irqbypass loop soundcore intel_rapl_msr crct10dif_pclmul intel_wmi_thunderbolt crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd nls_iso8859_1 cryptd glue_helper nls_cp437 i2c_algo_bit rapl vfat intel_cstate fat r8169 drm_kms_helper intel_uncore i2c_i801 realtek cec pcspkr i2c_smbus libphy rc_core intel_gtt processor_thermal_device syscopyarea sysfillrect intel_rapl_common intel_pch_thermal sysimgblt intel_soc_dts_iosf fb_sys_fops wmi int3403_thermal int340x_thermal_zone bmc150_accel_i2c bmc150_accel_core industrialio_triggered_buffer i2c_hid kfifo_buf hid evdev industrialio mac_hid
Aug 13 02:45:54.796670 kernel: int3400_thermal acpi_thermal_rel drm sg crypto_user agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 xhci_pci xhci_pci_renesas crc32c_intel xhci_hcd
Aug 13 02:45:54.796826 kernel: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.8.1-2-MANJARO #1
Aug 13 02:45:54.796923 kernel: Hardware name: Default string Default string/Default string, BIOS 5.12 11/10/2018
Aug 13 02:45:54.797009 kernel: RIP: 0010:default_wake_function+0x16/0x30
Aug 13 02:45:54.797092 kernel: Code: e8 3f 87 3e 00 eb 99 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f7 c2 fe ff ff ff 75 09 48 8b 7f 08 e9 0a f9 ff ff <0f> 0b 48 8b 7f 08 e9 ff f8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00
Aug 13 02:45:54.797188 kernel: RSP: 0018:ffff9f2a4013ce58 EFLAGS: 00010082
Aug 13 02:45:54.797269 kernel: RAX: ffffffffa32e4c60 RBX: ffff9f2a4046fd30 RCX: ffff9f2a4013ce70
Aug 13 02:45:54.797349 kernel: RDX: 00000000ffffff92 RSI: 0000000000000003 RDI: ffff9f2a4046fd30
Aug 13 02:45:54.797440 kernel: RBP: ffff8bc93995bd68 R08: 000000000001e303 R09: 0000000000000001
Aug 13 02:45:54.797541 kernel: R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000046
Aug 13 02:45:54.797622 kernel: R13: ffff8bc93995bd60 R14: ffff9f2a4013ce70 R15: ffff8bc89bec1828
Aug 13 02:45:54.797718 kernel: FS: 0000000000000000(0000) GS:ffff8bc941a80000(0000) knlGS:0000000000000000
Aug 13 02:45:54.797804 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 13 02:45:54.797898 kernel: CR2: 00007f1d8b750000 CR3: 0000000642c0a005 CR4: 00000000003606e0
Aug 13 02:45:54.798011 kernel: Call Trace:
Aug 13 02:45:54.798099 kernel: <IRQ>
Aug 13 02:45:54.798180 kernel: autoremove_wake_function+0xe/0x30
Aug 13 02:45:54.798270 kernel: __i915_sw_fence_complete+0x156/0x1b0 [i915]
Aug 13 02:45:54.798363 kernel: ? i915_sw_fence_complete+0x20/0x20 [i915]
Aug 13 02:45:54.798454 kernel: ? i915_sw_fence_complete+0x20/0x20 [i915]
Aug 13 02:45:54.798528 kernel: call_timer_fn+0x2d/0x160
Aug 13 02:45:54.798631 kernel: ? i915_sw_fence_complete+0x20/0x20 [i915]
Aug 13 02:45:54.798706 kernel: __run_timers+0x130/0x290
Aug 13 02:45:54.798793 kernel: run_timer_softirq+0x2b/0x50
Aug 13 02:45:54.798874 kernel: __do_softirq+0x10f/0x352
Aug 13 02:45:54.798969 kernel: asm_call_on_stack+0x12/0x20
Aug 13 02:45:54.799060 kernel: </IRQ>
Aug 13 02:45:54.799156 kernel: do_softirq_own_stack+0x5f/0x80
Aug 13 02:45:54.799248 kernel: irq_exit_rcu+0xcb/0x120
Aug 13 02:45:54.799354 kernel: sysvec_apic_timer_interrupt+0x46/0xe0
Aug 13 02:45:54.799448 kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
Aug 13 02:45:54.799540 kernel: RIP: 0010:cpuidle_enter_state+0xb6/0x420
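To make it easier to see which driver frames are involved in a trace like the excerpt above, the journal output can be filtered down to the frames that carry a module tag such as [i915]. The following is only a minimal sketch: module_frames is a hypothetical helper name, not part of any tool, and it assumes the "kernel: " prefix format shown in the excerpt.

```shell
#!/bin/sh
# Hypothetical helper (not part of journalctl or the attached script).
# Reads journalctl/dmesg text on stdin and prints only the stack frames
# tagged with the given module name, e.g. "[i915]".
module_frames() {
    mod="${1:-i915}"
    # -F matches "[i915]" literally, so the "Modules linked in:" line,
    # which lists i915 without brackets, is not picked up.
    # sed strips the timestamp and "kernel: " prefix; a leading "? "
    # (frame the unwinder is not sure about) is kept as-is.
    grep -F "[$mod]" | sed 's/^.*kernel: *//'
}
```

Possible usage, with the same journalctl options as above: journalctl -b -o short-precise --no-hostname --dmesg | module_frames i915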
How often GPU hangs or PC freezes happen
The frequency of PC freezes (cause unknown; a series of sequential GPU hangs is suspected) and of GPU hangs logged in the systemd journal is near the highest possible. It can happen on the logon screen without any user activity or during a GUI session: in the first, 5th, or 40th minute. The average is about 2-3 minutes. It is not tied to one specific action; it is unpredictable and has happened during many types of typical user activity, such as:
-) on the logon screen (without any user action, not even touching the mouse; seen about 7-8 times);
-) moving desktop icons;
-) opening the start menu;
-) opening a context menu;
-) moving the cursor in a text editor with the keyboard navigation keys;
-) browsing the system settings window;
-) typing text in a terminal emulator (GUI);
-) installing updates in a GUI app or a GUI terminal emulator;
-) selecting text line by line in a text editor, or canceling a selection in the Opera browser;
-) opening or browsing in the Opera web browser: viewing a list of GitLab commits, filling in the description of an issue ticket on gitlab.freedesktop.org, watching YouTube videos (not fullscreen, and without touching the keyboard or mouse for at least the last 1-2 minutes), extremely fast freeze/crash while browsing maps.google.com, maps.ya.ru;
-) LiveCD GUI sessions;
etc.
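One way to put numbers on this frequency is to count kernel warning markers per day in the journal output. This is only a rough sketch: warnings_per_day is a hypothetical helper name, and it counts every "[ cut here ]" marker (as seen in the excerpt above), which the kernel prints for any warning, not only for i915 problems, so the result is an upper bound rather than an exact GPU hang count.

```shell
#!/bin/sh
# Hypothetical helper: reads journalctl text (short-precise format, as used
# in this report) on stdin and prints "<month> <day>: <count>" for every
# day that contains at least one "[ cut here ]" marker.
warnings_per_day() {
    awk '/\[ cut here \]/ { count[$1 " " $2]++ }
         END { for (day in count) print day ": " count[day] }' | sort
}
```

Possible usage: journalctl -b -o short-precise --no-hostname --dmesg | warnings_per_day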
Platform (CPU): Intel Core i5-8250U
System architecture (uname -m): x86_64
Kernel version (uname -r): 5.8.1-2-MANJARO
Linux distribution: Manjaro Linux (desktop environment: KDE)
Machine or motherboard model: Hystou Fanless Mini PC P03B-i5-8250U
Display connector: factory-made cable with HDMI (connected to the PC) and DVI-D (connected to the monitor) connectors
Error data was gathered in the current hung GUI user session (without switching to tty2 text mode) with the script collect_GPU_hang_data.zip, which collects:
# Collect main data
sudo cp /sys/class/drm/card0/error ...
sudo dmesg
journalctl -b -o short-precise --no-hostname --dmesg
cat /proc/cmdline
# Collect supplementary data
xrandr --verbose
sudo dmidecode -t bios -t system -t baseboard -t chassis -t processor
mhwd -l -d
cp /etc/X11/xorg.conf.d/20-intel.conf ...
sudo lspci -vvv -G
sudo lspci -vvv -G -H1
sudo lspci -vvv -G -H2
lscpu
lsmod
modinfo i915
modinfo drm
modinfo drm_kms_helper
modinfo intel_gtt
modinfo i2c_algo_bit
sudo systool -v -m i915
sudo systool -v -m drm
sudo systool -v -m drm_kms_helper
sudo systool -v -m intel_gtt
sudo systool -v -m i2c_algo_bit
uname -m
uname -r
tty
inxi -CIGMxxx --no-host
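For readers who want to reproduce this kind of collection, the commands above can be wrapped so that each output lands in its own file and one failing command does not stop the rest. This is a minimal sketch and not the contents of collect_GPU_hang_data.zip: the function names save_output and collect_gpu_hang_data, the OUT_DIR variable, and the output file names are all assumptions made for illustration, and only a subset of the listed commands is shown.

```shell
#!/bin/sh
# Hypothetical sketch; the attached script may work differently.

# save_output NAME CMD [ARGS...]
# Runs CMD and stores its stdout and stderr in "$OUT_DIR/NAME.txt".
# A failing command is noted in failed.txt but never aborts the run.
save_output() {
    name="$1"; shift
    mkdir -p "$OUT_DIR" || return 1
    if ! "$@" > "$OUT_DIR/$name.txt" 2>&1; then
        echo "FAILED: $*" >> "$OUT_DIR/failed.txt"
    fi
    return 0
}

# Collects a subset of the data listed above into a timestamped directory
# (or into the directory given as the first argument).
collect_gpu_hang_data() {
    OUT_DIR="${1:-gpu_hang_$(date +%Y%m%d_%H%M%S)}"
    # Main data, in the order listed above
    save_output card0_error sudo cat /sys/class/drm/card0/error
    save_output dmesg sudo dmesg
    save_output journal journalctl -b -o short-precise --no-hostname --dmesg
    save_output cmdline cat /proc/cmdline
    # Supplementary data (extend with the remaining commands as needed)
    save_output lsmod lsmod
    save_output modinfo_i915 modinfo i915
    save_output uname_r uname -r
}
```

Calling collect_gpu_hang_data with no arguments from a hotkey-bound script would write everything into a new gpu_hang_<timestamp> directory in the current working directory.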
The /sys/class/drm/card0/error file alone:
0_content_of__sys_class_drm_card0_error.zip
All gathered data (including the error file above) is in the archive: