[2020.08.12-2] i915 GPU hang report on 5.8.1-2-MANJARO kernel
This is part of my ongoing two-month run of PC freezes and GPU hangs, now more than 200 cases. Not a single day passes without a GPU hang or a PC freeze.
A PC freeze or GPU hang usually happens while semi-transparency, fade in/out, or blur effects are in action. I suspect that a rapid series of GPU hangs leads to a PC freeze. If only one or two GPU hangs happen 'at once', the PC may or may not freeze.
I have posted more than 30 reports of the GPU hang issue; they are daily reports by now. The website's captcha engine can no longer tell whether I am a human or a bot and makes me complete its tasks. Switching to the 4.19 kernel lowers the frequency of PC freezes, but the PC is still almost unusable. Is there any chance that an investigation into the cause of the problem will start? Can it be scheduled, or can a refusal to investigate be posted?
Since the previous report #2333 (closed), these package updates were installed:
grep --text -iE 'installed|upgraded|removed' '/var/log/pacman.log' | tail -n 100
...
<no packages updated since the previous ticket>
Further ticket: #2341 (closed)
How the issue in this ticket happened
I opened the Opera web browser with 3-4 tabs. They started to load, and 1-2 seconds after the Opera window appeared the picture froze. I was able to run the script that collects error data (via a hotkey). The taskbar clock froze at 18:25:57 (HH:MM:SS format).
journalctl -b -o short-precise --no-hostname --dmesg
excerpt:
Aug 12 18:25:58.253836 kernel: i915 0000:00:02.0: [drm:intel_plane_atomic_calc_changes [i915]] [CRTC:51:pipe A] with [PLANE:47:cursor A] visible 1 -> 1, off 0, on 0, ms 0
Aug 12 18:25:58.895227 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
Aug 12 18:25:58.896071 kernel: i915 0000:00:02.0: [drm] Xorg[575] context reset due to GPU hang
Aug 12 18:25:58.899703 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:85dffffb, in Xorg [575]
Aug 12 18:25:58.899918 kernel: GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Aug 12 18:25:58.899941 kernel: Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/intel/issues/new.
Aug 12 18:25:58.899958 kernel: Please see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
Aug 12 18:25:58.899978 kernel: drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Aug 12 18:25:58.899993 kernel: The GPU crash dump is required to analyze GPU hangs, so please always attach it.
Aug 12 18:25:58.900011 kernel: GPU crash dump saved to /sys/class/drm/card0/error
Aug 12 18:25:58.900029 kernel: ------------[ cut here ]------------
Aug 12 18:25:58.900049 kernel: WARNING: CPU: 2 PID: 0 at kernel/sched/core.c:4488 default_wake_function+0x16/0x30
Aug 12 18:25:58.900064 kernel: Modules linked in: snd_seq_dummy snd_hrtimer snd_seq fuse hid_logitech_hidpp mousedev input_leds joydev hid_logitech_dj snd_usb_audio snd_usbmidi_lib snd_hwdep snd_rawmidi snd_seq_device mc snd_pcm snd_timer snd soundcore hid_generic usbhid intel_rapl_msr ee1004 iTCO_wdt x86_pkg_temp_thermal intel_powerclamp coretemp intel_pmc_bxt kvm_intel iTCO_vendor_support kvm intel_wmi_thunderbolt rfkill squashfs irqbypass i915 crct10dif_pclmul loop crc32_pclmul ghash_clmulni_intel aesni_intel nls_iso8859_1 crypto_simd nls_cp437 cryptd glue_helper rapl intel_cstate vfat i2c_algo_bit intel_uncore fat drm_kms_helper r8169 i2c_i801 cec pcspkr realtek i2c_smbus libphy rc_core intel_pch_thermal intel_gtt processor_thermal_device syscopyarea intel_rapl_common sysfillrect intel_xhci_usb_role_switch sysimgblt roles fb_sys_fops intel_soc_dts_iosf wmi int3403_thermal int340x_thermal_zone bmc150_accel_i2c bmc150_accel_core industrialio_triggered_buffer kfifo_buf industrialio i2c_hid hid evdev
Aug 12 18:25:58.902789 kernel: int3400_thermal mac_hid acpi_thermal_rel drm sg crypto_user agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 xhci_pci xhci_pci_renesas crc32c_intel xhci_hcd
Aug 12 18:25:58.902968 kernel: CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.8.1-2-MANJARO #1
Aug 12 18:25:58.903020 kernel: Hardware name: Default string Default string/Default string, BIOS 5.12 11/10/2018
Aug 12 18:25:58.903055 kernel: RIP: 0010:default_wake_function+0x16/0x30
Aug 12 18:25:58.903088 kernel: Code: e8 3f 87 3e 00 eb 99 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f7 c2 fe ff ff ff 75 09 48 8b 7f 08 e9 0a f9 ff ff <0f> 0b 48 8b 7f 08 e9 ff f8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00
Aug 12 18:25:58.903125 kernel: RSP: 0018:ffffa7a7c0178d80 EFLAGS: 00010086
Aug 12 18:25:58.903156 kernel: RAX: ffffffff810e4c60 RBX: ffffa7a7c0493d30 RCX: ffffa7a7c0178d98
Aug 12 18:25:58.903183 kernel: RDX: 00000000fffffffb RSI: 0000000000000003 RDI: ffffa7a7c0493d30
Aug 12 18:25:58.903208 kernel: RBP: ffff98de5e7ec568 R08: 0000000000000001 R09: 0000000000000001
Aug 12 18:25:58.903233 kernel: R10: ffff98de56c29300 R11: 0000000000002400 R12: 0000000000000046
Aug 12 18:25:58.903258 kernel: R13: ffff98de5e7ec560 R14: ffffa7a7c0178d98 R15: ffff98de30a2c040
Aug 12 18:25:58.903283 kernel: FS: 0000000000000000(0000) GS:ffff98de81b00000(0000) knlGS:0000000000000000
Aug 12 18:25:58.903336 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 12 18:25:58.903371 kernel: CR2: 0000263bdbd7a000 CR3: 000000039b40a006 CR4: 00000000003606e0
Aug 12 18:25:58.903408 kernel: Call Trace:
Aug 12 18:25:58.903442 kernel: <IRQ>
Aug 12 18:25:58.903473 kernel: autoremove_wake_function+0xe/0x30
Aug 12 18:25:58.903502 kernel: __i915_sw_fence_complete+0x156/0x1b0 [i915]
Aug 12 18:25:58.903532 kernel: dma_i915_sw_fence_wake_timer+0x2c/0x50 [i915]
Aug 12 18:25:58.903561 kernel: [drm:drm_atomic_set_fb_for_plane [drm]] Set [FB:120] for [PLANE:47:cursor A] state 000000002a62a13a
Aug 12 18:25:58.903602 kernel: signal_irq_work+0x23e/0x350 [i915]
Aug 12 18:25:58.903639 kernel: i915 0000:00:02.0: [drm:intel_plane_atomic_calc_changes [i915]] [CRTC:51:pipe A] with [PLANE:47:cursor A] visible 1 -> 1, off 0, on 0, ms 0
Aug 12 18:25:58.904032 kernel: irq_work_single+0x2c/0x40
Aug 12 18:25:58.904066 kernel: irq_work_run_list+0x2d/0x40
Aug 12 18:25:58.904097 kernel: irq_work_run+0x26/0x40
Aug 12 18:25:58.904143 kernel: __sysvec_irq_work+0x2d/0xf0
Aug 12 18:25:58.904183 kernel: sysvec_irq_work+0x41/0xe0
Aug 12 18:25:58.904222 kernel: asm_sysvec_irq_work+0x12/0x20
Aug 12 18:25:58.904254 kernel: RIP: 0010:__do_softirq+0x93/0x352
Aug 12 18:25:58.904284 kernel: Code: c7 44 24 28 0a 00 00 00 44 89 74 24 04 48 c7 c7 77 47 3e 82 e8 8e 09 c0 ff 65 66 c7 05 b4 ba 22 7e 00 00 fb 66 0f 1f 44 00 00 <48> c7 44 24 08 c0 50 60 82 b8 ff ff ff ff 0f bc 44 24 04 83 c0 01
Aug 12 18:25:58.904322 kernel: RSP: 0018:ffffa7a7c0178f90 EFLAGS: 00000292
Aug 12 18:25:58.904354 kernel: RAX: 0000000000000002 RBX: ffff98de7deb8000 RCX: 000000000000001f
Aug 12 18:25:58.904391 kernel: RDX: 0000000000000000 RSI: ffffffff823e4777 RDI: ffffffff8237bd66
Aug 12 18:25:58.904422 kernel: RBP: ffffa7a7c00efd60 R08: 00000016880571ec R09: 0000000000000000
Aug 12 18:25:58.904458 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff98de7b211c00
Aug 12 18:25:58.904490 kernel: R13: ffffffff8110c940 R14: 0000000000000001 R15: ffffa7a7c0179000
Aug 12 18:25:58.904525 kernel: ? handle_fasteoi_irq+0x210/0x210
Aug 12 18:25:58.904561 kernel: ? handle_irq_event+0x78/0xb0
Aug 12 18:25:58.904592 kernel: ? handle_fasteoi_irq+0x210/0x210
Aug 12 18:25:58.904622 kernel: asm_call_on_stack+0x12/0x20
Aug 12 18:25:58.904658 kernel: </IRQ>
Aug 12 18:25:58.904691 kernel: do_softirq_own_stack+0x5f/0x80
Aug 12 18:25:58.904722 kernel: irq_exit_rcu+0xcb/0x120
Aug 12 18:25:58.904754 kernel: common_interrupt+0xd1/0x200
Aug 12 18:25:58.904790 kernel: asm_common_interrupt+0x1e/0x40
Aug 12 18:25:58.904822 kernel: RIP: 0010:cpuidle_enter_state+0xb6/0x420
Aug 12 18:25:58.904859 kernel: Code: 50 a0 81 7e e8 4b 67 8d ff 49 89 c7 0f 1f 44 00 00 31 ff e8 7c 75 8d ff 80 7c 24 0f 00 0f 85 06 02 00 00 fb 66 0f 1f 44 00 00 <45> 85 e4 0f 88 e9 01 00 00 49 63 d4 4c 2b 7c 24 10 48 8d 04 52 48
Aug 12 18:25:58.904896 kernel: RSP: 0018:ffffa7a7c00efe78 EFLAGS: 00000246
Aug 12 18:25:58.904935 kernel: RAX: ffff98de81b00000 RBX: ffff98de81b36800 RCX: 000000000000001f
Aug 12 18:25:58.904967 kernel: RDX: 0000000000000000 RSI: ffffffff82373bca RDI: ffffffff8235396f
Aug 12 18:25:58.904998 kernel: RBP: ffffffff826ca1a0 R08: 0000001688056578 R09: 0000000000000018
Aug 12 18:25:58.905029 kernel: R10: 000000000000a82d R11: 0000000000000a33 R12: 0000000000000004
Aug 12 18:25:58.905059 kernel: R13: ffff98de81b36800 R14: 0000000000000004 R15: 0000001688056578
Aug 12 18:25:58.905090 kernel: ? cpuidle_enter_state+0xa4/0x420
Aug 12 18:25:58.905136 kernel: cpuidle_enter+0x29/0x40
Aug 12 18:25:58.905173 kernel: do_idle+0x1fb/0x2c0
Aug 12 18:25:58.905205 kernel: cpu_startup_entry+0x19/0x20
Aug 12 18:25:58.905241 kernel: start_secondary+0x178/0x1c0
Aug 12 18:25:58.905272 kernel: secondary_startup_64+0xb6/0xc0
Aug 12 18:25:58.905302 kernel: ---[ end trace 26a3474779206123 ]---
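The key lines of the excerpt above can be pulled out of a saved journal with a small filter. This is only a minimal sketch, assuming the journal is in the `short-precise` format with `--no-hostname` as shown above; the function name `extract_gpu_hangs` is hypothetical and is not part of the attached script.

```shell
#!/bin/sh
# Hypothetical helper (illustrative name): read journal lines on stdin and,
# for each "GPU HANG" line, print the timestamp, error code, process and PID.
# Assumes the "Mon DD HH:MM:SS.micro kernel: ..." layout shown in the excerpt.
extract_gpu_hangs() {
    sed -n 's/^\([A-Z][a-z][a-z] *[0-9]* [0-9:.]*\) kernel: .*GPU HANG: ecode \([0-9a-f:]*\), in \([^ ]*\) \[\([0-9]*\)].*$/\1 ecode=\2 proc=\3 pid=\4/p'
}
```

It could be used as `journalctl -b -o short-precise --no-hostname --dmesg | extract_gpu_hangs`, which makes it easier to compare the error code and the affected process across reports.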
How often GPU hangs or PC freezes happen
The frequency of PC freezes (cause unknown; a series of sequential GPU hangs is suspected) and of GPU hangs logged in the systemd journal is close to the highest possible. One can happen on the logon screen without any user activity or during GUI session actions, in the 1st, 5th, or 40th minute. The average is about 2-3 minutes. It is not tied to one specific action; it is a general, unpredictable failure that has happened during many kinds of typical user activity, such as:
-) on the logon screen (without any user action, not even touching the mouse; seen about 7-8 times);
-) moving desktop icons;
-) opening the start menu;
-) opening a context menu;
-) moving the cursor in a text editor with the keyboard navigation keys;
-) browsing the system settings window;
-) typing text in a terminal emulator (GUI);
-) installing updates in a GUI app or a GUI terminal emulator;
-) selecting text line by line in a text editor, or canceling a selection in the Opera browser;
-) opening or browsing in the Opera web browser: viewing a list of GitLab commits, filling in the description of an issue ticket on this gitlab.freedesktop.org, watching YouTube videos (not fullscreen, and without touching the keyboard or mouse for at least the last 1-2 minutes), and extremely fast freezes/crashes while browsing maps.google.com and maps.ya.ru;
-) LiveCD GUI sessions;
etc.
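The frequency described above could be backed by a per-boot count of logged hangs. A minimal sketch, assuming that every hang produces one "GPU HANG: ecode" line in the kernel log as in the excerpt above; the function name `count_gpu_hangs` is hypothetical. Full PC freezes that leave no journal entry are not counted.

```shell
#!/bin/sh
# Hypothetical helper (illustrative name): count "GPU HANG" lines on stdin.
# grep -c prints 0 but exits with status 1 when nothing matches, so "|| true"
# keeps the function from failing under "set -e".
count_gpu_hangs() {
    grep -c 'GPU HANG: ecode' || true
}
```

For the current boot it could be run as `journalctl -b --dmesg | count_gpu_hangs`.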
Platform (CPU): Intel Core i5-8250U
System architecture (uname -m): x86_64
Kernel version (uname -r): 5.8.1-2-MANJARO
Linux distribution: Manjaro Linux (desktop environment: KDE)
Machine or motherboard model: Hystou Fanless Mini PC P03B-i5-8250U
Display connector: factory-made cable with connectors: HDMI (connected to PC) - DVI-D (connected to monitor)
The error data was gathered in the currently hung GUI user session (without switching to the tty2 text console) with the script collect_GPU_crash_data.zip, which collects:
# Collect main data
sudo cp /sys/class/drm/card0/error ...
sudo dmesg
journalctl -b -o short-precise --no-hostname --dmesg
cat /proc/cmdline
# Collect supplementary data
xrandr --verbose
sudo dmidecode -t bios -t system -t baseboard -t chassis -t processor
mhwd -l -d
cp /etc/X11/xorg.conf.d/20-intel.conf ...
sudo lspci -vvv -G
sudo lspci -vvv -G -H1
sudo lspci -vvv -G -H2
lscpu
lsmod
modinfo i915
modinfo drm
modinfo drm_kms_helper
modinfo intel_gtt
modinfo i2c_algo_bit
sudo systool -v -m i915
sudo systool -v -m drm
sudo systool -v -m drm_kms_helper
sudo systool -v -m intel_gtt
sudo systool -v -m i2c_algo_bit
uname -m
uname -r
tty
inxi -CIGMxxx --no-host
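The command list above can be wrapped so that every command writes to its own file and a failing command does not stop the rest of the collection. The following is only a minimal sketch of that idea, not the contents of the attached collect_GPU_crash_data.zip; `OUT_DIR` and `save_output` are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Hypothetical sketch; NOT the attached collection script.
# OUT_DIR and save_output are illustrative names.
OUT_DIR="${OUT_DIR:-./collected_data_of_GPU_hang}"
mkdir -p "$OUT_DIR"

# Run a command and store its stdout and stderr in "$OUT_DIR/<label>.txt".
# If the command fails, its exit status is appended and collection continues.
save_output() {
    label=$1
    shift
    "$@" >"$OUT_DIR/$label.txt" 2>&1 || echo "exit status: $?" >>"$OUT_DIR/$label.txt"
}

# A few of the unprivileged commands from the list above.
save_output uname_m uname -m
save_output uname_r uname -r
save_output proc_cmdline cat /proc/cmdline
```

Privileged commands from the list can be passed the same way, for example `save_output dmesg sudo dmesg`. Keeping one file per command makes it easy to compare a capture taken before a hang with one taken after it.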
The /sys/class/drm/card0/error file alone:
0_content_of__sys_class_drm_card0_error.zip
All gathered data (including the error file above) is in the archive:
2020.08.12_-_18.27.14_collected_data_of_GPU_hang.zip
All gathered data from the same boot, before the GPU hang happened:
2020.08.12_-18.24.44_collected_data_of_GPU_hang-_the_same_boot_with_GPU_not_hanged_yet.zip