[2020.08.07-2] i915 GPU hang report on 5.8.0-1-MANJARO kernel
This is another case of a GPU hang on the same PC (same hardware and Linux distro). It is part of an ongoing 1.5-month run of PC freezes and GPU hangs.
Package updates since the previous report, #2306 (closed):
grep --text -iE 'installed|upgraded' '/var/log/pacman.log' | tail -n 50
...
<no updates since the previous report>
The next report is: #2312 (closed)
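The pacman.log check above prints the last 50 install/upgrade entries regardless of date. A minimal sketch for keeping only entries on or after a given day is shown here. It assumes the log lines start with an ISO 8601 timestamp such as `[2020-08-07T04:44:05+0300]` (older pacman versions used a different format), and the function name `pacman_changes_since` is hypothetical. Unlike the original `grep -i`, it matches only `[ALPM] installed` and `[ALPM] upgraded` entries.

```shell
# Hypothetical helper: read pacman.log text on stdin and print the
# "[ALPM] installed/upgraded" entries dated on or after the given day.
# Assumes each line starts with "[YYYY-MM-DDThh:mm:ss+zzzz]".
# Usage: pacman_changes_since YYYY-MM-DD < /var/log/pacman.log
pacman_changes_since() {
    since=$1
    awk -v since="$since" '
        /\[ALPM\] (installed|upgraded)/ {
            # Characters 2-11 of the line hold the date part of the timestamp.
            # ISO dates sort correctly as plain strings.
            if (substr($0, 2, 10) >= since) print
        }'
}
```

For example, `pacman_changes_since 2020-08-07 < /var/log/pacman.log` would list only entries logged on or after the day of this report.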
My PC has experienced more than 100 PC freezes and GPU hangs over the last 6 weeks, on every kernel family (4.19, 5.4, 5.7, 5.8-rc) available in the distro. 4.19 seems more stable and is usually (but far from always) able to reset the GPU and keep working without a reboot. The newer the kernel, the sooner the GPU hangs without a successful software reset (which the 4.19 kernel can do), or the PC freezes.
A PC freeze or GPU hang usually happens while semi-transparency, fade in/out, or blur effects are in action.
I have a feeling that a rapid series of GPU hangs leads to a PC freeze. If only one or two GPU hangs happen at once, the PC may or may not freeze.
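One way to check whether hangs really come in bursts is to count "GPU HANG" messages per minute in the journal. The sketch below assumes the `journalctl -o short-precise --no-hostname` line layout used in the excerpt later in this report (month, day, time, then `kernel:`). The function name `gpu_hangs_per_minute` is hypothetical.

```shell
# Hypothetical helper: read journal text on stdin and print, for every
# minute that contains at least one "GPU HANG" message, the minute and
# the number of hangs logged in it, in order of first appearance.
gpu_hangs_per_minute() {
    awk '
        /GPU HANG: ecode/ {
            # $1 = month, $2 = day, $3 = hh:mm:ss.micros; keep only hh:mm.
            key = $1 " " $2 " " substr($3, 1, 5)
            if (!(key in count)) order[++n] = key
            count[key]++
        }
        END {
            for (i = 1; i <= n; i++) print order[i], count[order[i]]
        }'
}
```

Possible usage: `journalctl -b -1 -o short-precise --no-hostname --dmesg | gpu_hangs_per_minute` to look at the previous boot, which is useful after a freeze that forced a reboot.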
Steps to reproduce the issue in this ticket
After the OS loaded, I logged in to a user session, opened the Opera web browser, and went to maps.google.com. I moved to another location and zoom level, and then pressed the control that shows Street View in Google Maps. The picture froze, with two pictures alternating at intervals of about 0.6-0.8 seconds: the same map view, one with the Street View layer and one without it.
One of the pictures on screen after the GPU hang:
journalctl excerpt:
Aug 07 04:44:05.080334 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
Aug 07 04:44:05.081173 kernel: i915 0000:00:02.0: [drm] opera[1204] context reset due to GPU hang
Aug 07 04:44:05.081790 kernel: i915 0000:00:02.0: [drm:__i915_request_reset.cold [i915]] context opera[1204]: guilty 1, banned
Aug 07 04:44:05.082366 kernel: i915 0000:00:02.0: [drm:__i915_request_reset.cold [i915]] client opera[1204]: gained 4 ban score, now 4
Aug 07 04:44:05.094650 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:85dffffb, in opera [1204]
Aug 07 04:44:05.094889 kernel: GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Aug 07 04:44:05.094913 kernel: Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/intel/issues/new.
Aug 07 04:44:05.094931 kernel: Please see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
Aug 07 04:44:05.094962 kernel: drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Aug 07 04:44:05.094979 kernel: The GPU crash dump is required to analyze GPU hangs, so please always attach it.
Aug 07 04:44:05.094997 kernel: GPU crash dump saved to /sys/class/drm/card0/error
Aug 07 04:44:05.095013 kernel: ------------[ cut here ]------------
Aug 07 04:44:05.095037 kernel: WARNING: CPU: 3 PID: 1204 at kernel/sched/core.c:4488 default_wake_function+0x16/0x30
Aug 07 04:44:05.095070 kernel: Modules linked in: snd_seq_dummy snd_hrtimer snd_seq fuse hid_logitech_hidpp joydev input_leds mousedev hid_logitech_dj snd_usb_audio snd_usbmidi_lib snd_hwdep snd_rawmidi snd_seq_device mc hid_generic snd_pcm snd_timer usbhid snd soundcore i915 rfkill squashfs x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel loop iTCO_wdt intel_pmc_bxt kvm iTCO_vendor_support ee1004 irqbypass crct10dif_pclmul crc32_pclmul intel_wmi_thunderbolt i2c_algo_bit ghash_clmulni_intel intel_rapl_msr nls_iso8859_1 nls_cp437 vfat aesni_intel fat crypto_simd cryptd glue_helper drm_kms_helper rapl r8169 cec i2c_i801 intel_cstate intel_uncore pcspkr realtek i2c_smbus rc_core intel_gtt syscopyarea processor_thermal_device intel_xhci_usb_role_switch libphy intel_pch_thermal intel_rapl_common roles sysfillrect intel_soc_dts_iosf sysimgblt fb_sys_fops wmi bmc150_accel_i2c bmc150_accel_core int3403_thermal int340x_thermal_zone industrialio_triggered_buffer kfifo_buf i2c_hid industrialio hid evdev
Aug 07 04:44:05.095674 kernel: int3400_thermal mac_hid acpi_thermal_rel drm sg crypto_user agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 xhci_pci xhci_pci_renesas crc32c_intel xhci_hcd
Aug 07 04:44:05.096376 kernel: CPU: 3 PID: 1204 Comm: opera Not tainted 5.8.0-1-MANJARO #1
Aug 07 04:44:05.096415 kernel: Hardware name: Default string Default string/Default string, BIOS 5.12 11/10/2018
Aug 07 04:44:05.096435 kernel: RIP: 0010:default_wake_function+0x16/0x30
Aug 07 04:44:05.096453 kernel: Code: e8 6f de 3d 00 eb 99 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f7 c2 fe ff ff ff 75 09 48 8b 7f 08 e9 0a f9 ff ff <0f> 0b 48 8b 7f 08 e9 ff f8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00
Aug 07 04:44:05.096477 kernel: RSP: 0000:ffffb45b001b4d50 EFLAGS: 00010086
Aug 07 04:44:05.096494 kernel: RAX: ffffffffab2e4c40 RBX: ffffb45b0127bd30 RCX: ffffb45b001b4d68
Aug 07 04:44:05.096511 kernel: RDX: 00000000fffffffb RSI: 0000000000000003 RDI: ffffb45b0127bd30
Aug 07 04:44:05.096528 kernel: RBP: ffff9f9379b3e568 R08: 0000000000000001 R09: 0000000000000001
Aug 07 04:44:05.096545 kernel: R10: ffff9f936583ab80 R11: 0000000000002400 R12: 0000000000000046
Aug 07 04:44:05.096567 kernel: R13: ffff9f9379b3e560 R14: ffffb45b001b4d68 R15: ffff9f931823c700
Aug 07 04:44:05.096585 kernel: FS: 00007fd912e06c40(0000) GS:ffff9f9381b80000(0000) knlGS:0000000000000000
Aug 07 04:44:05.096602 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 07 04:44:05.096620 kernel: CR2: 00007fa77e57cfd8 CR3: 00000008326bc002 CR4: 00000000003606e0
Aug 07 04:44:05.096636 kernel: Call Trace:
Aug 07 04:44:05.096652 kernel: <IRQ>
Aug 07 04:44:05.096672 kernel: autoremove_wake_function+0xe/0x30
Aug 07 04:44:05.096690 kernel: __i915_sw_fence_complete+0x156/0x1b0 [i915]
Aug 07 04:44:05.096707 kernel: dma_i915_sw_fence_wake_timer+0x2c/0x50 [i915]
Aug 07 04:44:05.096727 kernel: signal_irq_work+0x23e/0x350 [i915]
Aug 07 04:44:05.096748 kernel: irq_work_single+0x2c/0x40
Aug 07 04:44:05.096766 kernel: irq_work_run_list+0x2d/0x40
Aug 07 04:44:05.096800 kernel: irq_work_run+0x26/0x40
Aug 07 04:44:05.096822 kernel: __sysvec_irq_work+0x2d/0xf0
Aug 07 04:44:05.096844 kernel: sysvec_irq_work+0x41/0xe0
Aug 07 04:44:05.096871 kernel: asm_sysvec_irq_work+0x12/0x20
Aug 07 04:44:05.096896 kernel: RIP: 0010:tasklet_action_common.constprop.0+0x2a/0xb0
Aug 07 04:44:05.096919 kernel: Code: 0f 1f 44 00 00 41 55 41 54 49 89 fc 55 53 fa 66 0f 1f 44 00 00 48 8b 2f 48 89 7f 08 48 c7 07 00 00 00 00 fb 66 0f 1f 44 00 00 <48> 85 ed 74 6b 41 89 f5 eb 27 8b 43 10 85 c0 75 66 f0 48 0f ba 73
Aug 07 04:44:05.096942 kernel: RSP: 0000:ffffb45b001b4f68 EFLAGS: 00000286
Aug 07 04:44:05.096964 kernel: RAX: 0000000000000003 RBX: 0000000000000000 RCX: 000000000000001f
Aug 07 04:44:05.096986 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9f9381b98660
Aug 07 04:44:05.097009 kernel: RBP: ffff9f937df14390 R08: 0000000ba138d626 R09: 0000000000000000
Aug 07 04:44:05.097030 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff9f9381b98660
Aug 07 04:44:05.097052 kernel: R13: ffffffffac8050c0 R14: 0000000000000000 R15: 0000000000000024
Aug 07 04:44:05.097110 kernel: __do_softirq+0x10f/0x352
Aug 07 04:44:05.097133 kernel: ? handle_fasteoi_irq+0x210/0x210
Aug 07 04:44:05.097156 kernel: asm_call_on_stack+0x12/0x20
Aug 07 04:44:05.097178 kernel: </IRQ>
Aug 07 04:44:05.097201 kernel: do_softirq_own_stack+0x5f/0x80
Aug 07 04:44:05.097224 kernel: irq_exit_rcu+0xcb/0x120
Aug 07 04:44:05.097248 kernel: common_interrupt+0xd1/0x200
Aug 07 04:44:05.097270 kernel: ? asm_common_interrupt+0x8/0x40
Aug 07 04:44:05.097292 kernel: asm_common_interrupt+0x1e/0x40
Aug 07 04:44:05.097318 kernel: RIP: 0033:0x55ba3208e751
Aug 07 04:44:05.097346 kernel: Code: 45 d0 75 14 89 d8 48 81 c4 88 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 0c cf 9c 03 cc cc cc cc cc cc cc cc cc cc cc cc 55 <48> 89 e5 41 56 53 48 89 fb 48 63 4f 08 48 63 c6 48 01 c8 48 63 4f
Aug 07 04:44:05.097370 kernel: RSP: 002b:00007fffa204fad0 EFLAGS: 00000246
Aug 07 04:44:05.097392 kernel: RAX: 000023153cc45e00 RBX: 000023153cc45f18 RCX: 000000000000079b
Aug 07 04:44:05.097415 kernel: RDX: 000023153c79b180 RSI: 0000000000000000 RDI: 000023153cc45f00
Aug 07 04:44:05.097438 kernel: RBP: 00007fffa204fb20 R08: 0000000000000002 R09: 00007fd904019760
Aug 07 04:44:05.097461 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
Aug 07 04:44:05.097484 kernel: R13: 00007fffa204fd50 R14: 000023153cc45e00 R15: 000023153cc45f00
Aug 07 04:44:05.097507 kernel: ---[ end trace b5d17862ec8acc6e ]---
Aug 07 04:44:05.097529 kernel: [drm:drm_atomic_set_fb_for_plane [drm]] Set [FB:113] for [PLANE:47:cursor A] state 0000000018b9ea96
Aug 07 04:44:05.097554 kernel: i915 0000:00:02.0: [drm:intel_plane_atomic_calc_changes [i915]] [CRTC:51:pipe A] with [PLANE:47:cursor A] visible 1 -> 1, off 0, on 0, ms 0
Aug 07 04:44:05.097836 kernel: i915 0000:00:02.0: [drm:i915_gem_context_create_ioctl [i915]] HW context 13 created
Aug 07 04:44:05.102396 kernel: [drm:drm_atomic_set_fb_for_plane [drm]] Set [FB:113] for [PLANE:47:cursor A] state 00000000b3c9ac8e
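When comparing many reports like this one, the key line of the excerpt is the "GPU HANG: ecode ..." message. A minimal sketch for pulling the timestamp, error code, and offending process out of journal text follows. It assumes the message layout seen in the excerpt above (`GPU HANG: ecode X, in NAME [PID]`), which may differ on other kernel versions. The function name `summarize_gpu_hangs` is hypothetical.

```shell
# Hypothetical helper: read journal text on stdin and print one summary
# line per "GPU HANG" message: timestamp, ecode, process name and PID.
# Lines that do not match the assumed layout are silently skipped.
summarize_gpu_hangs() {
    sed -n -E 's/^(.*) kernel: .*GPU HANG: ecode ([0-9a-f:]+), in ([^ ]+) \[([0-9]+)\].*$/\1 ecode=\2 process=\3 pid=\4/p'
}
```

Possible usage, with the same journalctl options the collection script uses: `journalctl -b -o short-precise --no-hostname --dmesg | summarize_gpu_hangs`.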
How often do the steps listed above trigger the issue
The frequency of PC freezes and GPU hangs is close to the highest possible. It can happen on the logon screen without any user activity, or during GUI session actions, in the first, 5th, or 40th minute. The average is about 1-2 minutes. It is not tied to one specific action; it is generally unpredictable and has happened during many types of typical user activity, such as:
-) on the logon screen (without any user action, not even touching the mouse; seen about 2-3 times),
-) moving desktop icons,
-) opening the start menu,
-) opening a context menu,
-) moving the cursor in a text editor with the keyboard navigation keys,
-) browsing the system settings window,
-) typing text in a GUI terminal emulator,
-) installing updates in a GUI app or a GUI terminal emulator,
-) opening or browsing in the Opera web browser: viewing a list of GitLab commits, watching YouTube videos (not fullscreen, and without touching the keyboard or mouse for at least the last 1-2 minutes), and an extremely fast freeze/crash while browsing maps.google.com or maps.ya.ru,
etc.
Platform (CPU): Intel Core i5-8250U
System architecture (uname -m): x86_64
Kernel version (uname -r): 5.8.0-1-MANJARO
Linux distribution: Manjaro Linux (desktop environment: KDE)
Machine or motherboard model: Hystou Fanless Mini PC P03B-i5-8250U
Display connector: factory-made cable with connectors: HDMI (connected to PC) - DVI-D (connected to monitor)
Also:
KDE System Settings has the default Compositor settings.
The GPU settings file /etc/X11/xorg.conf.d/20-intel.conf is empty.
Error data was gathered from within the hung GUI session (without switching to tty2) by pressing a custom KDE global hotkey that runs the script collect_GPU_crash_data.zip, which collects:
# Collect main data
sudo cp /sys/class/drm/card0/error ...
sudo dmesg
journalctl -b -o short-precise --no-hostname --dmesg
journalctl -b -o short-precise --no-hostname
cat /proc/cmdline
# Collect supplementary data
xrandr --verbose
sudo dmidecode -t bios -t system -t baseboard -t chassis -t processor
mhwd -l -d
cp /etc/X11/xorg.conf.d/20-intel.conf ...
sudo lspci -vvv -G
sudo lspci -vvv -G -H1
sudo lspci -vvv -G -H2
lscpu
lsmod
modinfo i915
modinfo drm
modinfo drm_kms_helper
modinfo intel_gtt
modinfo i2c_algo_bit
sudo systool -v -m i915
sudo systool -v -m drm
sudo systool -v -m drm_kms_helper
sudo systool -v -m intel_gtt
sudo systool -v -m i2c_algo_bit
uname -m
uname -r
inxi -CIGMxxx --no-host
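The command list above could be wrapped in a small function so that a hotkey can run it in one step. The following is only a sketch and not the contents of collect_GPU_crash_data.zip: the function name, the output file names, and the choice to skip tools that are not installed are all assumptions, and only a few of the listed commands are included. Reading the real /sys/class/drm/card0/error normally requires root.

```shell
# Hypothetical sketch, not the attached script.
# Usage: collect_gpu_crash_data OUT_DIR [ERROR_FILE]
# ERROR_FILE defaults to the i915 crash dump path named in the kernel log.
collect_gpu_crash_data() {
    out_dir=$1
    error_file=${2:-/sys/class/drm/card0/error}

    mkdir -p "$out_dir" || return 1

    # Save the crash dump first: it is the one file the kernel message
    # explicitly asks for.
    cat "$error_file" > "$out_dir/card0_error.txt" || return 1

    # Kernel version, architecture, and boot parameters.
    uname -r > "$out_dir/uname_r.txt"
    uname -m > "$out_dir/uname_m.txt"
    if [ -r /proc/cmdline ]; then
        cat /proc/cmdline > "$out_dir/cmdline.txt"
    fi

    # Optional tools: record output only when the tool is installed, and
    # do not let a failing tool abort the rest of the collection.
    for tool in lscpu lsmod; do
        if command -v "$tool" >/dev/null 2>&1; then
            "$tool" > "$out_dir/$tool.txt" 2>&1 || true
        fi
    done
    return 0
}
```

Saving the dump before anything else is a deliberate choice in this sketch: if the session freezes partway through, the most important file is already on disk.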
/sys/class/drm/card0/error file alone:
0_content_of__sys_class_drm_card0_error.zip
The whole set of gathered data (including the error file above) is in the archive:
2020.08.07_-04.44.49_collected_data_of_GPU_crash-_GPU_hanged.zip
There is also data collected by the same script while the GPU was in a clean working state (not yet hung):
2020.08.07_-04.45.35_collected_data_of_GPU_crash-_a_further_boot_while_GPU_hang_not_happen.zip