[2020.08.06] i915 GPU hang report on 5.8.0-1-MANJARO kernel
This is another GPU hang on the same PC (same hardware and Linux distro). It is part of an ongoing 1.5-month run of PC freezes and GPU hangs.
Since the previous report, #2290 (closed), these packages have been updated:
grep --text -iE 'installed|upgraded' '/var/log/pacman.log' | tail -n 50
...
[2020-08-06T11:55:59+0000] [ALPM] upgraded protobuf (3.12.3-1 -> 3.12.4-1)
[2020-08-06T11:55:59+0000] [ALPM] upgraded android-tools (30.0.0-2 -> 30.0.3-1)
[2020-08-06T11:55:59+0000] [ALPM] upgraded lua (5.3.5-3 -> 5.4.0-2)
[2020-08-06T11:55:59+0000] [ALPM] upgraded highlight (3.57-1 -> 3.57-2)
[2020-08-06T11:55:59+0000] [ALPM] upgraded imlib2 (1.6.1-2 -> 1.7.0-1)
[2020-08-06T11:55:59+0000] [ALPM] upgraded libqalculate (3.12.0-1 -> 3.12.1-1)
[2020-08-06T11:56:00+0000] [ALPM] upgraded linux419 (4.19.136-1 -> 4.19.137-1)
[2020-08-06T11:56:01+0000] [ALPM] upgraded linux54 (5.4.55-1 -> 5.4.56-1)
[2020-08-06T11:56:01+0000] [ALPM] upgraded mpg123 (1.26.2-1 -> 1.26.3-1)
[2020-08-06T11:56:01+0000] [ALPM] upgraded perl-alien-build (2.26-3 -> 2.28-1)
[2020-08-06T11:56:01+0000] [ALPM] upgraded podofo (0.9.6-2 -> 0.9.6-3)
[2020-08-06T11:56:01+0000] [ALPM] upgraded tpm2-tss (2.4.1-1 -> 3.0.0-2)
[2020-08-06T11:56:01+0000] [ALPM] upgraded vlc (3.0.11-1 -> 3.0.11-2)
[2020-08-06T20:15:06+0000] [ALPM] upgraded lib32-zstd (1.4.4-2 -> 1.4.5-1)
[2020-08-06T20:15:06+0000] [ALPM] upgraded lib32-libva-mesa-driver (20.1.4-1 -> 20.1.5-1)
[2020-08-06T20:15:06+0000] [ALPM] upgraded mesa (20.1.4-3 -> 20.1.5-1)
[2020-08-06T20:15:07+0000] [ALPM] upgraded lib32-mesa (20.1.4-1 -> 20.1.5-1)
[2020-08-06T20:15:07+0000] [ALPM] upgraded lib32-mesa-vdpau (20.1.4-1 -> 20.1.5-1)
[2020-08-06T20:15:07+0000] [ALPM] upgraded lib32-vulkan-intel (20.1.4-1 -> 20.1.5-1)
[2020-08-06T20:15:07+0000] [ALPM] upgraded lib32-vulkan-radeon (20.1.4-1 -> 20.1.5-1)
[2020-08-06T20:15:07+0000] [ALPM] upgraded libva-mesa-driver (20.1.4-3 -> 20.1.5-1)
[2020-08-06T20:15:07+0000] [ALPM] upgraded mesa-vdpau (20.1.4-3 -> 20.1.5-1)
[2020-08-06T20:15:07+0000] [ALPM] upgraded python-psutil (5.7.1-1 -> 5.7.2-1)
[2020-08-06T20:15:07+0000] [ALPM] upgraded vulkan-intel (20.1.4-3 -> 20.1.5-1)
[2020-08-06T20:15:07+0000] [ALPM] upgraded vulkan-radeon (20.1.4-3 -> 20.1.5-1)
[2020-08-06T22:10:44+0000] [ALPM] upgraded opera (70.0.3728.71-1 -> 70.0.3728.95-1)
[2020-08-06T22:10:44+0000] [ALPM] upgraded opera-ffmpeg-codecs (83.0.4103.116-1 -> 84.0.4147.105-1)
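Most of the upgrades above are unrelated to graphics. A minimal sketch for narrowing such a listing down to the kernel and Mesa/Vulkan packages is shown here; the function name `gfx_upgrades` is hypothetical, and the package list in the pattern is an assumption taken from the names that appear in the log above:

```shell
# Hypothetical helper: read pacman.log lines on stdin and keep only
# "upgraded" entries for kernel packages and the Mesa/Vulkan stack.
# The package names are assumptions based on the log excerpt in this report.
gfx_upgrades() {
    grep -E '\[ALPM\] upgraded (linux[0-9]+|(lib32-)?(mesa|mesa-vdpau|libva-mesa-driver|vulkan-intel|vulkan-radeon)) '
}

# Example use (not executed here):
#   grep --text -iE 'installed|upgraded' /var/log/pacman.log | gfx_upgrades
```

For the listing above this would keep the linux419, linux54, and Mesa 20.1.4 -> 20.1.5 lines and drop entries such as lua or vlc.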
The next issue: #2306 (closed)
My PC has had more than 100 PC freezes and GPU hangs during the last 6 weeks, on every kernel 'family' (4.19, 5.4, 5.7, 5.8-rc) available in the distro. 4.19 appears more stable and is usually (but far from always) able to reset the GPU and continue working without a reboot. The newer the kernel, the sooner the GPU hangs without a successful software reset (which the 4.19 kernel can do), or the PC freezes.
A PC freeze or GPU hang usually happens while semi-transparency, fade in/out, or blur effects are in action.
I have a feeling that a rapid series of GPU hangs leads to a PC freeze. If only one or two GPU hangs happen 'at once', then the PC may or may not freeze.
Steps to reproduce the issue in this ticket
After the OS has loaded, log in to a user session. Open the Opera web browser. Go to the cs-online.club website to play the 3D game. Choose one of the servers. Join any team. Buy a smoke grenade and throw it very close by. As soon as the semi-transparent smoke appeared, the game sound became choppy and the picture froze. By pressing my hotkey I was able to run the script that collects the error data. The monitor kept showing the frozen game image until the screen went black when the script rebooted the PC (the reboot is the last line in the script).
Frozen picture on the screen looks like this:
journalctl excerpt:
Aug 06 23:53:31.932437 kernel: ------------[ cut here ]------------
Aug 06 23:53:31.932463 kernel: WARNING: CPU: 1 PID: 0 at kernel/sched/core.c:4488 default_wake_function+0x16/0x30
Aug 06 23:53:31.932487 kernel: Modules linked in: snd_seq_dummy snd_hrtimer snd_seq fuse hid_logitech_hidpp mousedev joydev input_leds hid_logitech_dj snd_usb_audio snd_usbmidi_lib snd_hwdep snd_rawmidi snd_seq_device mc snd_pcm snd_timer snd soundcore hid_generic usbhid i915 rfkill x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel squashfs kvm iTCO_wdt intel_pmc_bxt loop ee1004 iTCO_vendor_support i2c_algo_bit irqbypass crct10dif_pclmul intel_rapl_msr crc32_pclmul ghash_clmulni_intel aesni_intel intel_wmi_thunderbolt nls_iso8859_1 crypto_simd nls_cp437 cryptd glue_helper vfat fat rapl drm_kms_helper intel_cstate cec rc_core i2c_i801 intel_uncore r8169 pcspkr i2c_smbus realtek intel_gtt libphy syscopyarea processor_thermal_device sysfillrect intel_xhci_usb_role_switch intel_rapl_common sysimgblt intel_pch_thermal roles fb_sys_fops intel_soc_dts_iosf wmi bmc150_accel_i2c int3403_thermal bmc150_accel_core int340x_thermal_zone industrialio_triggered_buffer i2c_hid kfifo_buf hid industrialio evdev mac_hid
Aug 06 23:53:31.935825 kernel: int3400_thermal acpi_thermal_rel drm sg crypto_user agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 xhci_pci xhci_pci_renesas crc32c_intel xhci_hcd
Aug 06 23:53:31.935855 kernel: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.8.0-1-MANJARO #1
Aug 06 23:53:31.935874 kernel: Hardware name: Default string Default string/Default string, BIOS 5.12 11/10/2018
Aug 06 23:53:31.935890 kernel: RIP: 0010:default_wake_function+0x16/0x30
Aug 06 23:53:31.935908 kernel: Code: e8 6f de 3d 00 eb 99 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f7 c2 fe ff ff ff 75 09 48 8b 7f 08 e9 0a f9 ff ff <0f> 0b 48 8b 7f 08 e9 ff f8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00
Aug 06 23:53:31.935925 kernel: RSP: 0018:ffffbdf78013ce58 EFLAGS: 00010082
Aug 06 23:53:31.935946 kernel: RAX: ffffffffa10e4c40 RBX: ffffbdf780b27d30 RCX: ffffbdf78013ce70
Aug 06 23:53:31.935965 kernel: RDX: 00000000ffffff92 RSI: 0000000000000003 RDI: ffffbdf780b27d30
Aug 06 23:53:31.935980 kernel: RBP: ffffa14065e52568 R08: 0000000000006b66 R09: 0000000000000001
Aug 06 23:53:31.935999 kernel: R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000046
Aug 06 23:53:31.936015 kernel: R13: ffffa14065e52560 R14: ffffbdf78013ce70 R15: ffffa140b0bd2828
Aug 06 23:53:31.936029 kernel: FS: 0000000000000000(0000) GS:ffffa140c1a80000(0000) knlGS:0000000000000000
Aug 06 23:53:31.936051 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 06 23:53:31.936077 kernel: CR2: 000056157eb572f8 CR3: 00000001f2c0a001 CR4: 00000000003606e0
Aug 06 23:53:31.936096 kernel: Call Trace:
Aug 06 23:53:31.936112 kernel: <IRQ>
Aug 06 23:53:31.936127 kernel: autoremove_wake_function+0xe/0x30
Aug 06 23:53:31.936145 kernel: __i915_sw_fence_complete+0x156/0x1b0 [i915]
Aug 06 23:53:31.936166 kernel: ? i915_sw_fence_complete+0x20/0x20 [i915]
Aug 06 23:53:31.936184 kernel: ? i915_sw_fence_complete+0x20/0x20 [i915]
Aug 06 23:53:31.936202 kernel: call_timer_fn+0x2d/0x160
Aug 06 23:53:31.936218 kernel: ? i915_sw_fence_complete+0x20/0x20 [i915]
Aug 06 23:53:31.936232 kernel: __run_timers+0x130/0x290
Aug 06 23:53:31.936252 kernel: run_timer_softirq+0x2b/0x50
Aug 06 23:53:31.936267 kernel: __do_softirq+0x10f/0x352
Aug 06 23:53:31.936284 kernel: asm_call_on_stack+0x12/0x20
Aug 06 23:53:31.936304 kernel: </IRQ>
Aug 06 23:53:31.936322 kernel: do_softirq_own_stack+0x5f/0x80
Aug 06 23:53:31.936340 kernel: irq_exit_rcu+0xcb/0x120
Aug 06 23:53:31.936394 kernel: sysvec_apic_timer_interrupt+0x46/0xe0
Aug 06 23:53:31.936416 kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
Aug 06 23:53:31.936450 kernel: RIP: 0010:cpuidle_enter_state+0xb6/0x420
Aug 06 23:53:31.936469 kernel: Code: 80 76 82 5e e8 5b 3d 8e ff 49 89 c7 0f 1f 44 00 00 31 ff e8 8c 4b 8e ff 80 7c 24 0f 00 0f 85 06 02 00 00 fb 66 0f 1f 44 00 00 <45> 85 e4 0f 88 e9 01 00 00 49 63 d4 4c 2b 7c 24 10 48 8d 04 52 48
Aug 06 23:53:31.936490 kernel: RSP: 0018:ffffbdf7800e7e78 EFLAGS: 00000246
Aug 06 23:53:31.936509 kernel: RAX: ffffa140c1a80000 RBX: ffffa140c1ab6800 RCX: 000000000000001f
Aug 06 23:53:31.936527 kernel: RDX: 0000000000000000 RSI: ffffffffa236a0b2 RDI: ffffffffa2349f8f
Aug 06 23:53:31.936546 kernel: RBP: ffffffffa26c9bc0 R08: 00000012cf6312a7 R09: 0000000000000018
Aug 06 23:53:31.936573 kernel: R10: 0000000000001d33 R11: 00000000000006d1 R12: 0000000000000006
Aug 06 23:53:31.936594 kernel: R13: ffffa140c1ab6800 R14: 0000000000000006 R15: 00000012cf6312a7
Aug 06 23:53:31.936616 kernel: ? cpuidle_enter_state+0xa4/0x420
Aug 06 23:53:31.936631 kernel: cpuidle_enter+0x29/0x40
Aug 06 23:53:31.936649 kernel: do_idle+0x1fb/0x2c0
Aug 06 23:53:31.936673 kernel: cpu_startup_entry+0x19/0x20
Aug 06 23:53:31.936694 kernel: start_secondary+0x178/0x1c0
Aug 06 23:53:31.936713 kernel: secondary_startup_64+0xb6/0xc0
Aug 06 23:53:31.936731 kernel: ---[ end trace 894b70cb2f1e9bf3 ]---
Aug 06 23:53:31.945628 kernel: [drm:drm_atomic_state_default_clear [drm]] Clearing atomic state 0000000003a65cad
Aug 06 23:53:31.945722 kernel: [drm:__drm_atomic_state_free [drm]] Freeing atomic state 0000000003a65cad
Aug 06 23:53:34.516813 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
Aug 06 23:53:34.517252 kernel: i915 0000:00:02.0: [drm] opera[1193] context reset due to GPU hang
Aug 06 23:53:34.517486 kernel: i915 0000:00:02.0: [drm:__i915_request_reset.cold [i915]] context opera[1193]: guilty 1, banned
Aug 06 23:53:34.517691 kernel: i915 0000:00:02.0: [drm:__i915_request_reset.cold [i915]] client opera[1193]: gained 3 ban score, now 3
Aug 06 23:53:34.528927 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:87d77cf6, in opera [1193]
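The last few lines of the excerpt are the ones that identify the hang itself. As a rough sketch, they could be pulled out of a full journal dump, and the hangs counted, with two small filters. The function names are hypothetical, and the patterns are assumptions based only on the message wording seen in this excerpt (other kernel versions may word these messages differently):

```shell
# Hypothetical helper: keep only i915 engine-reset and GPU-hang lines
# from journal text read on stdin.
i915_hang_lines() {
    grep -E 'i915 .*(GPU HANG|context reset due to GPU hang|Resetting [a-z0-9]+ for)'
}

# Hypothetical helper: count "GPU HANG: ecode" events in journal text on stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the status clean.
count_gpu_hangs() {
    grep -c 'GPU HANG: ecode' || true
}

# Example use (not executed here), with the journalctl options used in this report:
#   journalctl -b -o short-precise --no-hostname --dmesg | i915_hang_lines
#   journalctl -b -o short-precise --no-hostname --dmesg | count_gpu_hangs
```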
How often do the steps listed above trigger the issue
The frequency of PC freezes and GPU hangs is close to the highest possible. It can happen on the logon screen without any user activity or during a GUI session: in the first, 5th, or 10th minute. The average is about 1-2 minutes. It is not tied to one specific action; it happens unexpectedly during many kinds of typical user activity, such as:
-) on the logon screen (without any user action, not even touching the mouse; seen about 2-3 times),
-) moving desktop icons,
-) opening the start menu,
-) opening a context menu,
-) moving the cursor in a text editor with the keyboard navigation keys,
-) browsing in the system settings window,
-) typing text in a GUI terminal emulator,
-) installing updates in a GUI app or a GUI terminal emulator,
-) opening or browsing in the Opera web browser: viewing a list of GitLab commits, watching YouTube videos (not fullscreen, and without touching the keyboard or mouse for at least the last 1-2 minutes), and an extremely fast freeze/crash while browsing maps.google.com or maps.ya.ru,
etc.
Platform (CPU): Intel Core i5-8250U
System architecture (uname -m): x86_64
Kernel version (uname -r): 5.8.0-1-MANJARO
Linux distribution: Manjaro Linux (desktop environment: KDE)
Machine or motherboard model: Hystou Fanless Mini PC P03B-i5-8250U
Display connector: factory-made cable, HDMI (connected to PC) to DVI-D (connected to monitor)
Also:
KDE System Settings has default Composer settings.
The GPU settings file /etc/X11/xorg.conf.d/20-intel.conf is empty.
The error data was gathered from within the hung GUI session (without switching to tty2) by pressing a custom KDE global hotkey, which runs the script collect_GPU_crash_data.zip. The script collects:
# Collect main data
sudo cp /sys/class/drm/card0/error ...
sudo dmesg
journalctl -b -o short-precise --no-hostname --dmesg
journalctl -b -o short-precise --no-hostname
cat /proc/cmdline
# Collect supplementary data
xrandr --verbose
sudo dmidecode -t bios -t system -t baseboard -t chassis -t processor
mhwd -l -d
cp /etc/X11/xorg.conf.d/20-intel.conf ...
sudo lspci -vvv -G
sudo lspci -vvv -G -H1
sudo lspci -vvv -G -H2
lscpu
lsmod
modinfo i915
modinfo drm
modinfo drm_kms_helper
modinfo intel_gtt
modinfo i2c_algo_bit
sudo systool -v -m i915
sudo systool -v -m drm
sudo systool -v -m drm_kms_helper
sudo systool -v -m intel_gtt
sudo systool -v -m i2c_algo_bit
uname -m
uname -r
inxi -CIGMxxx --no-host
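The actual script is in the attached archive and is not reproduced here. As a minimal sketch of how commands like those above could be captured one file per command, assuming a hypothetical helper named `collect` and a hypothetical output directory layout (neither is taken from the real script):

```shell
# Hypothetical helper, not the attached script: run one command and save its
# stdout and stderr to <outdir>/<label>.txt. Returns the command's exit status.
collect() {
    outdir=$1
    label=$2
    shift 2
    mkdir -p "$outdir" || return 1
    "$@" >"$outdir/$label.txt" 2>&1
}

# Example use (not executed here); the directory name is an assumption:
#   out="$HOME/gpu_crash_$(date +%Y.%m.%d_-%H.%M.%S)"
#   collect "$out" cmdline cat /proc/cmdline
#   collect "$out" uname_r uname -r
#   collect "$out" lsmod   lsmod
```

Writing each command's output to its own file keeps a failure of one command (for example a missing tool) from hiding the output of the others.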
/sys/class/drm/card0/error file alone:
0_content_of__sys_class_drm_card0_error.zip
All gathered data (including the error file above) is in the archive:
2020.08.06_-23.53.50_collected_data_of_GPU_crash-_GPU_hanged.zip
There is also data collected by the same script in a 'clean GPU work state' (GPU not yet hung):
2020.08.06_-23.55.19_collected_data_of_GPU_crash-_a_further_boot_while_GPU_hang_not_happen.zip