[2020.08.07-3] i915 GPU hang report on 5.8.0-1-MANJARO kernel
This is another GPU hang on the same PC (same hardware and Linux distribution). It is part of an ongoing 1.5-month run of PC freezes and GPU hangs.
Package updates since the previous report, #2307 (closed):
grep --text -iE 'installed|upgraded' '/var/log/pacman.log' | tail -n 50
...
<no updates since the previous report>
The next report: #2313 (closed)
Over the last 6 weeks my PC has had more than 100 freezes and GPU hangs, on every kernel 'family' (4.19, 5.4, 5.7, 5.8-rc) available in the distro. 4.19 appears more stable: it is usually (but far from always) able to reset the GPU and continue working without a reboot. The newer the kernel, the sooner the GPU hangs without a successful software reset (which the 4.19 kernel can do), or the PC freezes.
A PC freeze or GPU hang usually happens while semi-transparency, fade in/out, or blur effects are in action.
My impression is that a rapid series of GPU hangs leads to a PC freeze. If only one or two GPU hangs happen at once, the PC may or may not freeze.
Steps to reproduce the issue in this ticket
I left the PC disconnected from mains power for more than 1 hour. After the OS booted, I saw the usual GUI logon screen with the password field. Immediately afterwards the cursor stopped blinking and the picture froze. I had not touched the keyboard or mouse.
journalctl excerpt:
Aug 07 16:19:55.447241 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
Aug 07 16:19:55.448040 kernel: i915 0000:00:02.0: [drm] sddm-greeter[670] context reset due to GPU hang
Aug 07 16:19:55.448634 kernel: i915 0000:00:02.0: [drm:__i915_request_reset.cold [i915]] context sddm-greeter[670]: guilty 1, banned
Aug 07 16:19:55.449220 kernel: i915 0000:00:02.0: [drm:__i915_request_reset.cold [i915]] client sddm-greeter[670]: gained 4 ban score, now 4
Aug 07 16:19:55.454859 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:84df9ffc, in sddm-greeter [670]
Aug 07 16:19:55.455161 kernel: GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Aug 07 16:19:55.455182 kernel: Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/intel/issues/new.
Aug 07 16:19:55.455211 kernel: Please see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
Aug 07 16:19:55.455228 kernel: drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Aug 07 16:19:55.455243 kernel: The GPU crash dump is required to analyze GPU hangs, so please always attach it.
Aug 07 16:19:55.455258 kernel: GPU crash dump saved to /sys/class/drm/card0/error
Aug 07 16:19:55.455273 kernel: ------------[ cut here ]------------
Aug 07 16:19:55.455289 kernel: WARNING: CPU: 2 PID: 0 at kernel/sched/core.c:4488 default_wake_function+0x16/0x30
Aug 07 16:19:55.455305 kernel: Modules linked in: hid_logitech_hidpp input_leds mousedev joydev hid_logitech_dj hid_generic usbhid intel_xhci_usb_role_switch roles rfkill squashfs i915 snd_usb_audio snd_usbmidi_lib x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hwdep iTCO_wdt snd_rawmidi intel_pmc_bxt loop snd_seq_device iTCO_vendor_support ee1004 mc snd_pcm kvm snd_timer snd irqbypass i2c_algo_bit crct10dif_pclmul soundcore crc32_pclmul nls_iso8859_1 drm_kms_helper ghash_clmulni_intel nls_cp437 vfat intel_rapl_msr intel_wmi_thunderbolt aesni_intel cec fat crypto_simd cryptd glue_helper r8169 rapl rc_core intel_cstate intel_gtt i2c_i801 syscopyarea intel_uncore realtek sysfillrect sysimgblt pcspkr processor_thermal_device i2c_smbus libphy fb_sys_fops intel_rapl_common intel_pch_thermal intel_soc_dts_iosf wmi int3403_thermal int340x_thermal_zone i2c_hid hid evdev bmc150_accel_i2c mac_hid bmc150_accel_core industrialio_triggered_buffer kfifo_buf int3400_thermal industrialio acpi_thermal_rel drm
Aug 07 16:19:55.462036 kernel: sg crypto_user agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 xhci_pci xhci_pci_renesas crc32c_intel xhci_hcd
Aug 07 16:19:55.462119 kernel: CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.8.0-1-MANJARO #1
Aug 07 16:19:55.462158 kernel: Hardware name: Default string Default string/Default string, BIOS 5.12 11/10/2018
Aug 07 16:19:55.462180 kernel: RIP: 0010:default_wake_function+0x16/0x30
Aug 07 16:19:55.462198 kernel: Code: e8 6f de 3d 00 eb 99 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f7 c2 fe ff ff ff 75 09 48 8b 7f 08 e9 0a f9 ff ff <0f> 0b 48 8b 7f 08 e9 ff f8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00
Aug 07 16:19:55.462216 kernel: RSP: 0018:ffffaf3280178d80 EFLAGS: 00010086
Aug 07 16:19:55.462236 kernel: RAX: ffffffffae0e4c40 RBX: ffffaf3280447d30 RCX: ffffaf3280178d98
Aug 07 16:19:55.462254 kernel: RDX: 00000000fffffffb RSI: 0000000000000003 RDI: ffffaf3280447d30
Aug 07 16:19:55.462271 kernel: RBP: ffff97506a4ed568 R08: 0000000000000001 R09: 0000000000000001
Aug 07 16:19:55.462288 kernel: R10: ffff97505e5c7b80 R11: 0000000000000800 R12: 0000000000000046
Aug 07 16:19:55.462308 kernel: R13: ffff97506a4ed560 R14: ffffaf3280178d98 R15: ffff975063d50b80
Aug 07 16:19:55.462326 kernel: FS: 0000000000000000(0000) GS:ffff975081b00000(0000) knlGS:0000000000000000
Aug 07 16:19:55.462346 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 07 16:19:55.462363 kernel: CR2: 0000564869fa8ff8 CR3: 000000073840a006 CR4: 00000000003606e0
Aug 07 16:19:55.462380 kernel: Call Trace:
Aug 07 16:19:55.462427 kernel: <IRQ>
Aug 07 16:19:55.462444 kernel: autoremove_wake_function+0xe/0x30
Aug 07 16:19:55.462462 kernel: __i915_sw_fence_complete+0x156/0x1b0 [i915]
Aug 07 16:19:55.462479 kernel: dma_i915_sw_fence_wake_timer+0x2c/0x50 [i915]
Aug 07 16:19:55.462499 kernel: signal_irq_work+0x23e/0x350 [i915]
Aug 07 16:19:55.462521 kernel: irq_work_single+0x2c/0x40
Aug 07 16:19:55.462540 kernel: irq_work_run_list+0x2d/0x40
Aug 07 16:19:55.462559 kernel: irq_work_run+0x26/0x40
Aug 07 16:19:55.462576 kernel: __sysvec_irq_work+0x2d/0xf0
Aug 07 16:19:55.462594 kernel: sysvec_irq_work+0x41/0xe0
Aug 07 16:19:55.462612 kernel: asm_sysvec_irq_work+0x12/0x20
Aug 07 16:19:55.462638 kernel: RIP: 0010:__do_softirq+0x93/0x352
Aug 07 16:19:55.462656 kernel: Code: c7 44 24 28 0a 00 00 00 44 89 74 24 04 48 c7 c7 ef 9f 3d af e8 8e 19 bf ff 65 66 c7 05 b4 ba 22 51 00 00 fb 66 0f 1f 44 00 00 <48> c7 44 24 08 c0 50 60 af b8 ff ff ff ff 0f bc 44 24 04 83 c0 01
Aug 07 16:19:55.462681 kernel: RSP: 0018:ffffaf3280178f90 EFLAGS: 00000292
Aug 07 16:19:55.462699 kernel: RAX: 0000000000000002 RBX: ffff97507deb1f00 RCX: 000000000000001f
Aug 07 16:19:55.462719 kernel: RDX: 0000000000000000 RSI: ffffffffaf3d9fef RDI: ffffffffaf372046
Aug 07 16:19:55.462736 kernel: RBP: ffffaf32800efd60 R08: 0000000392682c0c R09: 0000000000000000
Aug 07 16:19:55.462753 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff97507b8e9a00
Aug 07 16:19:55.462769 kernel: R13: ffffffffae10c920 R14: 0000000000000001 R15: ffffaf3280179000
Aug 07 16:19:55.462785 kernel: ? handle_fasteoi_irq+0x210/0x210
Aug 07 16:19:55.462802 kernel: ? handle_irq_event+0x78/0xb0
Aug 07 16:19:55.462822 kernel: ? handle_fasteoi_irq+0x210/0x210
Aug 07 16:19:55.462843 kernel: asm_call_on_stack+0x12/0x20
Aug 07 16:19:55.462863 kernel: </IRQ>
Aug 07 16:19:55.462880 kernel: do_softirq_own_stack+0x5f/0x80
Aug 07 16:19:55.462897 kernel: irq_exit_rcu+0xcb/0x120
Aug 07 16:19:55.462916 kernel: common_interrupt+0xd1/0x200
Aug 07 16:19:55.462933 kernel: asm_common_interrupt+0x1e/0x40
Aug 07 16:19:55.462950 kernel: RIP: 0010:cpuidle_enter_state+0xb6/0x420
Aug 07 16:19:55.462966 kernel: Code: 80 76 82 51 e8 5b 3d 8e ff 49 89 c7 0f 1f 44 00 00 31 ff e8 8c 4b 8e ff 80 7c 24 0f 00 0f 85 06 02 00 00 fb 66 0f 1f 44 00 00 <45> 85 e4 0f 88 e9 01 00 00 49 63 d4 4c 2b 7c 24 10 48 8d 04 52 48
Aug 07 16:19:55.462983 kernel: RSP: 0018:ffffaf32800efe78 EFLAGS: 00000246
Aug 07 16:19:55.463000 kernel: RAX: ffff975081b00000 RBX: ffff975081b36800 RCX: 000000000000001f
Aug 07 16:19:55.463020 kernel: RDX: 0000000000000000 RSI: ffffffffaf36a0b2 RDI: ffffffffaf349f8f
Aug 07 16:19:55.463037 kernel: RBP: ffffffffaf6c9bc0 R08: 0000000392681fb4 R09: 0000000000000018
Aug 07 16:19:55.463053 kernel: R10: 00000000000040be R11: 0000000000037e53 R12: 0000000000000008
Aug 07 16:19:55.463069 kernel: R13: ffff975081b36800 R14: 0000000000000008 R15: 0000000392681fb4
Aug 07 16:19:55.463086 kernel: ? cpuidle_enter_state+0xa4/0x420
Aug 07 16:19:55.463102 kernel: cpuidle_enter+0x29/0x40
Aug 07 16:19:55.463118 kernel: do_idle+0x1fb/0x2c0
Aug 07 16:19:55.463135 kernel: cpu_startup_entry+0x19/0x20
Aug 07 16:19:55.463152 kernel: start_secondary+0x178/0x1c0
Aug 07 16:19:55.463174 kernel: secondary_startup_64+0xb6/0xc0
Aug 07 16:19:55.463194 kernel: ---[ end trace 01a534417aafeda8 ]---
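To pull only the hang-related lines out of a long journal, a small filter can help. The following is a minimal sketch, not part of the attached script: the function names `i915_hang_lines` and `i915_hang_count` are hypothetical, and the patterns are simply copied from the message texts in the excerpt above, so they may need adjusting for other kernel versions.

```sh
# Hypothetical helpers; patterns are taken from the messages in this report.

# Print only the i915 reset/hang lines from kernel log text on stdin.
i915_hang_lines() {
    grep -E 'i915 .*(GPU HANG: ecode|Resetting [a-z0-9]+ for|context reset due to GPU hang)' || true
}

# Count "GPU HANG" events in kernel log text on stdin (prints 0 if none).
i915_hang_count() {
    grep -c 'GPU HANG: ecode' || true
}
```

Possible usage, with the same journalctl options as in the collection list of this report: `journalctl -b -o short-precise --no-hostname --dmesg | i915_hang_lines`.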
How often do the steps listed above trigger the issue
The frequency of PC freezes and GPU hangs is close to the highest possible. One can happen on the logon screen without any user activity, or during a GUI session, in the 1st, 5th, or 40th minute. The average is about 1-2 minutes. It is not tied to one specific action; it is a general, unpredictable failure that has occurred during many kinds of typical user activity, such as:
-) on the logon screen (without any user action, not even touching the mouse; seen about 5-6 times),
-) moving desktop icons,
-) opening the start menu,
-) opening a context menu,
-) moving the cursor in a text editor with the keyboard navigation keys,
-) browsing the System Settings window,
-) typing text in a terminal emulator (GUI),
-) installing updates in a GUI app or GUI terminal emulator,
-) opening or browsing in the Opera web browser: viewing a list of GitLab commits, watching YouTube videos (not fullscreen, and without touching the keyboard or mouse for at least the last 1-2 minutes), and an extremely fast freeze/crash while browsing maps.google.com or maps.ya.ru,
etc.
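Since the hangs are not tied to one activity, it may be useful to tally which process the driver names in each "GPU HANG: ecode ..., in NAME [PID]" message. The following is a minimal sketch: the function name `i915_hang_by_process` is hypothetical, and it assumes the message format seen in the journalctl excerpt of this report and process names without spaces.

```sh
# Hypothetical helper: tally GPU hangs per process name.
# Reads kernel log text on stdin, prints "COUNT NAME" lines, highest count first.
i915_hang_by_process() {
    # Keep only the process name from each "GPU HANG" line,
    # then count identical names and sort by count.
    sed -n 's/.*GPU HANG: ecode [^,]*, in \([^ ]*\) \[[0-9]*].*/\1/p' |
        sort | uniq -c | sort -rn | awk '{print $1, $2}'
}
```

Possible usage: `journalctl -b -o short-precise --no-hostname --dmesg | i915_hang_by_process`. The process named in the message is the one whose context was reset, which is not necessarily the cause of the hang.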
Platform (CPU): Intel Core i5-8250U
System architecture (uname -m): x86_64
Kernel version (uname -r): 5.8.0-1-MANJARO
Linux distribution: Manjaro Linux (desktop environment: KDE)
Machine or motherboard model: Hystou Fanless Mini PC P03B-i5-8250U
Display connector: factory-made cable, HDMI (connected to PC) to DVI-D (connected to monitor)
Also:
KDE System Settings has the default Compositor settings.
The GPU settings file /etc/X11/xorg.conf.d/20-intel.conf is empty.
Error data was gathered after switching to tty2, using collect_GPU_crash_data.zip, which collects:
# Collect main data
sudo cp /sys/class/drm/card0/error ...
sudo dmesg
journalctl -b -o short-precise --no-hostname --dmesg
journalctl -b -o short-precise --no-hostname
cat /proc/cmdline
# Collect supplementary data
xrandr --verbose
sudo dmidecode -t bios -t system -t baseboard -t chassis -t processor
mhwd -l -d
cp /etc/X11/xorg.conf.d/20-intel.conf ...
sudo lspci -vvv -G
sudo lspci -vvv -G -H1
sudo lspci -vvv -G -H2
lscpu
lsmod
modinfo i915
modinfo drm
modinfo drm_kms_helper
modinfo intel_gtt
modinfo i2c_algo_bit
sudo systool -v -m i915
sudo systool -v -m drm
sudo systool -v -m drm_kms_helper
sudo systool -v -m intel_gtt
sudo systool -v -m i2c_algo_bit
uname -m
uname -r
inxi -CIGMxxx --no-host
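The contents of the attached script are not shown here, so the following is only a sketch of how such a list of commands could be captured one file per command. The helper name `collect` and the output layout are assumptions for illustration. It checks only the first word of the command, so for `sudo ...` lines it verifies that `sudo` exists, not the tool behind it.

```sh
# Hypothetical helper: collect OUTDIR NAME COMMAND [ARGS...]
# Runs COMMAND and saves stdout+stderr to OUTDIR/NAME.txt.
# A missing tool or a failing command is recorded in the file
# instead of aborting the whole collection run.
collect() {
    outdir=$1
    name=$2
    shift 2
    mkdir -p "$outdir" || return 1
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >"$outdir/$name.txt" 2>&1 ||
            echo "exit status: $?" >>"$outdir/$name.txt"
    else
        echo "not installed: $1" >"$outdir/$name.txt"
    fi
}
```

Possible usage (the directory name `gpu_report` is arbitrary): `collect gpu_report uname_r uname -r`, `collect gpu_report lsmod lsmod`, `collect gpu_report card0_error sudo cat /sys/class/drm/card0/error`.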
/sys/class/drm/card0/error file alone: 0_content_of__sys_class_drm_card0_error.zip
All gathered data (including the error file above) is in the archive: