[2020.08.07-5] i915 GPU hang report on 5.8.0-1-MANJARO kernel
This is another GPU hang on the same PC (same hardware and Linux distribution). It is part of an ongoing, roughly 1.5-month-long series of PC freezes and GPU hangs.
Since the previous report, #2313 (closed), the following packages were updated:
grep --text -iE 'installed|upgraded' '/var/log/pacman.log' | tail -n 50
...
[2020-08-07T16:43:03+0000] [ALPM] upgraded linux419 (4.19.137-1 -> 4.19.138-1)
[2020-08-07T16:43:04+0000] [ALPM] upgraded linux54 (5.4.56-1 -> 5.4.57-1)
Follow-up report: #2315 (closed)
Over the last 6 weeks my PC has had more than 100 PC freezes and GPU hangs combined, on every kernel series available in the distribution (4.19, 5.4, 5.7, 5.8-rc). 4.19 appears to be the most stable: it is usually (but far from always) able to reset the GPU and keep working without a reboot. The newer the kernel, the sooner I get either a GPU hang that is not recovered by a software reset (which 4.19 can usually do) or a complete PC freeze.
A PC freeze or GPU hang usually happens while semi-transparency, fade-in/fade-out or blur effects are active.
My impression is that a rapid series of GPU hangs leads to a PC freeze. If only one or two GPU hangs happen 'at once', the PC may or may not freeze.
Steps to reproduce the issue in this ticket
I had been using the PC for about 15 minutes (a few minutes of a YouTube video, then about 7-8 minutes of the online 3D in-browser game cs-online.club). After playing on one server, I left it, chose another one from the server list, and the resource-loading phase started. It was followed by the "Connecting..." phase, and the picture froze at the "Connected!" phase. I noticed that the mouse pointer was still visible and could be moved as usual. I tried to open a new tab in the web browser (Opera) and I think it did open, but the picture on the screen stayed the same as before the new tab was opened (I believe the tab was not shown only because of the GPU hang). Then I pressed the hotkey to collect the error data.
journalctl excerpt:
Aug 07 20:22:06.863624 kernel: [drm:drm_atomic_get_crtc_state [drm]] Added [CRTC:51:pipe A] 0000000012c28cc1 state to 00000000df95a597
Aug 07 20:22:06.863655 kernel: [drm:drm_atomic_get_plane_state [drm]] Added [PLANE:31:plane 1A] 0000000016250959 state to 00000000df95a597
Aug 07 20:22:06.863678 kernel: [drm:drm_atomic_set_fb_for_plane [drm]] Set [FB:118] for [PLANE:31:plane 1A] state 0000000016250959
Aug 07 20:22:06.863708 kernel: [drm:drm_atomic_check_only [drm]] checking 00000000df95a597
Aug 07 20:22:06.863733 kernel: i915 0000:00:02.0: [drm:intel_plane_atomic_calc_changes [i915]] [CRTC:51:pipe A] with [PLANE:31:plane 1A] visible 1 -> 1, off 0, on 0, ms 0
Aug 07 20:22:06.863985 kernel: i915 0000:00:02.0: [drm:intel_atomic_get_global_obj_state [i915]] Added new global object 00000000ba9237e9 state 00000000609fb15a to 00000000df95a597
Aug 07 20:22:06.864186 kernel: [drm:drm_atomic_nonblocking_commit [drm]] committing 00000000df95a597 nonblocking
Aug 07 20:22:07.514946 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
Aug 07 20:22:07.516513 kernel: i915 0000:00:02.0: [drm] opera[1243] context reset due to GPU hang
Aug 07 20:22:07.517447 kernel: i915 0000:00:02.0: [drm:__i915_request_reset.cold [i915]] context opera[1243]: guilty 1, banned
Aug 07 20:22:07.517998 kernel: i915 0000:00:02.0: [drm:__i915_request_reset.cold [i915]] client opera[1243]: gained 3 ban score, now 3
Aug 07 20:22:07.523595 kernel: [drm:drm_atomic_state_default_clear [drm]] Clearing atomic state 00000000df95a597
Aug 07 20:22:07.523782 kernel: [drm:__drm_atomic_state_free [drm]] Freeing atomic state 00000000df95a597
Aug 07 20:22:07.547509 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:85dffdfb, in opera [1243]
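To pull only the hang-and-reset sequence out of a longer kernel log, a small filter can help. This is a minimal sketch, assuming bash and grep with -E support; filter_i915_hang_lines is a hypothetical helper name, and its patterns are copied from the message texts in the excerpt above, so messages worded differently by other kernel versions may need extra patterns.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original report): keep only the i915
# hang/reset lines from kernel log text read on stdin. The patterns are taken
# from the journal excerpt above; other kernel versions may word them differently.
filter_i915_hang_lines() {
  grep -E 'Resetting [[:alnum:]]+ for|context reset due to GPU hang|guilty [0-9]|ban score|GPU HANG: ecode'
}
```

For example, `journalctl -k -b -1 -o short-precise --no-hostname | filter_i915_hang_lines` would show the sequence from the previous boot, which is handy after a forced reboot following a freeze.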
How often do the steps listed above trigger the issue
The frequency of PC freezes of unknown cause (a series of consecutive GPU hangs is suspected) and of GPU hangs logged in the systemd journal is close to the highest possible. It can happen on the login screen without any user activity, or during a GUI session: in the 1st, the 5th or the 40th minute. On average it happens after about 1-2 minutes. It is not tied to one specific action; it happens unexpectedly during many kinds of typical user activity, such as:
-) on the login screen (without any user action, not even touching the mouse; seen about 5-6 times),
-) moving desktop icons,
-) opening the start menu,
-) opening a context menu,
-) moving the cursor in a text editor with the keyboard navigation keys,
-) browsing the System Settings window,
-) typing text in a (GUI) terminal emulator,
-) installing updates in a GUI app or a GUI terminal emulator,
-) opening or browsing in the Opera web browser: viewing a list of GitLab commits, filling in the description of an issue on this gitlab.freedesktop.org, watching YouTube videos (not fullscreen, without touching the keyboard or mouse for at least the last 1-2 minutes); the freeze/crash comes extremely fast while browsing maps.google.com or maps.ya.ru,
etc.
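The "about 1-2 minutes" estimate could be backed by numbers from the journal. A minimal sketch, assuming bash and awk: the hypothetical hang_interval_stats function reads `journalctl -o short-unix` output (whose first field is seconds since the epoch), counts the "GPU HANG: ecode" lines and prints the mean time between consecutive ones. Hard freezes that never reach the journal are not counted, so the result only reflects the hangs that were actually logged.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original report): count logged i915
# "GPU HANG" lines and report the mean interval between consecutive ones.
# Expects journalctl -o short-unix output on stdin (field 1 = epoch seconds).
hang_interval_stats() {
  awk '/GPU HANG: ecode/ {
         t = $1 + 0                      # timestamp of this hang, in seconds
         if (n > 0) sum += t - prev      # add the gap since the previous hang
         prev = t
         n++
       }
       END {
         if (n < 2)
           printf "hangs=%d mean_interval_s=n/a\n", n
         else
           printf "hangs=%d mean_interval_s=%.1f\n", n, sum / (n - 1)
       }'
}
```

For the current boot this would be `journalctl -k -o short-unix --no-hostname | hang_interval_stats`; adding `-b -1` looks at the previous boot instead.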
Platform (CPU): Intel Core i5-8250U
System architecture (uname -m): x86_64
Kernel version (uname -r): 5.8.0-1-MANJARO
Linux distribution: Manjaro Linux (desktop environment: KDE)
Machine or motherboard model: Hystou Fanless Mini PC P03B-i5-8250U
Display connector: factory-made cable with an HDMI connector (connected to the PC) and a DVI-D connector (connected to the monitor)
Also: KDE System Settings has the default Compositor settings.
The GPU settings file /etc/X11/xorg.conf.d/20-intel.conf is empty.
The error data was gathered in the current, hung GUI user session (without switching to the tty2 text console) with the script collect_GPU_crash_data.zip, which collects:
# Collect main data
sudo cp /sys/class/drm/card0/error ...
sudo dmesg
journalctl -b -o short-precise --no-hostname --dmesg
journalctl -b -o short-precise --no-hostname
cat /proc/cmdline
# Collect supplementary data
xrandr --verbose
sudo dmidecode -t bios -t system -t baseboard -t chassis -t processor
mhwd -l -d
cp /etc/X11/xorg.conf.d/20-intel.conf ...
sudo lspci -vvv -G
sudo lspci -vvv -G -H1
sudo lspci -vvv -G -H2
lscpu
lsmod
modinfo i915
modinfo drm
modinfo drm_kms_helper
modinfo intel_gtt
modinfo i2c_algo_bit
sudo systool -v -m i915
sudo systool -v -m drm
sudo systool -v -m drm_kms_helper
sudo systool -v -m intel_gtt
sudo systool -v -m i2c_algo_bit
uname -m
uname -r
tty
inxi -CIGMxxx --no-host
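For anyone wanting to reproduce a similar collection, here is a trimmed-down, hypothetical sketch in bash; it is not the contents of collect_GPU_crash_data.zip, and run_to_file and collect_gpu_data are made-up names. It runs a subset of the commands listed above, writes each command's stdout and stderr to its own file, and records a failing or missing tool instead of aborting. Most root-only items (dmesg, dmidecode, lspci -vvv, systool) are left out; the error state is still attempted, but since the list above uses sudo for it, reading it as a normal user is expected to fail and that failure is recorded.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the original collect_GPU_crash_data script.

# run_to_file OUTDIR NAME CMD [ARGS...]
# Saves stdout+stderr of CMD to OUTDIR/NAME.txt; on failure, appends the exit
# status instead of stopping the whole collection.
run_to_file() {
  local outdir="$1" name="$2"
  shift 2
  "$@" >"$outdir/$name.txt" 2>&1 || echo "exit status $?" >>"$outdir/$name.txt"
}

# collect_gpu_data OUTDIR: gather a subset of the data listed above into OUTDIR.
collect_gpu_data() {
  local outdir="$1"
  if [ -z "$outdir" ]; then
    echo "usage: collect_gpu_data OUTDIR" >&2
    return 2
  fi
  mkdir -p "$outdir" || return 1

  # Main data (the error state is root-only; without sudo the failure is recorded)
  run_to_file "$outdir" card0_error cat /sys/class/drm/card0/error
  run_to_file "$outdir" journal_dmesg journalctl -b -o short-precise --no-hostname --dmesg
  run_to_file "$outdir" proc_cmdline cat /proc/cmdline
  # Supplementary data
  run_to_file "$outdir" lscpu lscpu
  run_to_file "$outdir" lsmod lsmod
  run_to_file "$outdir" modinfo_i915 modinfo i915
  run_to_file "$outdir" uname_-m uname -m
  run_to_file "$outdir" uname_-r uname -r
}
```

Calling it as `collect_gpu_data "$(date +%Y.%m.%d_-%H.%M.%S)_collected_data_of_GPU_crash"` gives a directory name in the same style as the archive names below; packing the directory into an archive is left out of the sketch.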
The /sys/class/drm/card0/error file alone:
0_content_of__sys_class_drm_card0_error.zip
All gathered data (including the error file above) is in the archive:
2020.08.07_-20.22.47_collected_data_of_GPU_crash-_GPU_hang.zip
The same script also gathered data on the next boot, before any GPU hang had occurred:
2020.08.07_-20.23.41_collected_data_of_GPU_crash-_the_next_boot__GPU_not_hanged_yet.zip