TGL i7-1185G7E GPU hang and sometimes the computer becomes unresponsive (i.e. CPU lock-up).
We are frequently seeing GPU hangs on several units with a TGL CPU (i7-1185G7E). On our units, we can reproduce the issue by repeatedly running several instances of Intel's sample_multi_transcode test tool (see run_tests.sh).
The test script uses these commands:
test.sh
#!/bin/bash
while true; do
/usr/share/mfx/samples/sample_multi_transcode -i::h264 test.h264 -o::raw /dev/null -w 1920 -h 1080 -hw -gpucopy::on
done
run_tests.sh
#!/bin/bash
./test.sh >/dev/null 2>/dev/null &
./test.sh >/dev/null 2>/dev/null &
./test.sh >/dev/null 2>/dev/null &
./test.sh >/dev/null 2>/dev/null &
./test.sh >/dev/null 2>/dev/null &
./test.sh >/dev/null 2>/dev/null &
cat /dev/random > /dev/null &
sudo intel_gpu_top
The test clip is a short AVC clip: test.h264
It uses Intel's sample_multi_transcode tool to decode and scale video. Six instances of the tool run concurrently, each restarting as soon as it finishes.
It also runs "cat /dev/random > /dev/null" to put some CPU load on one processor. The "intel_gpu_top" command prints the GPU usage.
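The six identical launch lines in run_tests.sh could also be written as a loop, which makes it easier to vary the instance count when narrowing down the hang. This is only a sketch: launch_instances is a hypothetical helper name and is not part of the scripts we actually ran.

```shell
#!/bin/bash
# Hypothetical helper (not in the original scripts):
# start COUNT background copies of a command and print one PID per line.
launch_instances() {
    li_count=$1
    shift
    li_i=0
    while [ "$li_i" -lt "$li_count" ]; do
        # Discard output, as run_tests.sh does for each test.sh instance.
        "$@" >/dev/null 2>&1 &
        echo "$!"
        li_i=$((li_i + 1))
    done
}
```

For the setup described above, the call would be "launch_instances 6 ./test.sh", followed by the same "cat /dev/random" and "sudo intel_gpu_top" lines as before.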
After running the test for several hours (usually between 24 and 48 hours), the screen will freeze and the GPU will hang. When it hangs, the intel_gpu_top tool will stop updating the screen. Sometimes, the entire system becomes unresponsive and the mouse and keyboard also stop working. A hard reboot is required to recover from this error.
We ran this test on several units with the 6.0 kernel from drm-tip (built on Sept 2, 2022). We can consistently reproduce the issue on all 12 of our units.
On one of our units, the GPU hung, but the mouse and keyboard were still working. We captured the GPU error log on this unit.
This is the error file when the GPU hung (sudo cat /sys/class/drm/card0/error > error): error
This is the dmesg log. The log file got truncated because it was very large: dmesg.log
This is the output from dmidecode: dmidecode.txt
System architecture: ("uname -m") x86_64
Kernel version: ("uname -r") 6.0.0-rc3+
Display connector: HDMI
On another unit, the entire system became unresponsive. These messages were printed on the screen:
Note: We couldn't capture the GPU error on this unit because it had locked up.
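One way to improve the chances of getting an error state from a unit that later locks up would be to poll the error node during the test and save it as soon as it contains something. The following is a rough sketch, not something we ran for this report. It assumes that the i915 error node reads "No error state collected" while no hang has been recorded, and save_gpu_error is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: save_gpu_error SRC DEST
# Copies SRC to DEST only if SRC holds a real error state.
# Returns 0 if an error state was saved, 1 otherwise.
save_gpu_error() {
    sge_src=$1
    sge_dest=$2
    [ -r "$sge_src" ] || return 1
    # Copy first so the content is inspected from a regular file.
    cat "$sge_src" > "$sge_dest.tmp" 2>/dev/null || { rm -f "$sge_dest.tmp"; return 1; }
    # Assumption: this placeholder text means no hang has been recorded.
    if [ ! -s "$sge_dest.tmp" ] ||
       head -n 1 "$sge_dest.tmp" | grep -q "No error state collected"; then
        rm -f "$sge_dest.tmp"
        return 1
    fi
    mv "$sge_dest.tmp" "$sge_dest"
    # Flush to disk in case the machine locks up shortly afterwards.
    sync
}
```

Run as root alongside the test, it could be polled like this: "while ! save_gpu_error /sys/class/drm/card0/error gpu_error; do sleep 10; done". Whether the error state is written before a full lock-up is not something we have verified.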
Please let me know if you have any questions. This issue is a showstopper for us because our servers lock up within a couple of days. We need to use the Intel Media SDK to decode and scale video.
Note: We have also created a Debian live image to easily reproduce this issue (via a USB key). We can share this image if it will be useful.
Note: We have also reproduced the issue with the 5.10, 5.18, and 5.19 kernels.
Thanks, Ralph