[iGVT-g][SKL] GPU Hang and iGVT-g guest crash under certain loads
@FurretUber
Submitted by FurretUber Assigned to Terrence Xu
Link to original bug (#107475)
Description
Created attachment 140955
/sys/class/drm/card0/error
When running a Windows 10 guest with Intel GVT-g and dma-buf, many graphical workloads stutter, some applications crash, and some consistently crash the guest with a blue screen and cause a GPU hang on the host.
To reproduce the problem consistently, a Windows 10 1803 guest with dma-buf using the Intel HD Graphics driver version 24.20.100.6194 is required. The QEMU command line used to start the guest is:
env PULSE_LATENCY_MSEC=10 QEMU_AUDIO_ADC_VOICES=0 QEMU_AUDIO_DRV=pa \
nice -n -15 \
qemu-system-x86_64 -name "Windows 10" -k pt-br -nodefaults \
-mem-prealloc -mem-path /dev/hugepages/libvirt/qemu \
-hda redm.qcow2 \
-hdb redm-D.qcow2 \
-enable-kvm -cpu host -smp cores=2,threads=2 -m 4G \
-device usb-tablet,id=tablet -device usb-host,vendorid=0x1b3f,id=soundcardusb \
-vga none -monitor vc -serial stdio -display gtk,gl=on \
-device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:00:02.0/123f09b0-4c00-11e8-a6ca-f3c21e47e012,rombar=0,x-igd-opregion=on,display=on,addr=0x3,id=iHD520 \
-cdrom "mídia.iso" \
-machine kernel_irqchip=on -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -M pc,usb=true \
-netdev bridge,id=hostnet0,br=virbr0 -device e1000,netdev=hostnet0,id=net0,mac=aa:bb:cc:dd:ee:11,addr=0x8
One application that consistently triggers the blue screen and GPU hang is a game that can be downloaded at: https://www.vector.co.jp/download/file/win95/game/fh310532.html Although it is a very light workload, it consistently crashes the guest, particularly in the second stage.
The guest stutters noticeably, and the stuttering gets progressively worse until the guest crashes with a blue screen (not visible due to the lack of VGA modes) and the host suffers a GPU hang, with the following in dmesg:
[ 1748.473459] [drm] GPU HANG: ecode 9:0:0xfacfffff, reason: Hang on rcs0, action: reset
[ 1748.473461] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 1748.473462] [drm] Please file a new bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 1748.473462] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 1748.473463] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 1748.473464] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 1748.473484] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[ 1748.998085] gvt: vgpu 1: untracked MMIO 0000207c len 4
[ 1749.003878] gvt: vgpu 1: untracked MMIO 0000207c len 4
[ 1749.011220] gvt: vgpu 1: untracked MMIO 0000207c len 4
[ 1749.019816] gvt: vgpu 1: untracked MMIO 0000207c len 4
Dozens of untracked MMIO messages with the same address and length appear, followed by the same message with different addresses.
These issues weren't observed with the 15.45 drivers (the certified ones), but those are unusable on Windows 10 because it automatically updates the driver to a non-functional version.
The Windows 10 guest version is 1803 and it uses Intel driver version 24.20.100.6194. The previous driver version, 24.20.100.6136, does not have these issues, so I think 24.20.100.6194 has a regression that makes it unusable on iGVT-g guests.
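Since the untracked MMIO messages repeat many times, a per-address count may be easier to read than the raw log. The following is only a sketch: the function name summarize_untracked_mmio is hypothetical, and it assumes the messages keep the exact "gvt: vgpu N: untracked MMIO ADDR len LEN" layout shown in the dmesg excerpt above.

```shell
# Hypothetical helper: summarize "untracked MMIO" messages from a kernel log.
# Reads log text on stdin and prints one line per (vGPU, address, length)
# in the form "<count> vgpu <id> <address> len <n>", most frequent first.
# Assumes the message layout seen in this report's dmesg excerpt.
summarize_untracked_mmio() {
    sed -n 's/.*gvt: vgpu \([0-9][0-9]*\): untracked MMIO \([0-9a-fA-F][0-9a-fA-F]*\) len \([0-9][0-9]*\).*/vgpu \1 \2 len \3/p' \
        | sort \
        | uniq -c \
        | sort -rn \
        | awk '{ $1 = $1; print }'   # normalize the whitespace added by uniq -c
}
```

It could be used as `dmesg | summarize_untracked_mmio` on the host after the hang.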
System specifications:
Processor: Intel Core i3-6100U;
Video: Intel HD Graphics 520;
Architecture: amd64;
Mesa: 18.2.0-devel (git-f310e86a42);
Kernel version: 4.17.11-lowlatency;
Distribution: Xubuntu 18.04.1 amd64;
QEMU version: 2.12.91 (v3.0.0-rc3-dirty).
Attachment 140955, "/sys/class/drm/card0/error":
erro_gvt-g.txt
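For anyone reproducing this, the crash dump that dmesg points to can be copied into a regular file before attaching it. This is a minimal sketch: the function name save_gpu_error_state and the default output file name are hypothetical, the source path is the one printed in the dmesg output above, and reading it usually requires root.

```shell
# Hypothetical helper: copy the i915 error state into a regular file.
#   $1: source path (defaults to the path reported in dmesg in this report)
#   $2: destination file (default name is an arbitrary, timestamped choice)
# Prints the destination path on success; returns 1 if the source is unreadable.
save_gpu_error_state() {
    src=${1:-/sys/class/drm/card0/error}
    dest=${2:-gpu_error_$(date +%Y%m%d_%H%M%S).txt}
    if [ ! -r "$src" ]; then
        echo "cannot read $src (try running as root)" >&2
        return 1
    fi
    cat "$src" > "$dest" && printf '%s\n' "$dest"
}
```

For example, `save_gpu_error_state /sys/class/drm/card0/error erro_gvt-g.txt` would produce a file like the one attached here.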