Regression - Fence fallback timer expired on ring - Virtualized environment
Brief summary of the problem:
Context: I am using QubesOS (based on the Xen hypervisor) and I pass my GPU (AMD RX 580) through to a VM. The VM is a Linux HVM (Arch Linux). I was using kernel 5.4 LTS in the VM and everything worked fine; everything works fine up to and including kernel 5.6.
I upgraded the VM kernel to a more recent version, and the GPU became unusably slow, with many errors showing up in the kernel log. The issue can always be reproduced: start the VM, and the errors always show up in the kernel log. If you then try to use the GPU (for example by starting xorg), everything is unusably slow.
[ 8.051268] [drm] Fence fallback timer expired on ring gfx
[ 8.561265] [drm] Fence fallback timer expired on ring comp_1.0.0
[ 9.071264] [drm] Fence fallback timer expired on ring comp_1.1.0
[ 9.581239] [drm] Fence fallback timer expired on ring comp_1.2.0
[ 10.091281] [drm] Fence fallback timer expired on ring comp_1.3.0
[ 10.601270] [drm] Fence fallback timer expired on ring comp_1.0.1
[ 11.111265] [drm] Fence fallback timer expired on ring comp_1.1.1
[ 11.621354] [drm] Fence fallback timer expired on ring comp_1.2.1
[ 12.131265] [drm] Fence fallback timer expired on ring comp_1.3.1
[ 12.641289] [drm] Fence fallback timer expired on ring sdma0
[ 13.151264] [drm] Fence fallback timer expired on ring sdma1
[ 13.681279] [drm] Fence fallback timer expired on ring uvd
[ 14.191309] [drm] Fence fallback timer expired on ring uvd_enc0
[ 14.701269] [drm] Fence fallback timer expired on ring uvd_enc1
[ 15.311266] [drm] Fence fallback timer expired on ring vce0
[ 15.821289] [drm] Fence fallback timer expired on ring vce0
[ 93.821281] [drm] Fence fallback timer expired on ring sdma1
[ 94.471284] [drm] Fence fallback timer expired on ring sdma1
[ 95.531275] [drm] Fence fallback timer expired on ring sdma1
[ 96.181291] [drm] Fence fallback timer expired on ring sdma0
I bisected the issue and posted the details here: #1381 (comment 1700698)
If I revert the integer comparison from amdgpu_runtime_pm != 0 back to amdgpu_runtime_pm > 0, this issue disappears.
That change fixes the issue on kernels 5.7 through 5.9 inclusive. Something was added or modified between 5.9 and 5.10 that breaks things again.
I am in the process of bisecting this second regression, but it is going to take a few days.
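To make the difference between the two comparisons concrete, here is a minimal sketch in Python. It only models the integer test itself, not the driver code around it. The function names are hypothetical, and the meaning of the values (1 = force enable, 0 = disable, -1 = automatic default) is my assumption about the amdgpu runpm module parameter.

```python
# Assumption: amdgpu_runtime_pm takes 1 (force enable), 0 (disable)
# or -1 (automatic, the default). Only the comparison is modelled here.

def passes_gt_zero(runtime_pm):
    """Hypothetical stand-in for the `amdgpu_runtime_pm > 0` test."""
    return runtime_pm > 0


def passes_ne_zero(runtime_pm):
    """Hypothetical stand-in for the `amdgpu_runtime_pm != 0` test."""
    return runtime_pm != 0


# Print the value followed by the result of each comparison.
for value in (-1, 0, 1):
    print(value, passes_gt_zero(value), passes_ne_zero(value))
```

The two tests only disagree for negative values, so if the parameter is left at a negative default, the `!= 0` form takes the branch that the `> 0` form used to skip.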
Hardware description and system information:
- CPU: Ryzen 7 1700
- GPU: AMD RX 580
- System Memory: 32GB
- Type of Display Connection: HDMI
How to reproduce the issue:
- Install QubesOS (it is based on the Xen hypervisor; I do not think the "Qubes" part has any link to the issue)
- Configure the GPU passthrough, using an Arch Linux HVM (or any other Linux system; I previously saw the same issue with Debian) https://neowutran.ovh/qubes/articles/gaming_windows_hvm.html https://neowutran.ovh/qubes/articles/gaming_linux_hvm.html
- Start the VM
- Look at the kernel log
sudo dmesg
- Try to start xorg or anything else that uses the GPU
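For anyone who wants to check quickly whether a given kernel is affected, the "look at the kernel log" step can be sketched as a small Python helper that counts the errors per ring. This is only an illustration: the function name `count_fence_timeouts` is hypothetical, and the sample text is copied from the log lines in this report.

```python
import re
from collections import Counter

# Matches lines such as:
# [ 8.051268] [drm] Fence fallback timer expired on ring gfx
FENCE_RE = re.compile(r"Fence fallback timer expired on ring (\S+)")


def count_fence_timeouts(log_text):
    """Return a Counter mapping ring name to number of timer expiries."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FENCE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample taken from the kernel log in this report.
sample = """\
[ 12.641289] [drm] Fence fallback timer expired on ring sdma0
[ 13.151264] [drm] Fence fallback timer expired on ring sdma1
[ 93.821281] [drm] Fence fallback timer expired on ring sdma1
"""

# Print the rings, most frequent first.
for ring, n in count_fence_timeouts(sample).most_common():
    print(f"{ring}: {n}")
```

On a real system the text would come from the output of sudo dmesg saved to a file; an empty result means the message did not appear in that log.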