amdgpu: ring gfx timeout and hang on vkQueueSubmit after signalling a semaphore from OpenGL on GPU load
System information
System:
Host: autumnblaze Kernel: 5.7.9-200.fc32.x86_64 x86_64 bits: 64
compiler: gcc v: 2.34-3.fc32) Desktop: Gnome 3.36.4 wm: gnome-shell
dm: GDM Distro: Fedora release 32 (Thirty Two)
CPU:
Topology: Quad Core model: Intel Core i7-3770K bits: 64 type: MT MCP
arch: Ivy Bridge rev: 9 L2 cache: 8192 KiB
flags: avx lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx bogomips: 55940
Speed: 1598 MHz min/max: 1600/4200 MHz Core speeds (MHz): 1: 1598 2: 1598
3: 1598 4: 1599 5: 1598 6: 1599 7: 1598 8: 1598
Graphics:
Device-1: Intel Xeon E3-1200 v2/3rd Gen Core processor Graphics
vendor: ASRock driver: i915 v: kernel bus ID: 00:02.0 chip ID: 8086:0162
Device-2: AMD Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]
vendor: Sapphire Limited Nitro+ driver: amdgpu v: kernel bus ID: 01:00.0
chip ID: 1002:67df
Device-3: NVIDIA GM204 [GeForce GTX 970] vendor: Gigabyte driver: nvidia
v: 440.100 bus ID: 02:00.0 chip ID: 10de:13c2
Display: x11 server: Fedora Project X.org 1.20.8 compositor: gnome-shell
driver: amdgpu,ati,modesetting,nvidia unloaded: fbdev,nouveau,vesa
alternate: nv resolution: 2560x1440 s-dpi: 96
OpenGL: renderer: Radeon RX 580 Series (POLARIS10 DRM 3.37.0
5.7.9-200.fc32.x86_64 LLVM 10.0.0)
v: 4.6 Mesa 20.1.3 direct render: Yes
Issue happens on both X and Wayland.
Describe the issue
I am creating a Vulkan image, exporting it to OpenGL, blitting to it, then signalling a semaphore to import it back into Vulkan. However, issuing a vkQueueSubmit after signalling the semaphore on the OpenGL side causes a ring gfx timeout, and the system hangs.
I don't have a clean standalone reproduction, but you can see the OpenGL side here: https://github.com/YaLTeR/bxt-rs/blob/0721b28573c689009b831e9d8dd9a43af33a9854/src/modules/capture/mod.rs#L247
If you look at acquire_image_and_sample, you can see that all it does is a vkQueueSubmit with an empty command buffer: https://github.com/YaLTeR/bxt-rs/blob/0721b28573c689009b831e9d8dd9a43af33a9854/src/modules/capture/vulkan.rs#L173
Note that this crash seems to occur only under GPU load. If I run this code with a small framebuffer and no other programs open, it runs fine and does its job correctly (the commented-out code there does an image copy, then runs a compute shader to convert colors from RGB to YUV and saves the result to a file; all of this works correctly if the GPU crash doesn't happen). However, the GPU crash happens if I am recording a screencast (with OBS, for example), if I run it on a 16K framebuffer (xrandr --output DisplayPort-2 --scale-from 15360x8640), or if I run it on a small framebuffer but spam it multiple times over a short timespan. Enabling the validation layers makes the crash easier to trigger, too (because of the additional overhead?).
Commenting out the call to acquire_image_and_sample (and therefore the vkQueueSubmit) makes the GPU crash disappear. I can even spam that code on a 16K framebuffer without any issues.
Sometimes after this GPU crash I can switch to a different VT and kill gnome-session to recover, but frequently it results in a complete system hang.
Log files as attachment
- journalctl of a session where I got the crash and the system hung: kernel.log
Screenshots/video files (if applicable)
With the rest of the code commented out, after the crash I usually see the last 2 frames displayed on my screen, quickly repeating.
When the crash occurs with the rest of the code in place, this is what I see:
When I am able to switch to a different VT, the "out of bounds" virtual console contents on my big monitor display these corrupted pixels as well.
Presumably, when the code continues to run after the crash, it overwrites some now-deallocated memory or something. Also, when the code is not commented out, the kernel log doesn't say "but soft recovered"; instead, a GPU reset is performed.
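For anyone trying to tell the two outcomes apart in their own logs, here is a rough sketch of a helper that counts gfx ring timeouts and how each one was handled. The function name classify_gfx_timeouts is made up for illustration, and the matched substrings ("ring gfx timeout", "but soft recovered", "GPU reset") are taken from the wording in this report; the exact kernel message format is an assumption and may differ between kernel versions.

```python
# Substrings taken from the wording of this report. The exact kernel message
# format is an assumption and may differ between kernel versions.
TIMEOUT = "ring gfx timeout"
SOFT_RECOVERED = "but soft recovered"
GPU_RESET = "GPU reset"


def classify_gfx_timeouts(lines):
    """Count gfx ring timeouts and how they were handled.

    `lines` is any iterable of log lines (a list, or an open file).
    """
    counts = {"timeouts": 0, "soft_recovered": 0, "gpu_reset_lines": 0}
    for line in lines:
        if TIMEOUT in line:
            counts["timeouts"] += 1
            # A soft recovery is assumed to be reported on the same line
            # as the timeout itself.
            if SOFT_RECOVERED in line:
                counts["soft_recovered"] += 1
        # Any line mentioning a GPU reset is counted separately, since one
        # reset may produce several log lines.
        if GPU_RESET in line:
            counts["gpu_reset_lines"] += 1
    return counts
```

It could be run against a saved log, for example with classify_gfx_timeouts(open("kernel.log")), or against the output of journalctl -k -b -1 (kernel messages from the previous boot) saved to a file.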