amdgpu: ring gfx timeout and hang on vkQueueSubmit after signalling a semaphore from OpenGL on GPU load
System: Host: autumnblaze Kernel: 5.7.9-200.fc32.x86_64 x86_64 bits: 64 compiler: gcc v: 2.34-3.fc32) Desktop: Gnome 3.36.4 wm: gnome-shell dm: GDM Distro: Fedora release 32 (Thirty Two)
CPU: Topology: Quad Core model: Intel Core i7-3770K bits: 64 type: MT MCP arch: Ivy Bridge rev: 9 L2 cache: 8192 KiB flags: avx lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx bogomips: 55940 Speed: 1598 MHz min/max: 1600/4200 MHz Core speeds (MHz): 1: 1598 2: 1598 3: 1598 4: 1599 5: 1598 6: 1599 7: 1598 8: 1598
Graphics: Device-1: Intel Xeon E3-1200 v2/3rd Gen Core processor Graphics vendor: ASRock driver: i915 v: kernel bus ID: 00:02.0 chip ID: 8086:0162
Device-2: AMD Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] vendor: Sapphire Limited Nitro+ driver: amdgpu v: kernel bus ID: 01:00.0 chip ID: 1002:67df
Device-3: NVIDIA GM204 [GeForce GTX 970] vendor: Gigabyte driver: nvidia v: 440.100 bus ID: 02:00.0 chip ID: 10de:13c2
Display: x11 server: Fedora Project X.org 1.20.8 compositor: gnome-shell driver: amdgpu,ati,modesetting,nvidia unloaded: fbdev,nouveau,vesa alternate: nv resolution: 2560x1440 s-dpi: 96
OpenGL: renderer: Radeon RX 580 Series (POLARIS10 DRM 3.37.0 5.7.9-200.fc32.x86_64 LLVM 10.0.0) v: 4.6 Mesa 20.1.3 direct render: Yes
Issue happens on both X and Wayland.
Describe the issue
I am creating a Vulkan image, exporting it to OpenGL, blitting to it, then signalling a semaphore to import it back into Vulkan. However, issuing a vkQueueSubmit after signalling the semaphore on the OpenGL side causes a ring gfx timeout, and the system hangs.
I don't have a clean standalone reproduction, but you can see the OpenGL side here: https://github.com/YaLTeR/bxt-rs/blob/0721b28573c689009b831e9d8dd9a43af33a9854/src/modules/capture/mod.rs#L247
If you look at acquire_image_and_sample, you can see that all it does is a vkQueueSubmit with an empty command buffer: https://github.com/YaLTeR/bxt-rs/blob/0721b28573c689009b831e9d8dd9a43af33a9854/src/modules/capture/vulkan.rs#L173
Note that this crash seems to occur only under GPU load. If I run this code with a small framebuffer and no other programs open, it runs fine and does its job correctly. (The commented-out code there does an image copy, then runs a compute shader to convert colors from RGB to YUV and saves the result into a file; all of this works correctly as long as the GPU crash doesn't happen.) The GPU crash does happen in any of these cases:
- I am recording a screencast at the same time (with OBS, for example);
- I run it on a 16K framebuffer (xrandr --output DisplayPort-2 --scale-from 15360x8640);
- I run it on a small framebuffer but spam it multiple times over a short timespan.
Enabling the validation layers makes the crash easier to trigger, too (because of the additional overhead?).
Commenting out the call to vkQueueSubmit makes the GPU crash disappear. I can even spam that code on a 16K framebuffer without any issues.
Sometimes after this GPU crash I can switch to a different VT and kill gnome-session to recover, but frequently it results in a complete system hang.
Log files as attachment
journalctl of a session where I got the crash and the system hung: kernel.log
Screenshots/video files (if applicable)
With the rest of the code commented out, after the crash I usually see the last 2 frames displayed on my screen, quickly repeating.
When I am able to switch to a different VT, the "out of bounds" virtual console contents on my big monitor display these corrupted pixels as well.
Presumably, when the code continues to run after the crash, it overwrites some now-deallocated memory or something. Also, when the code is not commented out, the log doesn't say "but soft recovered"; instead, a GPU reset happens.
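
For anyone comparing their own logs against the two outcomes described above, here is a minimal Rust sketch that counts timeout lines with and without the "but soft recovered" suffix. It matches only the two phrases quoted in this report ("ring gfx timeout" and "but soft recovered"). The sample lines in main are hypothetical and not copied from kernel.log, and the names TimeoutOutcome, classify_line and summarize are made up for illustration. A line without the suffix is only a hint that a full GPU reset followed; the sketch does not confirm the reset itself.

```rust
/// Outcome of a gfx ring timeout, as far as one log line can tell us.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum TimeoutOutcome {
    /// The line contains "but soft recovered".
    SoftRecovered,
    /// The line reports a timeout without that suffix.
    NotSoftRecovered,
}

/// Classifies a single log line.
/// Returns None when the line does not mention a gfx ring timeout at all.
fn classify_line(line: &str) -> Option<TimeoutOutcome> {
    if !line.contains("ring gfx timeout") {
        return None;
    }
    if line.contains("but soft recovered") {
        Some(TimeoutOutcome::SoftRecovered)
    } else {
        Some(TimeoutOutcome::NotSoftRecovered)
    }
}

/// Returns (soft recovered count, not soft recovered count) for a whole log.
fn summarize(log: &str) -> (usize, usize) {
    let mut soft = 0;
    let mut not_soft = 0;
    for outcome in log.lines().filter_map(classify_line) {
        match outcome {
            TimeoutOutcome::SoftRecovered => soft += 1,
            TimeoutOutcome::NotSoftRecovered => not_soft += 1,
        }
    }
    (soft, not_soft)
}

fn main() {
    // Hypothetical sample lines (assumption: real kernel messages carry
    // extra prefixes and fields, which substring matching tolerates).
    let log = [
        "kernel: amdgpu: ring gfx timeout, but soft recovered",
        "kernel: some unrelated message",
        "kernel: amdgpu: ring gfx timeout",
    ]
    .join("\n");

    let (soft, not_soft) = summarize(&log);
    println!("soft recovered: {}, not soft recovered: {}", soft, not_soft);
}
```

To use it on a real log, the text of a saved journalctl dump could be read with std::fs::read_to_string and passed to summarize.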