amdgpu: ring gfx timeout and hang on vkQueueSubmit after signalling a semaphore from OpenGL on GPU load

System information

System:
  Host: autumnblaze Kernel: 5.7.9-200.fc32.x86_64 x86_64 bits: 64 
  compiler: gcc v: 2.34-3.fc32) Desktop: Gnome 3.36.4 wm: gnome-shell 
  dm: GDM Distro: Fedora release 32 (Thirty Two) 
CPU:
  Topology: Quad Core model: Intel Core i7-3770K bits: 64 type: MT MCP 
  arch: Ivy Bridge rev: 9 L2 cache: 8192 KiB 
  flags: avx lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx bogomips: 55940 
  Speed: 1598 MHz min/max: 1600/4200 MHz Core speeds (MHz): 1: 1598 2: 1598 
  3: 1598 4: 1599 5: 1598 6: 1599 7: 1598 8: 1598 
Graphics:
  Device-1: Intel Xeon E3-1200 v2/3rd Gen Core processor Graphics 
  vendor: ASRock driver: i915 v: kernel bus ID: 00:02.0 chip ID: 8086:0162 
  Device-2: AMD Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] 
  vendor: Sapphire Limited Nitro+ driver: amdgpu v: kernel bus ID: 01:00.0 
  chip ID: 1002:67df 
  Device-3: NVIDIA GM204 [GeForce GTX 970] vendor: Gigabyte driver: nvidia 
  v: 440.100 bus ID: 02:00.0 chip ID: 10de:13c2 
  Display: x11 server: Fedora Project X.org 1.20.8 compositor: gnome-shell 
  driver: amdgpu,ati,modesetting,nvidia unloaded: fbdev,nouveau,vesa 
  alternate: nv resolution: 2560x1440 s-dpi: 96 
  OpenGL: renderer: Radeon RX 580 Series (POLARIS10 DRM 3.37.0 
  5.7.9-200.fc32.x86_64 LLVM 10.0.0) 
  v: 4.6 Mesa 20.1.3 direct render: Yes

Issue happens on both X and Wayland.

Describe the issue

I am creating a Vulkan image, exporting it to OpenGL, blitting to it, then signalling a semaphore to import it back into Vulkan. However, issuing a vkQueueSubmit after signalling the semaphore on the OpenGL side causes a ring gfx timeout and system hangs.

I don't have a clean standalone reproduction, but you can see the OpenGL side here: https://github.com/YaLTeR/bxt-rs/blob/0721b28573c689009b831e9d8dd9a43af33a9854/src/modules/capture/mod.rs#L247

If you look at acquire_image_and_sample you can see all it's got is a vkQueueSubmit with an empty command buffer: https://github.com/YaLTeR/bxt-rs/blob/0721b28573c689009b831e9d8dd9a43af33a9854/src/modules/capture/vulkan.rs#L173

Note that this crash seems to occur only under GPU load. If I run this code with a small framebuffer and no other programs open, it runs fine and does its job correctly (the commented-out code there does an image copy, then runs a compute shader to convert colors from RGB to YUV and saves the result into a file; all of this works correctly if the GPU crash doesn't happen). However, if I am recording a screencast (with OBS, for example), or if I run it on a 16K framebuffer (xrandr --output DisplayPort-2 --scale-from 15360x8640), or if I run it on a small framebuffer but spam it multiple times over a short timespan, then the GPU crash happens. Enabling the validation layers makes the crash easier to trigger, too (because of the additional overhead?).

Commenting out the call to acquire_image_and_sample (and with it the vkQueueSubmit) makes the GPU crash disappear. I can even spam that code on a 16K framebuffer without any issues.

Sometimes after this GPU crash I can switch to a different VT and kill gnome-session to recover, but frequently it results in a complete system hang.

Log files as attachment

journalctl of a session where I got the crash and the system hung: kernel.log

Screenshots/video files (if applicable)

With the rest of the code commented out, after the crash I usually see the last 2 frames displayed on my screen repeating quickly.

When the crash occurs with the rest of the code in place, this is what I see:

When I am able to switch to a different VT, the "out of bounds" virtual console contents on my big monitor display these corrupted pixels as well.

Presumably, when the code continues to run after the crash, it overwrites some now-deallocated memory or something. Also, when the code is not commented out, the log doesn't say "but soft recovered"; instead, a GPU reset happens.
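
For anyone triaging a log like kernel.log, here is a minimal sketch of how the two outcomes could be told apart by scanning for the two phrases quoted in this report. It only matches the substrings "ring gfx timeout" and "but soft recovered". The names TimeoutKind, classify and count_timeouts are made up for this example, and a timeout line without the suffix is merely assumed to be the case that ended in a GPU reset.

```rust
/// How a "ring gfx timeout" log line ended, judged only from the line itself.
/// (Hypothetical helper type, not part of any driver or tool.)
#[derive(Debug, PartialEq)]
enum TimeoutKind {
    /// The line contains "but soft recovered".
    SoftRecovered,
    /// A timeout line without that suffix; assumed here to be the case
    /// that was followed by a GPU reset.
    NotSoftRecovered,
}

/// Classify one log line. Returns None for lines that do not mention
/// a gfx ring timeout at all.
fn classify(line: &str) -> Option<TimeoutKind> {
    if !line.contains("ring gfx timeout") {
        return None;
    }
    if line.contains("but soft recovered") {
        Some(TimeoutKind::SoftRecovered)
    } else {
        Some(TimeoutKind::NotSoftRecovered)
    }
}

/// Count (soft recovered, not soft recovered) timeouts in a whole log.
fn count_timeouts(log: &str) -> (usize, usize) {
    let mut soft = 0;
    let mut not_soft = 0;
    for line in log.lines() {
        match classify(line) {
            Some(TimeoutKind::SoftRecovered) => soft += 1,
            Some(TimeoutKind::NotSoftRecovered) => not_soft += 1,
            None => {}
        }
    }
    (soft, not_soft)
}
```

Passing the contents of the attached log to count_timeouts would give a quick count of how many timeouts soft recovered and how many did not, without reading the whole file by hand.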
