[RADV] Incorrect Vulkan usage causes a GPU reset and continued instability.
Brief summary of the problem:
When a command buffer is allocated and submitted immediately, without being recorded, the GPU resets and remains unstable until the system is restarted. By unstable I mean that Chromium hangs several times, for several minutes each time, after I restart the X server.
This problem occurs only with RADV and not with the AMD Vulkan drivers.
Note that this usage of Vulkan is incorrect. That does not excuse the fact that an unprivileged user can effectively bring down the entire system.
I previously opened this issue here: mesa/mesa#5507 (closed)
Hardware description:
- CPU: AMD Ryzen 5 2600X
- GPU: 07:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] [1002:67df] (rev ef)
System information:
- Distro name and Version: Arch Linux
- Kernel version: Linux desk 5.14.8-arch1-1 #1 SMP PREEMPT Sun, 26 Sep 2021 19:36:15 +0000 x86_64 GNU/Linux
- AMD official driver version: n/a
How to reproduce the issue:
The following program reproduces the issue. You might have to slightly adjust it on your system.
#include <vulkan/vulkan.h>
#include <stdint.h>

int main(void) {
    /* Create a Vulkan 1.0 instance. */
    VkApplicationInfo appInfo = { 0 };
    appInfo.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    appInfo.apiVersion = VK_API_VERSION_1_0;
    VkInstanceCreateInfo instanceCreateInfo = { 0 };
    instanceCreateInfo.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    instanceCreateInfo.pApplicationInfo = &appInfo;
    VkInstance instance;
    vkCreateInstance(&instanceCreateInfo, NULL, &instance);

    /* Take the first physical device that is reported. */
    uint32_t deviceCount = 1;
    VkPhysicalDevice physicalDevice;
    vkEnumeratePhysicalDevices(instance, &deviceCount, &physicalDevice);

    /* Create a logical device with one queue from queue family 0. */
    float queuePriority = 1.0f;
    VkDeviceQueueCreateInfo queueCreateInfo = { 0 };
    queueCreateInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queueCreateInfo.queueFamilyIndex = 0;
    queueCreateInfo.queueCount = 1;
    queueCreateInfo.pQueuePriorities = &queuePriority;
    VkDeviceCreateInfo deviceCreateInfo = { 0 };
    deviceCreateInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    deviceCreateInfo.pQueueCreateInfos = &queueCreateInfo;
    deviceCreateInfo.queueCreateInfoCount = 1;
    VkDevice device;
    vkCreateDevice(physicalDevice, &deviceCreateInfo, NULL, &device);
    VkQueue queue;
    vkGetDeviceQueue(device, 0, 0, &queue);

    /* Create a command pool for queue family 0. */
    VkCommandPoolCreateInfo info = { 0 };
    info.queueFamilyIndex = 0;
    VkCommandPool pool;
    vkCreateCommandPool(device, &info, NULL, &pool);

    /* Allocate a single command buffer from the pool. */
    VkCommandBufferAllocateInfo allocateInfo = { 0 };
    allocateInfo.commandPool = pool;
    allocateInfo.commandBufferCount = 1;
    VkCommandBuffer buffer;
    vkAllocateCommandBuffers(device, &allocateInfo, &buffer);

    /* Submit the command buffer without ever recording it (invalid usage). */
    VkSubmitInfo submitInfo = { 0 };
    submitInfo.pCommandBuffers = &buffer;
    submitInfo.commandBufferCount = 1;
    vkQueueSubmit(queue, 1, &submitInfo, VK_NULL_HANDLE);
}
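To make explicit which rule the reproducer breaks: the Vulkan specification describes a command buffer lifecycle in which a buffer has to be recorded (vkBeginCommandBuffer, then vkEndCommandBuffer) before it may be submitted. The following is only a simplified model of that rule, not Vulkan code. The type cb_state and the functions cb_begin, cb_end and cb_submit are hypothetical names made up for illustration, and the model ignores resets and the invalid state.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of the Vulkan command buffer lifecycle.
 * None of these names are part of the Vulkan API. */
typedef enum {
    CB_INITIAL,    /* freshly allocated, nothing recorded yet */
    CB_RECORDING,  /* between vkBeginCommandBuffer and vkEndCommandBuffer */
    CB_EXECUTABLE, /* recording finished, may be submitted */
    CB_PENDING     /* submitted to a queue */
} cb_state;

/* Models vkBeginCommandBuffer (simplified: only allowed from CB_INITIAL). */
bool cb_begin(cb_state *s) {
    if (*s != CB_INITIAL) return false;
    *s = CB_RECORDING;
    return true;
}

/* Models vkEndCommandBuffer: finishes recording. */
bool cb_end(cb_state *s) {
    if (*s != CB_RECORDING) return false;
    *s = CB_EXECUTABLE;
    return true;
}

/* Models vkQueueSubmit: only an executable buffer is a valid submission.
 * On failure the state is left unchanged. */
bool cb_submit(cb_state *s) {
    if (*s != CB_EXECUTABLE) return false;
    *s = CB_PENDING;
    return true;
}
```

In this model the reproducer corresponds to calling cb_submit on a buffer that is still in CB_INITIAL, which the model rejects. The real driver performs no such check on its own, which is why the invalid submission reaches the hardware.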
Log files (for system lockups / game freezes / crashes)
Oct 15 01:41:52 desk kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Oct 15 01:41:52 desk kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=83264782, emitted seq=83264784
Oct 15 01:41:52 desk kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
Oct 15 01:41:52 desk kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
Oct 15 01:41:52 desk kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Oct 15 01:41:52 desk kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Oct 15 01:41:52 desk kernel: [drm:dce110_vblank_set [amdgpu]] *ERROR* Failed to get VBLANK!
Oct 15 01:41:53 desk kernel: amdgpu: cp is busy, skip halt cp
Oct 15 01:41:53 desk kernel: amdgpu: rlc is busy, skip halt rlc
Oct 15 01:41:53 desk kernel: amdgpu 0000:07:00.0: amdgpu: BACO reset
Oct 15 01:41:53 desk kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
Oct 15 01:41:53 desk kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
Oct 15 01:41:53 desk kernel: [drm] VRAM is lost due to GPU reset!
Oct 15 01:41:54 desk kernel: ------------[ cut here ]------------
Oct 15 01:41:54 desk kernel: amdgpu 0000:07:00.0: drm_WARN_ON(atomic_read(&vblank->refcount) == 0)
Oct 15 01:41:54 desk kernel: WARNING: CPU: 0 PID: 0 at drivers/gpu/drm/drm_vblank.c:1210 drm_vblank_put+0xe4/0xf0 [drm]
Oct 15 01:41:54 desk kernel: Modules linked in: netlink_diag snd_seq_dummy snd_hrtimer snd_seq overlay vrf wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 libcurve25519_generic libch>
Oct 15 01:41:54 desk kernel: libphy wmi soundcore gpio_amdpt pinctrl_amd gpio_generic mac_hid acpi_cpufreq nls_iso8859_1 vfat fat vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) pkcs8_key_parser crypto_user drm fuse ip_tables x_tables ext4 crc>
Oct 15 01:41:54 desk kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W OE 5.14.8-arch1-1 #1 72b691689b70bfbb45bec8353a1f598ad6e355d8
Oct 15 01:41:54 desk kernel: Hardware name: Gigabyte Technology Co., Ltd. B450M DS3H/B450M DS3H-CF, BIOS F3 12/24/2018
Oct 15 01:41:54 desk kernel: RIP: 0010:drm_vblank_put+0xe4/0xf0 [drm]
Oct 15 01:41:54 desk kernel: Code: 8b 7f 08 4c 8b 67 50 4d 85 e4 74 22 e8 75 0d 19 f8 48 c7 c1 30 56 56 c0 4c 89 e2 48 c7 c7 5c 86 56 c0 48 89 c6 e8 8e f2 52 f8 <0f> 0b eb c3 4c 8b 27 eb d9 0f 1f 00 0f 1f 44 00 00 8b b7 90 00 00
Oct 15 01:41:54 desk kernel: RSP: 0018:ffffa22800003db0 EFLAGS: 00010082
Oct 15 01:41:54 desk kernel: RAX: 0000000000000000 RBX: ffff8de889c60000 RCX: 0000000000000027
Oct 15 01:41:54 desk kernel: RDX: ffff8def7ee18728 RSI: 0000000000000001 RDI: ffff8def7ee18720
Oct 15 01:41:54 desk kernel: RBP: 0000000000000086 R08: 0000000000000000 R09: ffffa22800003be0
Oct 15 01:41:54 desk kernel: R10: ffffa22800003bd8 R11: ffffffffb9acd168 R12: ffff8de8812feee0
Oct 15 01:41:54 desk kernel: R13: ffff8de889c60178 R14: ffff8de889d92180 R15: ffff8de889c74900
Oct 15 01:41:54 desk kernel: FS: 0000000000000000(0000) GS:ffff8def7ee00000(0000) knlGS:0000000000000000
Oct 15 01:41:54 desk kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 15 01:41:54 desk kernel: CR2: 0000361601c88fe0 CR3: 000000016f9d2000 CR4: 00000000003506f0
Oct 15 01:41:54 desk kernel: Call Trace:
Oct 15 01:41:54 desk kernel: <IRQ>
Oct 15 01:41:54 desk kernel: dm_pflip_high_irq+0xd3/0x2b0 [amdgpu f01a08a32db101afed581310fa496ec779498af3]
Oct 15 01:41:54 desk kernel: amdgpu_dm_irq_handler+0x89/0x1f0 [amdgpu f01a08a32db101afed581310fa496ec779498af3]
Oct 15 01:41:54 desk kernel: amdgpu_irq_dispatch+0xca/0x210 [amdgpu f01a08a32db101afed581310fa496ec779498af3]
Oct 15 01:41:54 desk kernel: ? check_preempt_curr+0x2f/0x70
Oct 15 01:41:54 desk kernel: amdgpu_ih_process+0x7b/0xf0 [amdgpu f01a08a32db101afed581310fa496ec779498af3]
Oct 15 01:41:54 desk kernel: amdgpu_irq_handler+0x21/0xa0 [amdgpu f01a08a32db101afed581310fa496ec779498af3]
Oct 15 01:41:54 desk kernel: __handle_irq_event_percpu+0x3d/0x190
Oct 15 01:41:54 desk kernel: handle_irq_event+0x58/0xb0
Oct 15 01:41:54 desk kernel: handle_edge_irq+0x96/0x260
Oct 15 01:41:54 desk kernel: __common_interrupt+0x41/0xa0
Oct 15 01:41:54 desk kernel: common_interrupt+0x7e/0xa0
Oct 15 01:41:54 desk kernel: </IRQ>
Oct 15 01:41:54 desk kernel: asm_common_interrupt+0x1e/0x40
Oct 15 01:41:54 desk kernel: RIP: 0010:cpuidle_enter_state+0xc7/0x380
Oct 15 01:41:54 desk kernel: Code: 8b 3d 05 61 7e 47 e8 98 6a 8a ff 49 89 c5 0f 1f 44 00 00 31 ff e8 b9 77 8a ff 45 84 ff 0f 85 da 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 f6 0f 88 11 01 00 00 49 63 d6 4c 2b 2c 24 48 8d 04 52 48 8d
Oct 15 01:41:54 desk kernel: RSP: 0018:ffffffffb9a03e50 EFLAGS: 00000246
Oct 15 01:41:54 desk kernel: RAX: ffff8def7ee2d700 RBX: 0000000000000002 RCX: 000000000000001f
Oct 15 01:41:54 desk kernel: RDX: 0000000000000000 RSI: 00000000239f541c RDI: 0000000000000000
Oct 15 01:41:54 desk kernel: RBP: ffff8de883cd6000 R08: 00016b161b0174a6 R09: 0000000000000018
Oct 15 01:41:54 desk kernel: R10: 00000000000004f4 R11: 000000000000071a R12: ffffffffb9b4e740
Oct 15 01:41:54 desk kernel: R13: 00016b161b0174a6 R14: 0000000000000002 R15: 0000000000000000
Oct 15 01:41:54 desk kernel: cpuidle_enter+0x29/0x40
Oct 15 01:41:54 desk kernel: do_idle+0x1e1/0x270
Oct 15 01:41:54 desk kernel: cpu_startup_entry+0x19/0x20
Oct 15 01:41:54 desk kernel: start_kernel+0x9ab/0x9d0
Oct 15 01:41:54 desk kernel: secondary_startup_64_no_verify+0xc2/0xcb
Oct 15 01:41:54 desk kernel: ---[ end trace beaf8083f77ba54e ]---
Oct 15 01:41:54 desk kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Oct 15 01:41:54 desk kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Oct 15 01:41:54 desk kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
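Two details in the log can be decoded mechanically. The ring timeout line reports the last signaled and the last emitted fence sequence number, and the "Failed to initialize parser -125" lines carry a negated errno value, which on Linux corresponds to ECANCELED. The sketch below is only a reading aid. The helper names outstanding_fences and kernel_error_text are hypothetical, and the interpretation of -125 assumes the usual kernel convention of returning negative errno values.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Parse the "signaled seq=..., emitted seq=..." part of an amdgpu ring
 * timeout message and return how many fences were emitted but not yet
 * signaled. Returns -1 if the line does not contain both numbers. */
long outstanding_fences(const char *line) {
    const char *p = strstr(line, "signaled seq=");
    unsigned long signaled, emitted;
    if (p == NULL) return -1;
    if (sscanf(p, "signaled seq=%lu, emitted seq=%lu", &signaled, &emitted) != 2)
        return -1;
    return (long)(emitted - signaled);
}

/* Map a (possibly negated) errno value such as -125 to the C library's
 * description of that error. */
const char *kernel_error_text(int code) {
    return strerror(code < 0 ? -code : code);
}
```

Applied to the timeout line above, outstanding_fences returns 2, meaning two jobs on the gfx ring had not completed when the timeout fired.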