intel vulkan: DEVICE_LOST or assert after 16k command buffers from same command pool

System information

OS: Arch Linux
GPU: 00:02.0 VGA compatible controller [0300]: Intel Corporation Skylake GT2 [HD Graphics 520] [8086:1916] (rev 07)
Kernel version: Linux hostname 6.1.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 21 Dec 2022 22:27:55 +0000 x86_64 GNU/Linux
Mesa version: 22.3.2 or mesa git 1825ad13 (2023 Jan 02)
Xserver version (if applicable): X.Org X Server 1.21.1.6
Desktop manager and compositor: openbox, picom

Describe the issue

Update

The Mesa ANV Vulkan driver consistently returns VK_ERROR_DEVICE_LOST from vkQueueSubmit2 (or asserts in a debug build) after exactly 16,384 command buffers have been allocated from the same command pool.

This can be reproduced in a headless app that does:

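for (i = 1; i <= 16384 + 1; i++) {
   // alloc_info: commandPool = pool, commandBufferCount = 1
   vkAllocateCommandBuffers(device, &alloc_info, &cmd);
   vkBeginCommandBuffer(cmd, &begin_info);
   vkEndCommandBuffer(cmd);
   // submit_info references cmd; VK_ERROR_DEVICE_LOST after 16k repeats
   vkQueueSubmit2(queue, 1, &submit_info, VK_NULL_HANDLE);
}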

Below is the original bug report and gfxrecon recording from an application, which does a few submits per frame.

Original bug report

In a Vulkan gfx app I am writing (xmas hobby hacking yea!), I get VK_ERROR_DEVICE_LOST from vkQueueSubmit2 after thousands of frames (several minutes) of working perfectly. Validation layers are clean, lavapipe works perfectly.

I have attached API traces captured with GFXReconstruct which reproduce this issue reliably every time on Mesa 22.3.1 and Mesa git master (2022-Dec-23). Also included: gzipped human readable API dump.

With a Mesa git debug build, I get this assert (instead of the VK_ERROR_DEVICE_LOST error).

assert(bt_block->offset < 0); in anv_batch_chain.c:695

This issue is deterministic: it always occurs after the same number of frames (depending on command buffer usage). I have provided a minimal example with just pipeline barriers, a blit and some timeline semaphore waits/signals which always errors after exactly 8192 frames.

If I add some drawing commands and more semaphore signals, the error occurs earlier, e.g. after 1280 frames (always some "round" number). If I remove almost everything, the error occurs after 16384 successful frames.

The number of successful frames before the error stays constant from run to run. Timing is not a factor: adding debug prints that slow the program down does not change the behavior. It reproduces perfectly in the GDB debugger, even when stopped for a while. The app uses two threads, but the issue also reproduces with gfxrecon-replay in a single thread.

When replaying the API traces with gfxrecon-replay, the issue occurs earlier than when running the app natively. It seems that gfxrecon inserts some command buffers of its own, which reduces the number of successful frames before the error.

Backtrace from assert:

#4  0x00007ffff7cdf486 in __assert_fail () from /usr/lib/libc.so.6
#5  0x00007ffff667e81b in anv_cmd_buffer_alloc_binding_table (
    cmd_buffer=cmd_buffer@entry=0x7fff97c6b990, entries=entries@entry=1,
    state_offset=state_offset@entry=0x7ffff507946c)
    at ../src/intel/vulkan/anv_batch_chain.c:695
#6  0x00007ffff6683504 in anv_cmd_buffer_alloc_blorp_binding_table (
    cmd_buffer=cmd_buffer@entry=0x7fff97c6b990, num_entries=num_entries@entry=1,
    state_offset=state_offset@entry=0x7ffff507946c,
    bt_state=bt_state@entry=0x7ffff5079470) at ../src/intel/vulkan/anv_blorp.c:1055
#7  0x00007ffff66ba0a8 in blorp_alloc_binding_table (batch=0x7ffff5079e60,
    state_size=<optimized out>, state_alignment=<optimized out>,
    surface_maps=0x7ffff50794b0, surface_offsets=0x7ffff50794a8,
    bt_offset=<synthetic pointer>, num_entries=<optimized out>)
    at ../src/intel/vulkan/genX_blorp_exec.c:162
#8  blorp_setup_binding_table (batch=batch@entry=0x7ffff5079e60,
    params=params@entry=0x7ffff50796b0) at ../src/intel/blorp/blorp_genX_exec.h:1600
#9  0x00007ffff66bc9b0 in blorp_exec_3d (params=0x7ffff50796b0, batch=0x7ffff5079e60)
    at ../src/intel/blorp/blorp_genX_exec.h:2040
#10 blorp_exec (batch=0x7ffff5079e60, params=0x7ffff50796b0)
    at ../src/intel/blorp/blorp_genX_exec.h:2643
#11 0x00007ffff66bd350 in blorp_exec_on_render (params=0x7ffff50796b0,
    batch=0x7ffff5079e60) at ../src/intel/vulkan/genX_blorp_exec.c:305
#12 gfx9_blorp_exec (batch=0x7ffff5079e60, params=0x7ffff50796b0)
    at ../src/intel/vulkan/genX_blorp_exec.c:382
#13 0x00007ffff69ae7f8 in blorp_ccs_ambiguate (batch=batch@entry=0x7ffff5079e60,
    surf=surf@entry=0x7ffff5079e80, level=level@entry=0, layer=layer@entry=0)
    at ../src/intel/blorp/blorp_clear.c:1583
#14 0x00007ffff6685f51 in anv_image_ccs_op (cmd_buffer=cmd_buffer@entry=0x7fff97c6b990,
    image=image@entry=0x7fffec002890, format=ISL_FORMAT_R8G8B8A8_UNORM, swizzle=...,
    aspect=aspect@entry=VK_IMAGE_ASPECT_COLOR_BIT, level=level@entry=0, base_layer=0,
    layer_count=1, ccs_op=ISL_AUX_OP_AMBIGUATE, clear_value=0x0, predicate=false)
    at ../src/intel/vulkan/anv_blorp.c:1840
#15 0x00007ffff66c47de in transition_color_buffer (cmd_buffer=0x7fff97c6b990,
    image=<optimized out>, aspect=VK_IMAGE_ASPECT_COLOR_BIT, base_level=0,
    level_count=1, base_layer=0, layer_count=1,
    initial_layout=VK_IMAGE_LAYOUT_UNDEFINED,
    final_layout=VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL, src_queue_family=4294967295,
    dst_queue_family=4294967295, will_full_fast_clear=false)
    at ../src/intel/vulkan/genX_cmd_buffer.c:1256
#16 0x00007ffff66c4e27 in cmd_buffer_barrier (cmd_buffer=0x7fff97c6b990,
    dep_info=<optimized out>, reason=0x7ffff6cee345 "pipe barrier")
    at ../src/intel/vulkan/genX_cmd_buffer.c:3819
#17 0x00005555556a4601 in ash::device::Device::cmd_pipeline_barrier2 (
    self=0x555555cd4750, command_buffer=..., dependency_info=0x7ffff507a6a8)
    at src/device.rs:117

Attachments

First repro case: a minimal valid Vulkan application, with a command pool/buffer per frame, pipeline barriers, a blit from an unused image to the swapchain, some timeline semaphore signals/waits and a present (using binary semaphores for sync). This example always fails with DEVICE_LOST (or asserts in a debug build) after 8192 frames (two and a half minutes!), slightly fewer under gfxrecon-replay.

apidump.txt.gz

assert-repro.gfxr

Second repro case: I tried to reduce this further and removed almost everything. It submits one empty command buffer per frame, and then presents. Only about a dozen Vulkan calls per frame. This is no longer valid Vulkan (pipeline barriers for the swapchain image are missing), and it does NOT assert but returns DEVICE_LOST on debug builds too.

This example returns DEVICE_LOST after 16384 frames (almost five minutes).

apidump-minimal.txt.gz

repro-minimal.gfxr

On my Skylake laptop, this issue reproduces 100% of the time when replayed with gfxrecon-replay.

To reproduce the issue, run gfxrecon-replay assert-repro.gfxr.

GFXrecon version: I have captured (and replayed) the error using git commit 205a185f696dcd01d7188a30c7590eafb0c45c36 from https://github.com/LunarG/gfxreconstruct .

I can provide the source of the test application I am using (in private, on request), but it shouldn't be necessary as the API traces reproduce the issue reliably.

Regression

I do not know. I have been seeing this behavior for a few months (late 2022), but I don't know whether it occurs with older versions of Mesa.

Any extra information would be greatly appreciated

Let me know if I can help. I poked around in GDB with the bug but I don't know the codebase or the hw. To my eyeballs it looks like some chunk allocator runs out of chunks and should allocate more but doesn't.
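
To make that guess concrete, here is a toy model in C of a fixed-size block pool that never grows or recycles. It is only a sketch of my hypothesis, not ANV code: the names (toy_pool, toy_pool_alloc, toy_buffers_until_failure) are made up, and the capacity of 16384 is an assumption picked to match the observed limit, not a value read from the driver source.

```c
#include <stdint.h>

/* ASSUMPTION: capacity chosen to match the observed 16384 limit;
 * it is not taken from the ANV source. */
#define TOY_POOL_BLOCKS 16384u

/* Hypothetical fixed-size block pool: blocks are handed out in order and
 * are never returned or grown, mimicking the suspected failure mode. */
struct toy_pool {
    uint32_t next_free; /* index of the next unused block */
};

/* Hand out one block; returns its index, or -1 once the pool is exhausted. */
static int32_t toy_pool_alloc(struct toy_pool *pool)
{
    if (pool->next_free >= TOY_POOL_BLOCKS)
        return -1; /* a working allocator would grow or recycle here */
    return (int32_t)pool->next_free++;
}

/* Count how many command buffers receive all of their blocks before the
 * first failed allocation, if each one takes blocks_per_cmd blocks. */
static uint32_t toy_buffers_until_failure(uint32_t blocks_per_cmd)
{
    struct toy_pool pool = { 0 };
    uint32_t succeeded = 0;

    if (blocks_per_cmd == 0)
        return 0; /* nothing is ever allocated, so nothing to count */

    for (;;) {
        for (uint32_t b = 0; b < blocks_per_cmd; b++) {
            if (toy_pool_alloc(&pool) < 0)
                return succeeded;
        }
        succeeded++;
    }
}
```

In this toy, toy_buffers_until_failure(1) gives 16384 and toy_buffers_until_failure(2) gives 8192. That lines up with the two repro cases only if the first case happens to consume two blocks per frame, which is also an assumption. The model does not account for the 1280-frame figure.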

Thank you for your time!

Edited Jan 19, 2023 by Riku Salminen
