[LNL] xe.ko may be losing track of its syncobjs
Platform: Lunar Lake. I don't know if it happens on other platforms; I'm only testing on LNL.
This issue happens rarely. It is much easier to reproduce with a Release build of Mesa and a non-debug kernel build. It definitely sounds like some kind of race condition (but, from user space, there is nothing to race against!). What happens is:
- We create a syncobj
- We submit a batch buffer telling it to signal this syncobj
- Then we wait for the syncobj to be signaled.
- The syncobj is never signaled. It always times out. And if we use an "infinite" timeout, it will stay there, waiting forever.
That's the gist of it. It can happen with the very first batch buffer we submit in Mesa, which is:
If we modify init_render_queue_state() so that it calls anv_async_submit_wait(submit); at the end, just before returning, then whenever this problem happens, it happens there: on the very first batch buffer we submit in Mesa.
Here is the command submission function:
Here is the wait function:
which ultimately ends up calling:
Steps to reproduce
There are multiple ways to reproduce this issue. The one I've been using is:
for i in $(seq 5000); do echo "=== $i"; wm r ./deqp-vk -n dEQP-VK.sparse_resources.image_sparse_binding.1d.rg8ui.11_1_1; done
I don't think it has ever taken me more than 200 iterations before the problem appeared. Generally it takes about 20.
The "wm r" command just makes sure deqp-vk is launched with my compiled release-mode Mesa libraries.
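Since a hung iteration never finishes on its own, the loop above can be wrapped so that it stops at the first iteration that runs too long. This is only a sketch: the function name repro_until_hang and the time limit are my own choices, not anything from Mesa or deqp, and it assumes "wm" is an executable that the coreutils timeout command can launch (not a shell alias or function).

```shell
#!/bin/sh
# Hypothetical helper: run a command repeatedly and stop at the first
# iteration that exceeds a time limit.
# usage: repro_until_hang MAX_ITERATIONS LIMIT_SECONDS COMMAND [ARGS...]
repro_until_hang() {
    max=$1
    limit=$2
    shift 2
    i=1
    while [ "$i" -le "$max" ]; do
        echo "=== $i"
        # timeout(1) exits with status 124 when it had to kill the command.
        timeout "$limit" "$@"
        status=$?
        if [ "$status" -eq 124 ]; then
            echo "iteration $i exceeded ${limit}s (possible hang)"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no hang in $max iterations"
    return 0
}
```

For example, with an assumed limit of 60 seconds per run: repro_until_hang 5000 60 wm r ./deqp-vk -n dEQP-VK.sparse_resources.image_sparse_binding.1d.rg8ui.11_1_1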
If you enable dmesg debug, you'll see that the number of ioctls submitted on each invocation is quite small:
[80202.073853] [drm:drm_stub_open [drm]]
[80202.073886] xe 0000:00:02.0: [drm:drm_open_helper [drm]] comm="deqp-vk", pid=89368, minor=128
[80202.073910] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_VM_CREATE
[80202.074657] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, DRM_IOCTL_SYNCOBJ_CREATE
[80202.074681] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_GEM_CREATE
[80202.074729] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_GEM_MMAP_OFFSET
[80202.074754] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_VM_BIND
[80202.075100] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_GEM_CREATE
[80202.075159] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_GEM_MMAP_OFFSET
[80202.075181] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_VM_BIND
[80202.075446] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_GEM_CREATE
[80202.075487] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_GEM_MMAP_OFFSET
[80202.075501] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_VM_BIND
[80202.075672] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_GEM_CREATE
[80202.075692] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_GEM_MMAP_OFFSET
[80202.075705] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_VM_BIND
[80202.075849] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_GEM_CREATE
[80202.075874] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_GEM_MMAP_OFFSET
[80202.075887] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_VM_BIND
[80202.076092] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_GEM_CREATE
[80202.076213] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_GEM_MMAP_OFFSET
[80202.076239] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_VM_BIND
[80202.076430] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_GEM_CREATE
[80202.076449] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_GEM_MMAP_OFFSET
[80202.076462] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_VM_BIND
[80202.076610] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_GEM_CREATE
[80202.076624] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_VM_BIND
[80202.076789] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_GEM_CREATE
[80202.076803] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_GEM_MMAP_OFFSET
[80202.076815] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_VM_BIND
[80202.077190] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_GEM_CREATE
[80202.077239] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_GEM_MMAP_OFFSET
[80202.077251] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_VM_BIND
[80202.077586] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_EXEC_QUEUE_CREATE
[80202.077848] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, DRM_IOCTL_SYNCOBJ_CREATE
[80202.077865] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_GEM_CREATE
[80202.077886] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_GEM_MMAP_OFFSET
[80202.077900] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_VM_BIND
[80202.078144] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, XE_EXEC
[80202.078165] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, DRM_IOCTL_SYNCOBJ_WAIT
[80203.089265] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk", pid=89368, ret=-62
[80203.089448] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, DRM_IOCTL_SYNCOBJ_WAIT
[80204.113042] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk", pid=89368, ret=-62
[80204.113213] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, DRM_IOCTL_SYNCOBJ_WAIT
[80205.137182] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk", pid=89368, ret=-62
[80205.137364] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, DRM_IOCTL_SYNCOBJ_WAIT
[80206.161413] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk", pid=89368, ret=-62
[80206.161590] xe 0000:00:02.0: [drm:drm_ioctl [drm]] comm="deqp-vk" pid=89368, dev=0xe280, auth=0, DRM_IOCTL_SYNCOBJ_WAIT
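To make a log like the one above easier to compare between good and bad runs, the ioctl lines can be tallied with a small filter. This is a sketch that assumes the exact line format shown in this report (ioctl name as the last field, "ret=-62" at the end of the failing lines); the function name summarize_drm_log is hypothetical. On Linux, errno 62 is ETIME, which is what the repeated DRM_IOCTL_SYNCOBJ_WAIT calls are returning.

```shell
#!/bin/sh
# Hypothetical helper: read DRM debug lines on stdin and print one
# "count name" line per ioctl, plus how many calls returned -62 (ETIME).
summarize_drm_log() {
    awk '
        # Ioctl entry lines end with the ioctl name after "auth=N,".
        /drm:drm_ioctl/ && /auth=/ { count[$NF]++ }
        # Ioctl return lines only appear for errors and end with "ret=-62" here.
        /drm:drm_ioctl/ && /ret=-62$/ { timeouts++ }
        END {
            for (name in count) print count[name], name
            print timeouts + 0, "ret=-62"
        }
    '
}
```

It can be fed from a saved log or directly from dmesg, for example: dmesg | summarize_drm_log. Note that it counts lines from every process in the log, not only deqp-vk.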
I'll see if I can come up with a libdrm reproducer for this at some point.
Please, pretty please: this is a high-priority issue, since it affects Vulkan CTS and, as a consequence, Mesa CI.