improve venus fencing
!412 (merged) adds venus support, but only vtest is modified to use the per-context fencing API. !501 (closed) modifies virgl_renderer_create_fence to create a per-context fence in all active contexts. As explained in that MR, it is only for testing venus in hypervisors and is not how we want to support fencing. I would like to share my thoughts on how to improve venus fencing.
There will be four phases, with each new phase requiring more kernel features that are being worked on:
- phase 1: no new requirement
- phase 2: require kernel implicit fencing
- phase 3: require kernel drm_syncobjs and sync resources
- phase 4: require kernel dma-fences from sync resources
Phase 1
In this phase, venus uses userspace solutions and does not require new features from the kernel. It also means that we need to move away from sync-file-based idle waiting to ring-based busy waiting because of the lack of proper kernel support.
vkWaitForFences is translated to repeated vkGetFenceStatus. vkWaitSemaphores is translated to repeated vkGetSemaphoreCounterValue. While it seems like a step backward, and indeed it is, the reality is that when the guest idle waits, the host still has to call virgl_renderer_poll repeatedly, once per millisecond on crosvm. By translating waiting to repeated polling, venus moves the busy waiting logic from the host into the guest driver. This has a higher CPU overhead (to encode/decode commands), but should have a lower latency (by skipping KVM interrupts and adjusting timer frequency).
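As a rough illustration (not venus code), the waitAll case of vkWaitForFences translated to polling could look like the sketch below; the helper name and the sleep policy are made up:

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>
#include <vulkan/vulkan.h>

static uint64_t now_ns(void)
{
   struct timespec ts;
   clock_gettime(CLOCK_MONOTONIC, &ts);
   return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Busy waiting: poll vkGetFenceStatus until every fence signals or the
 * (relative) timeout expires. */
static VkResult wait_for_fences_by_polling(VkDevice dev, uint32_t count,
                                           const VkFence *fences,
                                           uint64_t timeout_ns)
{
   const uint64_t start = now_ns();

   while (true) {
      bool all_signaled = true;
      for (uint32_t i = 0; i < count; i++) {
         const VkResult result = vkGetFenceStatus(dev, fences[i]);
         if (result == VK_NOT_READY)
            all_signaled = false;
         else if (result != VK_SUCCESS)
            return result; /* e.g., VK_ERROR_DEVICE_LOST */
      }
      if (all_signaled)
         return VK_SUCCESS;
      if (now_ns() - start >= timeout_ns)
         return VK_TIMEOUT;
      /* A short sleep or sched_yield() here trades latency for CPU time. */
   }
}
```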
vkQueueWaitIdle and vkDeviceWaitIdle are not performance-sensitive. vkDeviceWaitIdle can be translated to vkQueueWaitIdle on each queue, while vkQueueWaitIdle can be translated to an (empty) vkQueueSubmit and a vkWaitForFences.
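A minimal sketch of that translation, with a made-up helper name; in phase 1 the fence wait at the end would itself be the polling loop above:

```c
#include <stdint.h>
#include <vulkan/vulkan.h>

/* Emulate vkQueueWaitIdle: submit nothing but a fence, then wait for it. */
static VkResult queue_wait_idle_via_fence(VkDevice dev, VkQueue queue)
{
   const VkFenceCreateInfo fence_info = {
      .sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO,
   };
   VkFence fence;
   VkResult result = vkCreateFence(dev, &fence_info, NULL, &fence);
   if (result != VK_SUCCESS)
      return result;

   /* An empty vkQueueSubmit: the fence signals once all previously
    * submitted work on the queue has completed. */
   result = vkQueueSubmit(queue, 0, NULL, fence);
   if (result == VK_SUCCESS)
      result = vkWaitForFences(dev, 1, &fence, VK_TRUE, UINT64_MAX);

   vkDestroyFence(dev, fence, NULL);
   return result;
}
```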
External fences and semaphores are supported, to the degree that is required by Android. Android requires external fences and semaphores with "copy transference", as defined by the Vulkan extensions. In this phase, vkGet{Fence,Semaphore}FdKHR is translated to vkWait{ForFences,Semaphores} and returns -1 to indicate that the fence or semaphore has signaled.
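For example, the fence path could conceptually look like the sketch below (helper name made up); the sync-file handle type defines -1 as a valid payload meaning "already signaled", which is what makes this translation legal:

```c
#include <stdint.h>
#include <vulkan/vulkan.h>

/* Driver-side sketch of vkGetFenceFdKHR with copy transference: wait for
 * the fence, then hand back -1, i.e. an already-signaled sync file. */
static VkResult get_fence_sync_fd_by_waiting(VkDevice dev,
                                             const VkFenceGetFdInfoKHR *info,
                                             int *fd)
{
   const VkResult result =
      vkWaitForFences(dev, 1, &info->fence, VK_TRUE, UINT64_MAX);
   if (result != VK_SUCCESS)
      return result;

   *fd = -1; /* valid for SYNC_FD: refers to a signaled payload */
   return VK_SUCCESS;
}
```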
WSI fencing with host-side waiting is not possible in this phase. Kernel implicit fencing is also not available. venus is left with guest-side waiting in userspace, either in the driver or in the compositor. And since venus should not depend on modified compositors (e.g., Xorg), venus is required to vkQueueWaitIdle in vkQueuePresentKHR.
Phase 2
In this phase, venus requires kernel implicit fencing (wip branch).
This affects only WSI fencing. venus still needs to do guest-side waiting for WSI, but the waiting happens in the kernel rather than in the userspace. This allows venus to queue multiple frames of GPU work without forced vkQueueWaitIdle.
This should give a nice performance boost, probably the biggest one among all phases.
Phase 3
In this phase, venus requires VIRTIO_GPU_CMD_RESOURCE_CREATE_SYNC (wip branch).
vkWaitForFences, vkWaitSemaphores, vkQueueWaitIdle, and vkDeviceWaitIdle remain the same and do busy waiting.
For each external VkFence or VkSemaphore, a drm_syncobj is created. The host is expected to export a handle with "reference transference" from the host VkFence or VkSemaphore in response to VIRTIO_GPU_CMD_RESOURCE_CREATE_SYNC.
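With fd handles, "reference transference" corresponds to the opaque-fd handle type, so the host-side export could look roughly like the sketch below; how the resulting fd is tied to VIRTIO_GPU_CMD_RESOURCE_CREATE_SYNC is not shown and remains an assumption:

```c
#include <vulkan/vulkan.h>

/* Host-side sketch: export an opaque fd (reference transference) from the
 * host VkSemaphore that backs a guest external semaphore. */
static int export_semaphore_opaque_fd(VkDevice dev, VkSemaphore sem)
{
   /* Device extension entry points must be fetched at runtime. */
   PFN_vkGetSemaphoreFdKHR get_semaphore_fd = (PFN_vkGetSemaphoreFdKHR)
      vkGetDeviceProcAddr(dev, "vkGetSemaphoreFdKHR");
   if (!get_semaphore_fd)
      return -1;

   const VkSemaphoreGetFdInfoKHR info = {
      .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
      .semaphore = sem,
      .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT,
   };
   int fd = -1;
   if (get_semaphore_fd(dev, &info, &fd) != VK_SUCCESS)
      return -1;

   return fd; /* references the semaphore payload, not a one-shot snapshot */
}
```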
Guest drm_syncobjs (and dma_fences) are mainly used to identify the host handles in this phase. This enables host-side waiting for WSI.
When the guest compositor sees a sync file, instead of waiting on the sync file explicitly in the compositor or implicitly in the kernel, it sends the resource id of the sync file to the host. The host uses the resource id to look up the host VkFence or VkSemaphore, exports a host sync file, and sends the sync file to the host compositor.
This however requires explicit fencing and a modified compositor. Since X11 does not support explicit fencing, it will not get host-side waiting for WSI anyway (although there are kernel patches proposed to convert implicit fencing to explicit fencing).
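Where explicit fencing is available, the host side of that lookup-and-export step could look roughly like the sketch below, once the resource id has been resolved to a host VkFence (the lookup itself is not shown):

```c
#include <vulkan/vulkan.h>

/* Host-side sketch: export a sync file from the host VkFence that the
 * guest-provided resource id resolved to, and hand it to the host
 * compositor instead of waiting on it. */
static int export_fence_sync_fd(VkDevice dev, VkFence fence)
{
   PFN_vkGetFenceFdKHR get_fence_fd = (PFN_vkGetFenceFdKHR)
      vkGetDeviceProcAddr(dev, "vkGetFenceFdKHR");
   if (!get_fence_fd)
      return -1;

   const VkFenceGetFdInfoKHR info = {
      .sType = VK_STRUCTURE_TYPE_FENCE_GET_FD_INFO_KHR,
      .fence = fence,
      .handleType = VK_EXTERNAL_FENCE_HANDLE_TYPE_SYNC_FD_BIT,
   };
   int fd = -1;
   if (get_fence_fd(dev, &info, &fd) != VK_SUCCESS)
      return -1;

   return fd;
}
```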
Phase 4
In this phase, venus uses drm_syncobjs for idle waiting.
For each VkFence or timeline VkSemaphore, a drm_syncobj is created. vkWaitForFences and vkWaitSemaphores can just wait on the drm_syncobjs, or they can choose to busy wait a bit before falling back to idle waiting on the drm_syncobjs.
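A guest-side sketch of that hybrid wait, assuming the driver tracks a drm_syncobj handle and a DRM fd alongside each VkFence (that bookkeeping is not shown):

```c
#include <stdint.h>
#include <vulkan/vulkan.h>
#include <xf86drm.h>

/* Guest-side sketch: busy wait briefly, then idle wait on the fence's
 * drm_syncobj.  drm_fd and syncobj_handle are assumed to be tracked by
 * the driver alongside the VkFence. */
static VkResult wait_fence_hybrid(VkDevice dev, VkFence fence,
                                  int drm_fd, uint32_t syncobj_handle,
                                  int64_t abs_timeout_ns)
{
   /* Busy wait a bit to catch fences that signal almost immediately. */
   for (int i = 0; i < 1000; i++) {
      const VkResult result = vkGetFenceStatus(dev, fence);
      if (result != VK_NOT_READY)
         return result; /* VK_SUCCESS or an error */
   }

   /* Idle wait: the guest kernel wakes us when the dma-fence backing the
    * syncobj signals.  The DRM wait takes an absolute CLOCK_MONOTONIC
    * timeout; WAIT_FOR_SUBMIT covers a syncobj with no fence attached yet. */
   if (drmSyncobjWait(drm_fd, &syncobj_handle, 1, abs_timeout_ns,
                      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT, NULL))
      return VK_TIMEOUT; /* simplification: treat any failure as a timeout */

   return VK_SUCCESS;
}
```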
What this needs is a virtio-gpu hypercall that expresses "generate an interrupt when these sync resources signal." That allows the guest kernel to use the interrupt to signal the associated dma-fence and wake up the userspace. On the host side, a thread pool or per-VkQueue threads are needed to do the waiting and interrupt generation.
This is similar to per-context fencing, which needs a hypercall to express "generate an interrupt when this host queue goes idle." To the guest, the difference is between "a host VkQueue and an object id to identify it" and "a host VkFence/VkSemaphore and a resource id to identify it." While the latter is more accurate, it also requires external fences and semaphores from the host driver. But since we need them to support host-side waiting for WSI (i.e., phase 3), it should not be an issue.