improve venus fencing
!412 (merged) adds venus support, but only vtest is modified to use the per-context fencing API. !501 (closed) modifies virgl_renderer_create_fence to create a per-context fence in all active contexts. As explained in that MR, it is only for testing venus in hypervisors and is not how we want to support fencing. I would like to share my thoughts on how to improve venus fencing.
There will be four phases, with each new phase requiring more kernel features that are being worked on:
- phase 1: no new requirement
- phase 2: require kernel implicit fencing
- phase 3: require kernel drm_syncobjs and sync resources
- phase 4: require kernel dma-fences from sync resources
Phase 1
In this phase, venus uses userspace solutions and does not require new features from the kernel. It also means that we need to move away from sync-file-based idle waiting to ring-based busy waiting because of the lack of proper kernel support.
vkWaitForFences is translated to repeated vkGetFenceStatus. vkWaitSemaphores is translated to repeated vkGetSemaphoreCounterValue. While it seems like a step backward, and indeed it is, the reality is that when the guest idle waits, the host still has to call virgl_renderer_poll repeatedly, once per millisecond on crosvm. By translating waiting to repeated polling, venus moves the busy waiting logic from the host into the guest driver. This has a higher CPU overhead (to encode/decode commands), but should have a lower latency (by skipping KVM interrupts and adjusting timer frequency).
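As a rough illustration (not venus code), the waitAll case of vkWaitForFences translated to polling could look like the sketch below; the helper name and the sleep policy are made up:

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>
#include <vulkan/vulkan.h>

static uint64_t now_ns(void)
{
   struct timespec ts;
   clock_gettime(CLOCK_MONOTONIC, &ts);
   return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Busy waiting: poll vkGetFenceStatus until every fence signals or the
 * (relative) timeout expires. */
static VkResult wait_for_fences_by_polling(VkDevice dev, uint32_t count,
                                           const VkFence *fences,
                                           uint64_t timeout_ns)
{
   const uint64_t start = now_ns();

   while (true) {
      bool all_signaled = true;
      for (uint32_t i = 0; i < count; i++) {
         const VkResult result = vkGetFenceStatus(dev, fences[i]);
         if (result == VK_NOT_READY)
            all_signaled = false;
         else if (result != VK_SUCCESS)
            return result; /* e.g., VK_ERROR_DEVICE_LOST */
      }
      if (all_signaled)
         return VK_SUCCESS;
      if (now_ns() - start >= timeout_ns)
         return VK_TIMEOUT;
      /* A short sleep or sched_yield() here trades latency for CPU time. */
   }
}
```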
vkQueueWaitIdle and vkDeviceWaitIdle are not performance-sensitive. vkDeviceWaitIdle can be translated to vkQueueWaitIdle on each queue, while vkQueueWaitIdle can be translated to an (empty) vkQueueSubmit and a vkWaitForFences.
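A minimal sketch of that translation, with a made-up helper name; in phase 1 the fence wait at the end would itself be the polling loop above:

```c
#include <stdint.h>
#include <vulkan/vulkan.h>

/* Emulate vkQueueWaitIdle: submit nothing but a fence, then wait for it. */
static VkResult queue_wait_idle_via_fence(VkDevice dev, VkQueue queue)
{
   const VkFenceCreateInfo fence_info = {
      .sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO,
   };
   VkFence fence;
   VkResult result = vkCreateFence(dev, &fence_info, NULL, &fence);
   if (result != VK_SUCCESS)
      return result;

   /* An empty vkQueueSubmit: the fence signals once all previously
    * submitted work on the queue has completed. */
   result = vkQueueSubmit(queue, 0, NULL, fence);
   if (result == VK_SUCCESS)
      result = vkWaitForFences(dev, 1, &fence, VK_TRUE, UINT64_MAX);

   vkDestroyFence(dev, fence, NULL);
   return result;
}
```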
External fences and semaphores are supported, to the degree that is required by Android. Android requires external fences and semaphores with "copy transference", as defined by the Vulkan extensions. In this phase, vkGet{Fence,Semaphore}FdKHR is translated to vkWait{ForFences,Semaphores} and returns -1 to indicate that the fence or semaphore has signaled.
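For example, the fence path could conceptually look like the sketch below (helper name made up); the sync-file handle type defines -1 as a valid payload meaning "already signaled", which is what makes this translation legal:

```c
#include <stdint.h>
#include <vulkan/vulkan.h>

/* Driver-side sketch of vkGetFenceFdKHR with copy transference: wait for
 * the fence, then hand back -1, i.e. an already-signaled sync file. */
static VkResult get_fence_sync_fd_by_waiting(VkDevice dev,
                                             const VkFenceGetFdInfoKHR *info,
                                             int *fd)
{
   const VkResult result =
      vkWaitForFences(dev, 1, &info->fence, VK_TRUE, UINT64_MAX);
   if (result != VK_SUCCESS)
      return result;

   *fd = -1; /* valid for SYNC_FD: refers to a signaled payload */
   return VK_SUCCESS;
}
```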
WSI fencing with host-side waiting is not possible in this phase. Kernel implicit fencing is also not available. venus is left with guest-side waiting in userspace, either in the driver or in the compositor. And since venus should not depend on modified compositors (e.g., Xorg), venus is required to vkQueueWaitIdle in vkQueuePresentKHR.
Phase 2
In this phase, venus requires kernel implicit fencing (wip branch).
This affects only WSI fencing. venus still needs to do guest-side waiting for WSI, but the waiting happens in the kernel rather than in the userspace. This allows venus to queue multiple frames of GPU work without forced vkQueueWaitIdle.
This should give a nice performance boost, probably the biggest one among all phases.
Phase 3
In this phase, venus requires VIRTIO_GPU_CMD_RESOURCE_CREATE_SYNC (wip branch).
vkWaitForFences, vkWaitSemaphores, vkQueueWaitIdle, and vkDeviceWaitIdle remain the same and do busy waiting.
For each external VkFence or VkSemaphore, a drm_syncobj is created. The host is expected to export a handle with "reference transference" from the host VkFence or VkSemaphore in response to VIRTIO_GPU_CMD_RESOURCE_CREATE_SYNC.
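With fd handles, "reference transference" corresponds to the opaque-fd handle type, so the host-side export could look roughly like the sketch below; how the resulting fd is tied to VIRTIO_GPU_CMD_RESOURCE_CREATE_SYNC is not shown and remains an assumption:

```c
#include <vulkan/vulkan.h>

/* Host-side sketch: export an opaque fd (reference transference) from the
 * host VkSemaphore that backs a guest external semaphore. */
static int export_semaphore_opaque_fd(VkDevice dev, VkSemaphore sem)
{
   /* Device extension entry points must be fetched at runtime. */
   PFN_vkGetSemaphoreFdKHR get_semaphore_fd = (PFN_vkGetSemaphoreFdKHR)
      vkGetDeviceProcAddr(dev, "vkGetSemaphoreFdKHR");
   if (!get_semaphore_fd)
      return -1;

   const VkSemaphoreGetFdInfoKHR info = {
      .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
      .semaphore = sem,
      .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT,
   };
   int fd = -1;
   if (get_semaphore_fd(dev, &info, &fd) != VK_SUCCESS)
      return -1;

   return fd; /* references the semaphore payload, not a one-shot snapshot */
}
```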
Guest drm_syncobjs (and dma_fences) are mainly used to identify the host handles in this phase. This enables host-side waiting for WSI.
When the guest compositor sees a sync file, instead of waiting on the sync file explicitly in the compositor or implicitly in the kernel, it sends the resource id of the sync file to the host. The host uses the resource id to look up the host VkFence or VkSemaphore, exports a host sync file, and sends the sync file to the host compositor.
This however requires explicit fencing and a modified compositor. Since X11 does not support explicit fencing, it will not get host-side waiting for WSI anyway (although there are kernel patches proposed to convert implicit fencing to explicit fencing).
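Where explicit fencing is available, the host side of that lookup-and-export step could look roughly like the sketch below, once the resource id has been resolved to a host VkFence (the lookup itself is not shown):

```c
#include <vulkan/vulkan.h>

/* Host-side sketch: export a sync file from the host VkFence that the
 * guest-provided resource id resolved to, and hand it to the host
 * compositor instead of waiting on it. */
static int export_fence_sync_fd(VkDevice dev, VkFence fence)
{
   PFN_vkGetFenceFdKHR get_fence_fd = (PFN_vkGetFenceFdKHR)
      vkGetDeviceProcAddr(dev, "vkGetFenceFdKHR");
   if (!get_fence_fd)
      return -1;

   const VkFenceGetFdInfoKHR info = {
      .sType = VK_STRUCTURE_TYPE_FENCE_GET_FD_INFO_KHR,
      .fence = fence,
      .handleType = VK_EXTERNAL_FENCE_HANDLE_TYPE_SYNC_FD_BIT,
   };
   int fd = -1;
   if (get_fence_fd(dev, &info, &fd) != VK_SUCCESS)
      return -1;

   return fd;
}
```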
Phase 4
In this phase, venus uses drm_syncobjs for idle waiting.
For each VkFence or timeline VkSemaphore, a drm_syncobj is created. vkWaitForFences and vkWaitSemaphores can just wait on the drm_syncobjs, or they can choose to busy wait a bit before falling back to idle waiting on the drm_syncobjs.
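A guest-side sketch of that hybrid wait, assuming the driver tracks a drm_syncobj handle and a DRM fd alongside each VkFence (that bookkeeping is not shown):

```c
#include <stdint.h>
#include <vulkan/vulkan.h>
#include <xf86drm.h>

/* Guest-side sketch: busy wait briefly, then idle wait on the fence's
 * drm_syncobj.  drm_fd and syncobj_handle are assumed to be tracked by
 * the driver alongside the VkFence. */
static VkResult wait_fence_hybrid(VkDevice dev, VkFence fence,
                                  int drm_fd, uint32_t syncobj_handle,
                                  int64_t abs_timeout_ns)
{
   /* Busy wait a bit to catch fences that signal almost immediately. */
   for (int i = 0; i < 1000; i++) {
      const VkResult result = vkGetFenceStatus(dev, fence);
      if (result != VK_NOT_READY)
         return result; /* VK_SUCCESS or an error */
   }

   /* Idle wait: the guest kernel wakes us when the dma-fence backing the
    * syncobj signals.  The DRM wait takes an absolute CLOCK_MONOTONIC
    * timeout; WAIT_FOR_SUBMIT covers a syncobj with no fence attached yet. */
   if (drmSyncobjWait(drm_fd, &syncobj_handle, 1, abs_timeout_ns,
                      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT, NULL))
      return VK_TIMEOUT; /* simplification: treat any failure as a timeout */

   return VK_SUCCESS;
}
```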
What this needs is a virtio-gpu hypercall that expresses "generate an interrupt when these sync resources signal." That allows the guest kernel to use the interrupt to signal the associated dma-fence and wake up the userspace. On the host side, a thread pool or per-VkQueue threads are needed to do the waiting and interrupt generation.
This is similar to per-context fencing, which needs a hypercall to express "generate an interrupt when this host queue goes idle." To the guest, the difference is between "a host VkQueue and an object id to identify it" and "a host VkFence/VkSemaphore and a resource id to identify it." While the latter is more accurate, it also requires external fences and semaphores from the host driver. But since we need them to support host-side waiting for WSI (i.e., phase 3), it should not be an issue.