
d3d12: Implement async video decode/processing and cross-engine implicit synchronization

Sil Vilerino requested to merge sivileri/mesa:implicit_engine_sync into main

This MR implements async video decode/processing and cross-engine implicit synchronization in the d3d12 gallium driver, removing some unnecessary CPU-side blocking and increasing performance.

The work is done in several parts:

  1. frontend/va: Pass the input surface fence on end_picture to encode/processing entrypoints too so the gallium driver can GPU-wait for completion before utilizing the surface on those entrypoints.
  2. frontend/va: Allow setting surface completion fences for PIPE_VIDEO_ENTRYPOINT_PROCESSING. This is done in the same way as !20133 (merged) does for decode, but since this operation takes an input and an output texture, we have one fence for each. Drivers can wait on the input fence before reading the input surface, and can assign a completion fence on the output surface.
  3. d3d12: Make decode and processing async (encode already is) and remove CPU waits/blocks.
  4. d3d12: Make decode, encode, processing wait on the input surface fences and assign the output completion fences. This way combinations of workloads will implicitly GPU-sync between each other.
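
The wait-on-input, signal-on-output pattern from the list above can be sketched as a toy model. This is only an illustration under stated assumptions: every name here (`toy_fence`, `toy_queue`, `toy_surface`, `toy_submit`, `toy_queue_step`) is hypothetical and is not part of the Mesa, VA-API, or D3D12 APIs. Each queue stands in for one GPU engine, and a fence is modeled as a (queue, value) pair that counts as signaled once that queue has retired that many jobs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model; none of these names exist in Mesa or D3D12. */
#define TOY_MAX_JOBS 16

struct toy_queue;

/* Signaled once `queue` has retired at least `value` jobs. */
struct toy_fence {
    const struct toy_queue *queue; /* NULL means "no pending work" */
    uint64_t value;
};

/* One engine (decode, encode, or processing) with in-order execution. */
struct toy_queue {
    struct toy_fence deps[TOY_MAX_JOBS]; /* input fence of each submitted job */
    uint64_t submitted;
    uint64_t completed;
};

struct toy_surface {
    struct toy_fence fence; /* completion fence of the last job that wrote it */
};

static bool toy_fence_signaled(struct toy_fence f)
{
    return f.queue == NULL || f.queue->completed >= f.value;
}

/* Record a job that reads `in` (may be NULL) and writes `out`.
 * Returns immediately: the dependency is stored for the "GPU" to honor,
 * and the output surface gets a fence for later consumers to wait on. */
static void toy_submit(struct toy_queue *q, const struct toy_surface *in,
                       struct toy_surface *out)
{
    assert(q->submitted < TOY_MAX_JOBS);
    q->deps[q->submitted] = in ? in->fence : (struct toy_fence){ NULL, 0 };
    q->submitted++;
    out->fence = (struct toy_fence){ q, q->submitted };
}

/* Simulate the engine: retire the next job only if its input fence has
 * signaled. Returns true if a job was retired. */
static bool toy_queue_step(struct toy_queue *q)
{
    if (q->completed == q->submitted)
        return false; /* nothing pending */
    if (!toy_fence_signaled(q->deps[q->completed]))
        return false; /* still waiting on another engine */
    q->completed++;
    return true;
}
```

In this model, submitting a decode job followed by a processing job that reads the decoded surface never blocks the caller. Stepping the processing queue makes no progress until the decode queue has retired its job, which is the cross-engine ordering that parts 3 and 4 aim to get from the GPU rather than from CPU waits.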

Initial testing shows GPU utilization reaching 100% for decode workloads, up from ~75% before, and above 90% for transcode workloads, up from ~66% before.

Performance-wise, some decoding workloads run ~20% to 40% faster than before and are now on par with other native d3d11/d3d12 implementations.

cc @rdong @jenatali
