cuda: Add support for async memory alloc/free
Because of the implicit global synchronization behavior of synchronous allocation APIs such as cuMemAlloc and cuMemFree, frequent allocation and freeing can be a performance bottleneck. To address this, stream-ordered allocation APIs were added in CUDA 11.2.
This MR adds support for the asynchronous CUDA memory alloc/free APIs (e.g., cuMemAllocAsync and cuMemFreeAsync).
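For reference, a minimal driver-API sketch of what stream-ordered allocation looks like (error handling reduced; this is background, not code from this MR):

```c
#include <cuda.h>

/* Minimal stream-ordered alloc/free sketch (CUDA driver API).
 * Both calls are ordered on `stream`, so no device-wide
 * synchronization is triggered, unlike cuMemAlloc/cuMemFree. */
static CUresult
alloc_and_free_async (CUstream stream, size_t size)
{
  CUdeviceptr ptr;
  CUresult ret;

  ret = cuMemAllocAsync (&ptr, size, stream);
  if (ret != CUDA_SUCCESS)
    return ret;

  /* ... enqueue kernels/copies that use `ptr` on the same stream ... */

  /* The free is also stream-ordered: the memory returns to the pool
   * once previously enqueued work on `stream` has completed. */
  return cuMemFreeAsync (ptr, stream);
}
```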
In addition to the async allocation feature, an interface is added so that applications can set a CUDA memory pool on GStreamer, allowing allocated CUDA memory to be retained and reused.
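As background on the retain/reuse behavior (this shows the underlying CUDA mechanism, not the new GStreamer interface): at the CUDA level, retention is controlled by pool attributes, e.g. raising the release threshold keeps freed memory cached in the pool instead of returning it to the OS:

```c
#include <stdint.h>
#include <cuda.h>

/* Keep freed memory cached in the device's default memory pool so that
 * later cuMemAllocAsync calls can reuse it without going back to the
 * driver for a fresh allocation. */
static CUresult
retain_pool_memory (CUdevice device)
{
  CUmemoryPool pool;
  cuuint64_t threshold = UINT64_MAX; /* never trim back to the OS */
  CUresult ret;

  ret = cuDeviceGetDefaultMemPool (&pool, device);
  if (ret != CUDA_SUCCESS)
    return ret;

  return cuMemPoolSetAttribute (pool,
      CU_MEMPOOL_ATTR_RELEASE_THRESHOLD, &threshold);
}
```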
The GstCuda library / NVCODEC plugin will not use async allocation by default, since async-allocated memory may not be compatible with other libraries. Users can enable async allocation by setting the GST_CUDA_ENABLE_STREAM_ORDERED_ALLOC environment variable or the "prefer-stream-ordered-alloc" property of GstCudaContext.
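For example, the environment-variable opt-in could look like this (the commented pipeline is illustrative only; element availability depends on the build):

```shell
# Opt in to stream-ordered (async) CUDA allocation for this process
export GST_CUDA_ENABLE_STREAM_ORDERED_ALLOC=1

# Illustrative pipeline using nvcodec CUDA elements:
# gst-launch-1.0 videotestsrc ! cudaupload ! cudadownload ! fakesink

echo "$GST_CUDA_ENABLE_STREAM_ORDERED_ALLOC"
```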
Note that there are cases where synchronous CUDA memory allocation is unavoidable:
- decoder: DPB memory is allocated and owned by the NVDEC driver, so async allocation can be used only for the decoder output buffer pool.
- encoder: NVENC does not appear to allow registering CUDA memory located in a CUDA memory pool, so the encoder will propose a CUDA buffer pool with synchronous allocation enabled.