
Draft: Add amdgpu hsakmt native context support to enable OpenCL based on the AMD ROCm stack

Background:

This idea comes from the virtgpu native context work for Mesa graphics: virtgpu-native-context: Add msm native context.

In the amdgpu ROCm compute stack, hsakmt plays the role that drm plays in the graphics stack. The goal is to let the guest use the native libhsakmt driver to enable the amdgpu ROCm compute stack inside the guest.

OpenCL support is currently in progress. libhsakmt is quite different from libdrm, so more modifications are needed, and this draft MR is still at a very early stage.

Implementation details:

  • Add a libhsakmt backend, an AMD rbtree for memory management, a new blob flag, and a create function.
  • libhsakmt needs the userptr feature, which uses user-space memory directly (generally called SVA/SVM), so guest system memory must be directly accessible to the host libhsakmt. This is the first challenge of the implementation. The WSL GPADL (Guest Physical Address Descriptor Lists) implementation is referenced: WSL-GPADL. Guest user memory used in the hsakmt native context is not movable, so the backend driver and the GPU hardware can access it without data errors. We also plan to forward MMU notifier messages into the backend so that guest user memory no longer needs to be pinned. A sketch of the host-side registration is shown after this list.
  • libhsakmt BOs are address based rather than handle based, unlike libdrm, and the ROCm runtime submits commands using the guest libhsakmt BO address directly. So we need to mirror the guest address and the host address; this is the second challenge. An rbtree is used to manage the libhsakmt BO addresses, keeping every BO address returned by libhsakmt identical to the guest address within a reserved memory range (see the second sketch after this list).
  • The biggest difference between libhsakmt and libdrm is that libhsakmt does not return a file handle when the device is opened; libhsakmt is tied to the process. This is the third challenge, because different guest processes currently share one real libhsakmt backend. We are trying to modify libhsakmt to support multiple handles in one process; alternatively, building a multi-process backend stack may be the better approach.
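Below is a minimal sketch of the first challenge: exposing pinned guest pages to the host libhsakmt as userptr memory. It only uses the public hsaKmtRegisterMemory/hsaKmtMapMemoryToGPU entry points; the vhsakmt_register_guest_userptr wrapper, its parameters, and the assumption that the guest pages are already mapped into the host process (e.g. via the GPADL-style table above) are illustrative, not the actual MR code.

```c
#include "hsakmt.h"

static HSAKMT_STATUS
vhsakmt_register_guest_userptr(void *host_va, HSAuint64 size,
                               HSAuint64 *gpu_va)
{
   HSAKMT_STATUS ret;

   /* Register the already pinned, non-movable guest pages with KFD. */
   ret = hsaKmtRegisterMemory(host_va, size);
   if (ret != HSAKMT_STATUS_SUCCESS)
      return ret;

   /* Map them to the GPU so the hardware can access the guest memory
    * directly, as userptr/SVM memory. */
   ret = hsaKmtMapMemoryToGPU(host_va, size, gpu_va);
   if (ret != HSAKMT_STATUS_SUCCESS)
      hsaKmtDeregisterMemory(host_va);

   return ret;
}
```

And a sketch of the second challenge, guest/host VA mirroring. It assumes libhsakmt's HsaMemFlags FixedAddress bit is available so the host BO can be placed at exactly the guest's address inside the reserved window; the function name and the rbtree bookkeeping comment are hypothetical stand-ins for the vamgr described below.

```c
#include <stdint.h>
#include "hsakmt.h"

static HSAKMT_STATUS
vhsakmt_alloc_mirrored_bo(HSAuint32 node, HSAuint64 guest_va, HSAuint64 size,
                          void **out_addr)
{
   HsaMemFlags flags = {0};
   void *addr = (void *)(uintptr_t)guest_va;

   flags.ui32.HostAccess = 1;
   flags.ui32.FixedAddress = 1; /* request exactly this VA from KFD */

   /* With FixedAddress set, *MemoryAddress is taken as the requested
    * virtual address, so the host BO lands at the same VA the guest uses. */
   HSAKMT_STATUS ret = hsaKmtAllocMemory(node, size, flags, &addr);
   if (ret != HSAKMT_STATUS_SUCCESS)
      return ret;

   /* Record guest_va -> addr in the rbtree here, so later commands that
    * carry raw guest addresses can be validated and resolved. */
   *out_addr = addr;
   return HSAKMT_STATUS_SUCCESS;
}
```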

V2:

  • Add VA checks in the APIs that need a virtual address.
  • Add malloc checks in every response malloc path.
  • Add failure/error handling in vamgr.
  • Remove hsakmt_set_frag_free and hsakmt_set_frag_used from vamgr.
  • Remove all bool and void* from the proto.
  • Add padding, static size checks, and -Wpadded to the proto.
  • Use drm_* for logging.
  • Abort when dereserve fails.
  • Pass some error return values through to the guest UMD.
  • Add bounds checks in some commands.
  • Modify the meson build; the libhsakmt backend now requires drm and amdgpu drm.
  • Reformat the code to the current style.

V3:

  • Use va_handle when creating guest blob mapped resources.
  • Fix VHSAKMT_CCMD.
  • Add an else do { } while (false) in VHSA_CHECK_VA.
  • Fix the failure-handling path in VHSAKMT_CCMD_QUERY_TILE_CONFIG.
  • Add a new flag to ensure the AQL r/w memory is freed after the AQL queue and the guest BO.

V4:

  • Use the drm_context functions in the hsakmt device, but use vhsakmt_context* in place of drm_context* to allow further upgrades.
  • Add hsakmt_util.h to reuse drm_log or use its own vhsa_log.
  • Add hsakmt_vm.c and put the virtual address manager in it.
  • Add a vhsakmt backend for initializing vamgr.
  • Add libhsakmt virtio feature APIs, relying on HSAKMT_VIRTIO to enable or disable them.
  • Add hsakmt_hw.h to reuse drm_hw.h.

V5:

  • Add multi-GPU support.
  • Add more query values to the proto instead of using the payload.

V6:

  • Add a way to get the GFX version of an hsakmt node. The doorbell size depends on the GFX version, so a node GFX version getter is added to choose the doorbell size (see the sketch after this list).
  • Remove the close_fd func.
  • Remove vhsakmt_fd_unmap.
  • Format the code; put the return type on a separate line from the function name.
  • Merge the GPU unmap code path with the free-memory code path.
  • Change vhsakmt_free_event_obj to return void.
  • Change the AQL r/w memory free logic.
  • Free the AQL r/w memory in free_host_memory.
  • Rename vhsakmt_aql_rw_mem_can_remove to vhsakmt_aql_rw_mem_can_free.
  • Format the logs in attach and get blob.
  • Remove useless logs.
  • Remove a useless query type.
  • Move the rsp allocation before assigning rsp->ret when querying info.
  • Add a new function, vhsakmt_alloc_host, for non-scratch memory allocation.
  • Rename some memory allocation functions.
  • Remove the GPU-mapped flag.
  • Remove the allocation retry code.
  • Add a new queue creation function: vhsakmt_queue_create.
  • Split some queue creation code into new functions.
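As a sketch of the doorbell-size choice mentioned above: GFX9 (Vega) and newer ASICs use 64-bit doorbells while older ones use 32-bit, so the node's GFX major version selects between 8 and 4 bytes. The helper name below is hypothetical; the real code would read the version from the node properties.

```c
#include <stdint.h>

/* Hypothetical helper: would read the GFX major version from the node
 * properties (e.g. HsaNodeProperties::EngineId). */
uint32_t vhsakmt_node_gfx_major(uint32_t node_id);

/* GFX9 and newer use 64-bit doorbells; earlier ASICs use 32-bit ones. */
static uint32_t
vhsakmt_doorbell_size(uint32_t node_id)
{
   return vhsakmt_node_gfx_major(node_id) >= 9 ? 8 : 4;
}
```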

TODO:

  • Create a new MR to replace this one.

Performance:

Got 97% (13000/13300) of bare-metal performance in Geekbench 6 using the OpenCL API under the Xen hypervisor. All OpenCL CTS basic tests currently pass.

Enabled in KVM with an AMD 6800XT: 154939/157047 ≈ 98.7% of bare metal in Geekbench 6.

Bare metal: https://browser.geekbench.com/v6/compute/3222054

KVM guest: https://browser.geekbench.com/v6/compute/3222067
