Horizon Zero Dawn is SDMA limited due to kernel
We're debugging performance for Horizon Zero Dawn and at some point we see it being bottlenecked by SDMA.
Userspace details:
- Game is D3D12, run via VKD3D+radv
- radv will keep a global BO list using AMDGPU_GEM_CREATE_VM_ALWAYS_VALID containing application buffers (due to the game using bindless).
We noticed that sometimes when switching areas the working set of the application goes above the VRAM size, and at that point the sdma0 process takes half a core according to top. Furthermore, when we trace such a state we can see it is essentially SDMA bottlenecked.
A further analysis of a more detailed trace brought the following points forward:
- The SDMA operations are part of amdgpu_vm_copy_ptes
- The amdgpu_bo_move events are for buffers of size 4096 (both the game and Xorg. pagetables?)
- RADV is not doing a significant number of VA_OP ioctls (couple per frame at most, most frames have 0)
- Event counts in the trace are
  - amdgpu_bo_move: 44761
  - amdgpu_vm_copy_ptes: 4720478
  - amdgpu_cs_ioctl: 18163
  - amdgpu_sched_run_job: 2504478
- The issue seems to trigger when the buffers with VRAM domain in the global BO list hit ~8 GiB (on an 8 GiB GPU), but GTT is still pretty much unused (< 1 GiB).
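For anyone trying to reproduce the VRAM-full / GTT-empty state described above, here is a minimal sketch that reads the driver-wide memory counters amdgpu exposes in sysfs. It assumes the GPU is card0 (adjust the path otherwise), and the helper names read_mem_info and usage_fraction are made up for illustration. Note these counters are totals for the whole device, not the per-process global BO list, so they are only a rough proxy.

```python
from pathlib import Path

# amdgpu sysfs counters, all reported in bytes.
COUNTERS = (
    "mem_info_vram_total",
    "mem_info_vram_used",
    "mem_info_gtt_total",
    "mem_info_gtt_used",
)


def read_mem_info(device_dir):
    """Return {counter_name: bytes} for every counter present under device_dir."""
    info = {}
    for name in COUNTERS:
        path = Path(device_dir) / name
        if path.exists():
            info[name] = int(path.read_text().strip())
    return info


def usage_fraction(info, pool):
    """Return used/total for pool ('vram' or 'gtt'), or None if unavailable."""
    total = info.get(f"mem_info_{pool}_total")
    used = info.get(f"mem_info_{pool}_used")
    if not total or used is None:
        return None
    return used / total


if __name__ == "__main__":
    # Assumption: the GPU of interest is card0.
    info = read_mem_info("/sys/class/drm/card0/device")
    for pool in ("vram", "gtt"):
        frac = usage_fraction(info, pool)
        if frac is None:
            print(f"{pool}: counters not available")
        else:
            used_mib = info[f"mem_info_{pool}_used"] / 2**20
            print(f"{pool}: {used_mib:.0f} MiB used ({frac:.1%})")
```

Running this in a loop while switching areas in the game should show VRAM usage pinned near 100% with GTT usage staying low if the same condition is hit.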
The reason I think this is bugworthy is that the number of VM updates and SDMA command buffers being run is way out of proportion compared to the number of evictions. The other thing is that I'm quite surprised the pagetables are getting evicted first. (Wouldn't it be better to evict something that can stay in GTT instead of being ping-ponged around?)
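To put a number on "out of proportion", the event counts from the trace can be turned into ratios. This is just arithmetic on the counts listed in this report; the ratio helper is a hypothetical name, and the interpretation assumes each amdgpu_bo_move event is one move and each amdgpu_cs_ioctl is one userspace submission.

```python
# Event counts copied from the trace described in this report.
counts = {
    "amdgpu_bo_move": 44761,
    "amdgpu_vm_copy_ptes": 4720478,
    "amdgpu_cs_ioctl": 18163,
    "amdgpu_sched_run_job": 2504478,
}


def ratio(counts, numerator, denominator):
    """Return how many `numerator` events occurred per `denominator` event."""
    return counts[numerator] / counts[denominator]


if __name__ == "__main__":
    ptes_per_move = ratio(counts, "amdgpu_vm_copy_ptes", "amdgpu_bo_move")
    jobs_per_cs = ratio(counts, "amdgpu_sched_run_job", "amdgpu_cs_ioctl")
    print(f"PTE copies per BO move:      {ptes_per_move:.1f}")
    print(f"scheduler jobs per CS ioctl: {jobs_per_cs:.1f}")
```

That works out to roughly 105 amdgpu_vm_copy_ptes events per buffer move and roughly 138 scheduler jobs per command submission, which is the disproportion this report is about.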
The only thing interesting in dmesg is:
[136943.001109] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-16)
(I have a dmesg full of them, but they seem to come in bursts rather than continuously while the problem is ongoing)
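To check the "bursts" impression more rigorously, here is a rough sketch that pulls the BO_VA errors out of dmesg output and groups them by timestamp. The regex is written against the single log line quoted above, the 1-second gap threshold is an arbitrary assumption, and parse_errors / group_bursts are illustrative names. For reference, -16 is -EBUSY on Linux.

```python
import errno
import re
import sys

# Matches lines like:
# [136943.001109] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-16)
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s.*Couldn't update BO_VA \((-?\d+)\)")


def parse_errors(lines):
    """Return a list of (timestamp_seconds, error_code) for BO_VA error lines."""
    errors = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            errors.append((float(match.group(1)), int(match.group(2))))
    return errors


def group_bursts(timestamps, max_gap=1.0):
    """Group timestamps into bursts; a gap above max_gap seconds starts a new one."""
    bursts = []
    for ts in sorted(timestamps):
        if bursts and ts - bursts[-1][-1] <= max_gap:
            bursts[-1].append(ts)
        else:
            bursts.append([ts])
    return bursts


if __name__ == "__main__":
    # Usage (assumed): dmesg | python3 this_script.py
    errors = parse_errors(sys.stdin)
    for burst in group_bursts([ts for ts, _ in errors]):
        print(f"{burst[0]:.6f} .. {burst[-1]:.6f}: {len(burst)} errors")
    for code in sorted({code for _, code in errors}):
        print(f"code {code}: {errno.errorcode.get(abs(code), 'unknown')}")
```

Comparing the burst start times against the timestamps of the slow periods in the trace would show whether the errors line up with the SDMA-bound phases or only with their onset.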
This happens on a recent amdgpu-staging-drm-next, but is also reported by people on recently released upstream kernels, so I'm assuming not much changed between versions.