Per-VMID GWS registers reset in between submissions
Hi,
I was testing amdgpu's GWS functionality from a userspace app (without kfd in between) on my 6700 XT (also tested on vangogh). It works pretty well, but when I run the app exactly 8 times, it suddenly hangs.
The first 7 times, the kernel assigns vmid 1-7 for the app's submission (the app only submits once), but on the 8th run, the kernel re-uses vmid 1. On the first submission, the gws_size and gws_base members of the amdgpu_vmid struct for vmid 1 were already set to the correct values, so on the 8th time, there is no GDS switch emitted which would update the device's GWS register values.
However, it seems that the GWS register doesn't keep the correct values, because when there is no GDS switch emitted, the GWS register values are all 0, and attempting to use GWS in that configuration hangs the GPU.
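To make the failure mode concrete, here is a minimal standalone sketch of the caching logic as I understand it. It is a simplified model, not the actual kernel code: `vmid_state`, `job_state`, `gws_switch_needed` and `submit` are hypothetical names, and the "hardware register" is just a second struct that the test code can zero out to mimic the reset I'm seeing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the GWS values tracked per VMID. */
struct vmid_state {
    uint32_t gws_base;
    uint32_t gws_size;
};

/* Hypothetical stand-in for the GWS range a submission asks for. */
struct job_state {
    uint32_t gws_base;
    uint32_t gws_size;
};

/* A switch is only considered necessary if the cached values differ. */
static bool gws_switch_needed(const struct vmid_state *cached,
                              const struct job_state *job)
{
    return cached->gws_base != job->gws_base ||
           cached->gws_size != job->gws_size;
}

/*
 * Model of one submission: the "register" (hw) is only written when the
 * cache says a switch is needed. Returns true if the register matches
 * what the job expects once the submission has been set up.
 */
static bool submit(struct vmid_state *cached, struct vmid_state *hw,
                   const struct job_state *job)
{
    assert(cached && hw && job);

    if (gws_switch_needed(cached, job)) {
        hw->gws_base = job->gws_base;
        hw->gws_size = job->gws_size;
        *cached = *hw;
    }
    return hw->gws_base == job->gws_base && hw->gws_size == job->gws_size;
}
```

If `hw` is zeroed between two identical submissions on the same VMID while `cached` is left alone, the second `submit` skips the write and returns false, which matches the state in which the GPU hangs.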
I can work around the issue by making the kernel simply always emit GDS switches if a submission uses GWS. If that is an adequate fix, I could submit a patch to the mailing list, but I'm not completely sure if this actually fixes the issue properly or just works around the symptoms. The same code that handles GWS also handles GDS and OA, which AFAIK is used by drivers already. If the hardware's behaviour was to reset the per-VMID registers, I'm sure the issue would've been noticed with GDS or OA first?
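For reference, the workaround amounts to something like the following. Again this is only a sketch under my own assumptions, not the actual patch: `gws_range` and `gws_switch_needed_workaround` are hypothetical names, and I'm assuming "the submission uses GWS" can be expressed as a non-zero GWS size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical GWS range, used for both the cached per-VMID value and the job. */
struct gws_range {
    uint32_t base;
    uint32_t size;
};

/*
 * Workaround: if the job uses GWS at all, always request a switch instead
 * of trusting the cached per-VMID values. Otherwise fall back to the
 * usual "did anything change" comparison.
 */
static bool gws_switch_needed_workaround(const struct gws_range *cached,
                                         const struct gws_range *job)
{
    assert(cached && job);

    if (job->size != 0)
        return true; /* job uses GWS: reprogram unconditionally */

    return cached->base != job->base || cached->size != job->size;
}
```

The cost is an extra GDS switch on every GWS-using submission, even when the cached values would have been correct.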
If it isn't the hardware's behaviour to reset the GWS register, I assume something else in the kernel is probably overwriting the register, but I didn't find any other code emitting GDS switches or otherwise writing to the GWS registers outside of initialization.