OpenCL supports a global work ID offset at the API level. This offset needs to be visible in two places:
- Explicit calls to get_global_offset builtins, which map to
- Calls to
To support this, I added two new system values:
- Global offset (base_global_invocation_id)
- Global ID without offset (global_invocation_id_zero_base)
If the driver/API doesn't run the new NIR pass with the option indicating that one of these offsets is present, the driver continues to see the existing sysval intrinsics. If it does run the pass with that option, the existing sysvals get lowered into the new ones. Furthermore, the D3D12 backend would prefer to keep the global ID all the way to the backend, so an option is added for that; if that option isn't set, the global ID gets lowered even further.
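As a rough illustration of the arithmetic behind this lowering, here is a minimal sketch in plain C for a single dimension. It does not use the actual NIR API: the struct and function names are hypothetical, and the second function assumes that "lowered even further" means rebuilding the zero-based global ID from the work group ID, work group size, and local ID.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the two new system values (one dimension). */
struct sysvals {
    uint32_t base_global_invocation_id;      /* global offset supplied by the API */
    uint32_t global_invocation_id_zero_base; /* global ID without the offset */
};

/* The value the kernel should observe as its global ID after lowering:
 * the zero-based ID plus the API-provided offset. */
static uint32_t lowered_global_id(const struct sysvals *sv)
{
    return sv->global_invocation_id_zero_base + sv->base_global_invocation_id;
}

/* Assumed further lowering, for backends that do not keep the global ID:
 * rebuild the zero-based global ID from work group and local IDs. */
static uint32_t lowered_global_id_zero_base(uint32_t workgroup_id,
                                            uint32_t workgroup_size,
                                            uint32_t local_id)
{
    return workgroup_id * workgroup_size + local_id;
}
```

A backend that keeps the global ID, as D3D12 prefers, would only need the first step.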
Something that's not explicit in the OpenCL kernel, but is implied by the API requirements, is that the driver may need to loop over multiple dispatches if the number of requested threads exceeds what the hardware can do natively. To support this, the concept of work group ID offsets is also added. This lets each looped dispatch see different work group ID values, which creates the appearance of a single dispatch across the entire thread space. These are lowered similarly to the global ID offsets.
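The looping idea can be sketched as a small CPU-side simulation in C. This is not driver code: the limit MAX_GROUPS_PER_DISPATCH and the function looped_dispatch are hypothetical, and only one dimension is modeled. Each native dispatch numbers its work groups from zero, and adding the per-dispatch offset restores the ID the kernel expects.

```c
#include <stdint.h>

/* Hypothetical hardware limit on work groups per native dispatch. */
#define MAX_GROUPS_PER_DISPATCH 4u

/* Splits a request for total_groups work groups into native dispatches.
 * seen_ids[i] records the work group ID observed by the i-th group overall.
 * Returns the number of native dispatches that were issued. */
static uint32_t looped_dispatch(uint32_t total_groups, uint32_t *seen_ids)
{
    uint32_t dispatches = 0;

    for (uint32_t base = 0; base < total_groups; base += MAX_GROUPS_PER_DISPATCH) {
        uint32_t remaining = total_groups - base;
        uint32_t count = remaining < MAX_GROUPS_PER_DISPATCH
                             ? remaining
                             : MAX_GROUPS_PER_DISPATCH;

        /* Within one native dispatch the hardware ID starts at zero;
         * the work group ID offset (base) is added to it. */
        for (uint32_t hw_id = 0; hw_id < count; hw_id++)
            seen_ids[base + hw_id] = hw_id + base;

        dispatches++;
    }
    return dispatches;
}
```

With ten requested work groups and a limit of four, this issues three native dispatches while the observed IDs still run contiguously from 0 to 9.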