intel: Optimize compute workgroup sizes
There are many cases when a client may choose a workgroup size which is non-optimal for our hardware:
- It doesn't care about the local group size, so it sets 1x1x1 and just uses the global group size. This means each shader thread does only 1 unit of work rather than filling all 8, 16, or 32 of its SIMD channels.
- It chooses a large local group size which fits in a slice but doesn't fill the whole slice. This can happen often because we have non-power-of-two numbers of EUs per slice.
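To put a rough number on the first case, here is a minimal sketch of the lane utilization for a given local size. It assumes a simplified model in which a local workgroup is split into whole SIMD threads; the function names are hypothetical and are not taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of SIMD threads needed to cover one local
 * workgroup of local_count invocations (x * y * z). */
static uint32_t
threads_per_group(uint32_t local_count, uint32_t simd_width)
{
   assert(simd_width == 8 || simd_width == 16 || simd_width == 32);
   return (local_count + simd_width - 1) / simd_width;
}

/* Hypothetical helper: percentage of SIMD channels doing useful work,
 * rounded down. */
static uint32_t
lane_utilization_pct(uint32_t local_count, uint32_t simd_width)
{
   uint32_t lanes = threads_per_group(local_count, simd_width) * simd_width;
   return 100 * local_count / lanes;
}
```

Under this model a 1x1x1 workgroup compiled as SIMD16 uses 1 of 16 channels, so `lane_utilization_pct(1, 16)` rounds down to 6.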
There are a number of possible optimizations here:
- For shaders which don't require barriers or SLM (shared variables), we can make the local workgroup size be 8, 16, or 32 depending on how we compile the shader. We then adjust all the various workgroup IDs we generate in the shader to make it look like it's running at the client's requested size.
- For 1x1x1 local workgroups which use SLM and/or barriers, we can move the SLM to normal local variables because there is only one invocation.
- For shaders where the entire local workgroup fits in a single SIMD8, SIMD16, or SIMD32 invocation, we can delete all barrier instructions.
- For small local workgroup sizes which use barriers or SLM, we can put multiple local workgroups into a single SIMD8, SIMD16, or SIMD32 workgroup. We just have to be careful with SLM to ensure that each workgroup gets its own SLM space. Likely, this means dividing up the SLM and applying a per-workgroup offset in the shader.
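The ID adjustment in the first optimization and the SLM split in the last one could look roughly like the sketch below. All names here (`struct cs_ids`, `remap_flat_ids`, `packed_slm_offset`) are hypothetical, the x-major flattening order is an assumption, and in practice this logic would be emitted into the shader rather than run on the CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical type: the IDs the client's shader expects to see. */
struct cs_ids {
   uint32_t group[3];   /* client-visible workgroup ID */
   uint32_t local[3];   /* client-visible local invocation ID */
};

/* No barriers or SLM: the hardware runs flat workgroups of simd_width
 * invocations and each lane recovers the client's IDs from its flat index.
 * Returns false for padding lanes past the end of the dispatch. */
static bool
remap_flat_ids(uint32_t hw_group, uint32_t lane, uint32_t simd_width,
               const uint32_t local_size[3], const uint32_t num_groups[3],
               struct cs_ids *out)
{
   assert(lane < simd_width);
   uint64_t local_count =
      (uint64_t)local_size[0] * local_size[1] * local_size[2];
   uint64_t total =
      local_count * num_groups[0] * num_groups[1] * num_groups[2];
   uint64_t flat = (uint64_t)hw_group * simd_width + lane;
   if (flat >= total)
      return false;

   uint64_t g = flat / local_count;   /* flat client workgroup index */
   uint64_t l = flat % local_count;   /* flat client local index */
   out->group[0] = (uint32_t)(g % num_groups[0]);
   out->group[1] = (uint32_t)((g / num_groups[0]) % num_groups[1]);
   out->group[2] = (uint32_t)(g / ((uint64_t)num_groups[0] * num_groups[1]));
   out->local[0] = (uint32_t)(l % local_size[0]);
   out->local[1] = (uint32_t)((l / local_size[0]) % local_size[1]);
   out->local[2] = (uint32_t)(l / ((uint64_t)local_size[0] * local_size[1]));
   return true;
}

/* Barriers or SLM with a small local size: pack whole client workgroups
 * into one hardware workgroup and give each its own slice of SLM.
 * Returns false for lanes left idle because no whole workgroup fits. */
static bool
packed_slm_offset(uint32_t lane, uint32_t simd_width, uint32_t local_count,
                  uint32_t slm_bytes_per_group,
                  uint32_t *sub_group, uint32_t *slm_offset)
{
   assert(local_count > 0 && local_count <= simd_width);
   uint32_t groups_per_hw = simd_width / local_count;
   uint32_t sub = lane / local_count;
   if (sub >= groups_per_hw)
      return false;
   *sub_group = sub;
   *slm_offset = sub * slm_bytes_per_group;
   return true;
}
```

Note the difference between the two: in the flat remap a client workgroup may straddle two hardware workgroups, which is only safe because there are no barriers or SLM. The packed variant never splits a client workgroup, at the cost of idle lanes when the local size does not divide the SIMD width.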
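I'm not sure how all this works with variable workgroup sizes. Likely, only the first and fourth optimizations above work in that case, since the other two depend on knowing the local size at compile time.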