intel/compiler: tileY friendly LID order for CS
Compute shaders that access tileY resources (textures) benefit from Y-local accesses. Accessing them in the default X-major fashion can generate partial writes and inefficient cache accesses, since cache lines in tileY resources progress in the Y-major direction.
Implemented two mechanisms for improving tileY cache accesses by modifying how we group local IDs (LIDs) for CS:
- Walk LID in Y-major fashion (2D_texcoord=0,0 0,1 0,2 ...)
- Walk LID in balanced X/Y fashion. Same as default X-major except walking with blocks of 1x4. (2D_texcoord=0,0 0,1 0,2 0,3 1,0 1,1 1,2 1,3 2,0 ... max_x,2 max_x,3 0,4 0,5 0,6 0,7 1,4 ...)
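For illustration only, the two walk orders listed above can be sketched as a mapping from a linear invocation index to a 2D local ID. This is a minimal Python sketch, not the compiler's implementation: the function names and the `size_x`/`size_y` parameters are hypothetical, and it assumes a 2D workgroup whose invocations are numbered linearly from 0.

```python
def lid_x_major(i, size_x, size_y):
    """Default order: x varies fastest (0,0 1,0 2,0 ...)."""
    return (i % size_x, i // size_x)


def lid_y_major(i, size_x, size_y):
    """Y-major order: y varies fastest (0,0 0,1 0,2 ...)."""
    return (i // size_y, i % size_y)


def lid_x_y4(i, size_x, size_y):
    """Balanced order: X-major walk over blocks 1 wide and 4 tall.

    Hypothetical helper; assumes size_y is a multiple of 4, as the
    second mechanism requires.
    """
    if size_y % 4 != 0:
        raise ValueError("height must be a multiple of 4")
    # Each band of 4 rows holds 4 * size_x invocations.
    band, rem = divmod(i, 4 * size_x)
    # Within a band, walk 4 rows down a column before moving to the next x.
    x, y_in_block = divmod(rem, 4)
    return (x, 4 * band + y_in_block)
```

With `size_x = 3` and `size_y = 8`, `lid_x_y4` walks 0,0 0,1 0,2 0,3 1,0 ... 2,3 0,4 ..., matching the sequence given for the balanced walk.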
Mechanism #1 is simple and functionally works in all situations, but can lead to performance regressions if both tileY and linear resources are used by the CS. Mechanism #2 is more balanced and rarely regresses performance with mixed tileY/linear accesses. However, it requires that the height is a multiple of 4, and it is more complex.
Improves performance on TGL: Borderlands3.dxvk-g2 +1.5%