CL: More misc fixes
This series has 3 critical bugfixes:
- Dominance indexing uses 16-bit signed indices, which overflow in the 16-component sin/cos/tan libclc implementations and break them.
- The system values lowering ignores work group offsets if no global offset was specified. This breaks several 3-component math bruteforce tests, which use small work group sizes with large global dimensions and therefore require us to loop the dispatch.
- Fix fdiv so it meets CL's conformance requirements, even if the D3D driver lowers it to a separate reciprocal and multiply.
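The dominance bug can be illustrated with the standard pre/post-order interval test, which is assumed here to be what the dominance checks rely on: block A dominates block B exactly when B's DFS interval in the dominator tree nests inside A's. This is a minimal model, not Mesa code; the types and helpers (`dom_index32`, `dom_index16`, `chain_index`, `narrow`) are hypothetical. It builds a straight-line chain of 40,000 blocks so that one block's pre-order index no longer fits in an `int16_t`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical dominator-tree node indices: position in a pre-order and a
 * post-order DFS walk of the dominator tree. */
typedef struct { int32_t pre, post; } dom_index32;
typedef struct { int16_t pre, post; } dom_index16;

/* A dominates B iff B's DFS interval nests inside A's. */
static bool dominates32(dom_index32 a, dom_index32 b)
{
    return a.pre <= b.pre && a.post >= b.post;
}

static bool dominates16(dom_index16 a, dom_index16 b)
{
    return a.pre <= b.pre && a.post >= b.post;
}

/* Indices of block k in a straight-line chain of n blocks, where each block
 * dominates every later one: pre-order numbers the chain top-down, and
 * post-order finishes the deepest block first. */
static dom_index32 chain_index(int32_t k, int32_t n)
{
    assert(k >= 0 && k < n);
    dom_index32 idx = { k, n - 1 - k };
    return idx;
}

/* Narrow to 16-bit storage. Out-of-range conversion to int16_t is
 * implementation-defined in ISO C; GCC and Clang wrap modulo 2^16. */
static dom_index16 narrow(dom_index32 idx)
{
    dom_index16 out = { (int16_t)idx.pre, (int16_t)idx.post };
    return out;
}
```

With 32-bit indices the root dominates block 32768, as it should. After narrowing, that block's pre-order index wraps to -32768, so the root (pre-order index 0) no longer appears to dominate it.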
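The work group offset bug can be modeled in one dimension: a launch too large for a single dispatch is split into several, and each dispatch passes the index of its first group as a work group offset. Everything here is hypothetical (`MAX_GROUPS_PER_DISPATCH` and its value, `global_id`, `global_id_buggy`, `run_looped`); the real lowering works in three dimensions on NIR system values. The buggy variant only applies the work group offset when a global offset is present, mirroring the behavior described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-dispatch limit on the number of work groups in one
 * dimension; the value is made up to keep the example small. */
#define MAX_GROUPS_PER_DISPATCH 4u

/* Global ID of a work-item: the work group offset must always be applied,
 * whether or not the application supplied a global offset. */
static uint32_t global_id(uint32_t group_id, uint32_t group_offset,
                          uint32_t local_size, uint32_t local_id,
                          uint32_t global_offset)
{
    return (group_id + group_offset) * local_size + local_id + global_offset;
}

/* Buggy variant: the work group offset is only folded in when a global
 * offset is present. */
static uint32_t global_id_buggy(uint32_t group_id, uint32_t group_offset,
                                uint32_t local_size, uint32_t local_id,
                                uint32_t global_offset)
{
    if (global_offset == 0)
        return group_id * local_size + local_id;
    return global_id(group_id, group_offset, local_size, local_id, global_offset);
}

/* Run a 1-D launch of total_groups work groups as several dispatches of at
 * most MAX_GROUPS_PER_DISPATCH groups each, counting how often each global
 * ID is produced in hits[]. Returns the number of dispatches issued. */
static unsigned run_looped(uint32_t total_groups, uint32_t local_size,
                           uint32_t global_offset, bool buggy, unsigned *hits)
{
    assert(local_size > 0);
    unsigned dispatches = 0;
    for (uint32_t first = 0; first < total_groups; first += MAX_GROUPS_PER_DISPATCH) {
        uint32_t count = total_groups - first;
        if (count > MAX_GROUPS_PER_DISPATCH)
            count = MAX_GROUPS_PER_DISPATCH;
        /* Each dispatch numbers its groups from 0; "first" is the offset. */
        for (uint32_t g = 0; g < count; g++) {
            for (uint32_t l = 0; l < local_size; l++) {
                uint32_t id = buggy
                    ? global_id_buggy(g, first, local_size, l, global_offset)
                    : global_id(g, first, local_size, l, global_offset);
                hits[id - global_offset]++;
            }
        }
        dispatches++;
    }
    return dispatches;
}
```

With 10 groups of 4 work-items and a limit of 4 groups per dispatch, the correct version covers IDs 0 to 39 exactly once over three dispatches, while the buggy one produces IDs 0 to 7 three times and never reaches IDs above 15.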
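OpenCL requires single-precision `x / y` to be accurate to 2.5 ulp. One plausible way a reciprocal-then-multiply lowering misses that is a very large denominator: 1/b becomes subnormal, a reciprocal that flushes denormals returns 0, and the quotient collapses to 0. A common mitigation is to scale the denominator down before the reciprocal and scale the quotient back afterwards. The sketch below assumes exactly that technique, with thresholds of 2^96 and 2^-32; it is not necessarily what this series does, and `rcp_ftz`, `fdiv_naive` and `fdiv_scaled` are hypothetical.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical model of a hardware reciprocal that flushes denormal
 * results to zero. */
static float rcp_ftz(float x)
{
    assert(x != 0.0f); /* division by zero is out of scope for this model */
    float r = 1.0f / x;
    if (fpclassify(r) == FP_SUBNORMAL)
        r = copysignf(0.0f, r);
    return r;
}

/* a / b lowered naively as a * rcp(b). */
static float fdiv_naive(float a, float b)
{
    return a * rcp_ftz(b);
}

/* Assumed mitigation: if |b| > 2^96, scale b by 2^-32 before the reciprocal
 * so 1/b stays normal, then multiply the quotient by the same 2^-32:
 * a/b == 2^-32 * (a * rcp(b * 2^-32)). */
static float fdiv_scaled(float a, float b)
{
    float scale = fabsf(b) > 0x1p96f ? 0x1p-32f : 1.0f;
    return scale * (a * rcp_ftz(b * scale));
}
```

For a = 2^100 and b = 2^127 the naive lowering returns 0, because 1/b = 2^-127 is below the smallest normal float, while the scaled version returns the exact 2^-27. The model only addresses large denominators; tiny or subnormal denominators would need separate handling.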
It also has several non-critical changes:
- The vec3/vec4 pass lets copy_prop work without scratch memory when a vec3 is passed by value. Without it, clang generates some bizarre code that normal copy_prop can't see through. This should probably be generalized upstream to take advantage of OOB variable reads/writes.
- Add a bunch of optimizations early in the compilation, rather than relying only on the optimization loop in `nir_to_dxil`. This makes the code actually readable before `lower_explicit_io`.
- Fix some more hardcoded 4s that should be the vector size. This skips the ones that are already part of mesa/mesa!6655 (merged); I'll add these to that MR.
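The hardcoded-4 fixes follow a familiar pattern: a loop or array bound written as 4 where the vector's component count belongs, which matters because CL vectors can have 2, 3, 4, 8 or 16 components. A minimal illustration with a hypothetical `vec` type and helpers (not Mesa code):

```c
#include <assert.h>

/* Hypothetical vector value: CL vectors have at most 16 components. */
typedef struct {
    unsigned num_components;
    float c[16];
} vec;

/* Buggy: assumes every vector has exactly 4 components, so it reads a
 * stale 4th slot of a vec3 and drops components 4..15 of a vec16. */
static float sum_hardcoded(const vec *v)
{
    float s = 0.0f;
    for (unsigned i = 0; i < 4; i++)
        s += v->c[i];
    return s;
}

/* Fixed: iterate over the vector's actual size. */
static float sum_sized(const vec *v)
{
    assert(v->num_components <= 16);
    float s = 0.0f;
    for (unsigned i = 0; i < v->num_components; i++)
        s += v->c[i];
    return s;
}
```

For a vec16 of ones, the sized loop returns 16 while the hardcoded one returns 4; for a vec3, the hardcoded loop also picks up whatever happens to sit in the unused 4th slot.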