intel/compiler: fix derivative on y axis implementation

This rewrites the ddy in EXECUTE_4 mode as a loop to make it more
obvious what is going on, and also sets the group that each set of
4 threads is supposed to execute in.
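
As a rough illustration of what the loop computes, here is a scalar
sketch. It is not the generator code: model_fine_ddy() is a hypothetical
name, and it assumes channels are laid out in 2x2 quads with channels
0-1 as the top row and channels 2-3 as the bottom row of each group
of 4.

```c
/* Hypothetical scalar model of a per-quad ddy, for illustration only.
 * Assumes exec_size is a multiple of 4 and that each group of 4
 * channels is one 2x2 quad: channels 0-1 on top, 2-3 on the bottom. */
static void
model_fine_ddy(const float *src, float *dst, unsigned exec_size)
{
   /* One iteration per group of 4 channels. */
   for (unsigned g = 0; g < exec_size / 4; g++) {
      const float *s = &src[g * 4];
      float *d = &dst[g * 4];

      for (unsigned c = 0; c < 4; c++) {
         /* Bottom row minus top row, within the same column. */
         unsigned col = c & 1;
         d[c] = s[2 + col] - s[col];
      }
   }
}
```

Each pass of the outer loop only touches its own group of 4 channels,
which is the property the explicit group setting is meant to express.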

Fixes the following CTS tests:


Signed-off-by: Lionel Landwerlin <>
Co-Authored-by: Jason Ekstrand <>
Reviewed-by: Matt Turner <>
Fixes: 2134ea38 ("intel/compiler/fs: Implement ddy without using align16 for Gen11+")
