intel/compiler: fix derivative on y axis implementation

This rewrites ddy in EXECUTE_4 mode as a loop to make it more obvious
what is going on, and also sets the group that each group of 4 threads
is supposed to execute.
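
As a rough illustration of the loop structure described above, here is a
scalar model in plain C. It is only a sketch, not the generator code from
this change: the function name model_ddy_fine, the channel layout (each
group of 4 consecutive channels is one 2x2 subspan ordered top-left,
top-right, bottom-left, bottom-right) and the sign convention (bottom row
minus top row) are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of a fine derivative along y, processed 4 channels at a
 * time.  Hypothetical helper for illustration only.
 *
 * Assumed layout: channels g+0 and g+1 are the top row of a 2x2 subspan,
 * channels g+2 and g+3 are the bottom row.
 */
static void
model_ddy_fine(float *dst, const float *src, size_t exec_size)
{
   /* The model only handles whole groups of 4 channels. */
   assert(exec_size % 4 == 0);

   /* One iteration per group of 4 channels, mirroring the idea of
    * selecting a group and then executing 4 channels within it.
    */
   for (size_t g = 0; g < exec_size; g += 4) {
      for (size_t c = 0; c < 4; c++) {
         /* Both rows of a column get the same result:
          * bottom value minus top value for that column.
          */
         size_t col = c % 2;
         dst[g + c] = src[g + 2 + col] - src[g + col];
      }
   }
}
```

In this model, forgetting to advance g would make every group of 4 read
and write the channels of the first group, which is the kind of mistake
that setting the group explicitly is meant to rule out.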

Fixes the following CTS tests:

   dEQP-VK.glsl.derivate.dfdyfine.dynamic_*

Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Co-Authored-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Fixes: 2134ea38 ("intel/compiler/fs: Implement ddy without using align16 for Gen11+")

10 jobs for !1187 with review/icl-ddy-fixes in 9 minutes and 18 seconds (queued for 1 second), detached

Stage             Status  Name             Job ID   Duration
Containers Build  passed  debian           #385458  00:00:18
Build+Test        passed  meson-clang      #385460  00:06:57
Build+Test        passed  meson-clover     #385464  00:08:01
Build+Test        passed  meson-main       #385463  00:04:23
Build+Test        passed  meson-swr-glvnd  #385459  00:06:49
Build+Test        passed  meson-vulkan     #385465  00:02:28
Build+Test        passed  scons-llvm       #385467  00:03:34
Build+Test        passed  scons-nollvm     #385466  00:03:35
Build+Test        passed  scons-swr        #385461  00:07:59
Build+Test        passed  scons-win64      #385462  00:08:59