[Intel][Vulkan][Gen12] Vulkan compute shader is 3x slower than the same OpenCL kernel

Hi Intel Mesa team,

This issue comes from Google: https://bugs.chromium.org/p/tint/issues/detail?id=2049

When porting an OpenCL kernel to an equivalent Vulkan compute shader, we found that the Vulkan version runs about 3x slower than the OpenCL one with the Intel Mesa driver on Linux. Could you help us investigate whether the Mesa ISA code generator can be improved to get better performance from the Vulkan compute shader?

Steps to reproduce:

Download and extract Vulkan_OpenCL.zip.
Run the OpenCL application HelloOpenCL in OpenCLTest/Build/ (or build it with CMake in OpenCLTest/).
Run the Vulkan application VulkanTest in VulkanTest/ (or build it by running ./compile.sh in VulkanTest/).
The Vulkan application runs about 3x slower than the OpenCL one.

Platform information:

OS: Ubuntu 23.04.1 (Kernel version: 6.2.0-32-generic)
GPU: Intel(R) Xe Graphics (TGL GT2) (device ID: 0x9A49)
Mesa driver version: 23.0.4

After comparing the ISA generated by the Intel OpenCL driver with the ISA generated by the Mesa driver, we found that the Mesa driver generates much less efficient code. The Mesa compiler reports that it is using the lifo register scheduler, which is the worst of the 4 options (top-down, non-lifo, none, lifo), and the resulting ISA is not very efficient.

Below is part of the ISA of the Vulkan compute shader on the Mesa driver:

// There is a write-read dependency between the `send` instruction and the following `mad` instructions on registers g9-g16.
send(16)        g9UD            g69UD           nullUD          0x048050fe                0x00000000
dp data 1 MsgDesc: (untyped surface read, Surface = 254, SIMD16, Mask = 0x0)
                            mlen 2 ex_mlen 0 rlen 8 { align1 1H @1 $9 };
sync nop(1)                     null<0,1,0>UB                   { align1 WE_all 1N $9.dst };
mad(16)         g69<1>F         g123<8,8,1>F    g3<8,8,1>F      g9<1,1,1>F { align1 1H @4 $7.dst compacted };
mad(16)         g71<1>F         g125<8,8,1>F    g3<8,8,1>F      g11<1,1,1>F { align1 1H @4 $9.dst compacted };
mad(16)         g73<1>F         g65<8,8,1>F     g3<8,8,1>F      g13<1,1,1>F { align1 1H @4 $9.dst compacted };
mad(16)         g75<1>F         g67<8,8,1>F     g3<8,8,1>F      g15<1,1,1>F { align1 1H @4 $9.dst compacted };

Below is part of the ISA of the OpenCL kernel on the Linux Intel OpenCL driver:

// load two vec4s into one register
(W&f0.0.any16h) send.dc0 (8|M0)   r8      r2      null    0x0            0x021802FE           {@7,$13} // wr:1h+0, rd:1; oword block read x2
...
// then, later down, because the initial load is a preload
...
// then use 1 float from r8 in each of 8 `mad` instructions
mad (16|M0)              acc0.0<1>:f   r72.0<8;1>:f      r12.0<8;1>:f      r8.0<0>:f        {Compacted,$11.dst}
mad (16|M0)              r2.0<1>:f     r58.0<8;1>:f      r12.0<8;1>:f      r8.1<0>:f        {Compacted,$14.src}
mad (16|M0)              r46.0<1>:f    r56.0<8;1>:f      r12.0<8;1>:f      r8.2<0>:f        {Compacted}
mad (16|M0)              r32.0<1>:f    r54.0<8;1>:f      r12.0<8;1>:f      r8.3<0>:f        {Compacted,$15.src}
...
mad (16|M0)              acc0.0<1>:f   acc0.0<8;1>:f     r14.0<8;1>:f      r8.4<0>:f
mad (16|M0)              r72.0<1>:f    r2.0<8;1>:f       r14.0<8;1>:f      r8.5<0>:f        {Compacted}
mad (16|M0)              r74.0<1>:f    r46.0<8;1>:f      r14.0<8;1>:f      r8.6<0>:f        {Compacted}
mad (16|M0)              r76.0<1>:f    r32.0<8;1>:f      r14.0<8;1>:f      r8.7<0>:f        {Compacted}

The full ISA for HelloOpenCL is in OpenCLTest/HelloOpenCL.isa and the full ISA for VulkanTest is in VulkanTest/VulkanTest.isa.
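
To make the difference between the two listings concrete, here is a rough back-of-the-envelope sketch of the register footprint of each load. It is not taken from either driver. It assumes a 32-byte GRF on Gen12, that the SIMD16 untyped surface read returns four float channels per lane (inferred from `rlen 8`), and that the loaded values are the same for every lane, as the `<0>` broadcast regions in the OpenCL listing suggest. The function names are hypothetical.

```python
# Rough register-footprint arithmetic for the two load strategies in the listings.
# Assumption: a Gen12 GRF register is 32 bytes wide.
GRF_BYTES = 32
FLOAT_BYTES = 4
OWORD_BYTES = 16


def per_lane_read_grfs(simd_width: int, channels: int) -> int:
    """GRFs written when every SIMD lane receives its own copy of each channel."""
    total_bytes = simd_width * channels * FLOAT_BYTES
    return (total_bytes + GRF_BYTES - 1) // GRF_BYTES


def oword_block_read_grfs(owords: int) -> int:
    """GRFs written by a block read, which stores the data once for all lanes."""
    total_bytes = owords * OWORD_BYTES
    return (total_bytes + GRF_BYTES - 1) // GRF_BYTES


# Mesa listing: SIMD16 untyped surface read, 4 channels assumed (rlen 8).
mesa_grfs = per_lane_read_grfs(simd_width=16, channels=4)

# OpenCL listing: oword block read x2, i.e. two vec4s of floats.
opencl_grfs = oword_block_read_grfs(owords=2)
opencl_floats = 2 * OWORD_BYTES // FLOAT_BYTES

print(f"Mesa:   {mesa_grfs} GRFs hold one vec4 (feeds 4 mad instructions)")
print(f"OpenCL: {opencl_grfs} GRF holds {opencl_floats} floats (feeds 8 mad instructions)")
```

Under these assumptions, the Mesa code spends 8 registers on one vec4, while the OpenCL code keeps two vec4s in a single register and broadcasts one float per `mad`. That matches the `g9`/`g11`/`g13`/`g15` operands in the first listing and the `r8.0` to `r8.7` operands in the second.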

Edited Oct 09, 2023 by Shao Jiawei
