[Intel][Vulkan][Gen12] Vulkan compute shader is 3x slower than the same OpenCL kernel
Hi Intel Mesa team,
This issue comes from Google: https://bugs.chromium.org/p/tint/issues/detail?id=2049
When porting an OpenCL kernel to an exactly equivalent Vulkan compute shader, we found that the Vulkan version is about 3x slower than the OpenCL one with the Intel Mesa driver on Linux. Could you help us investigate whether the Mesa ISA code generator can be improved to give better Vulkan compute shader performance?
Steps to reproduce:
- Download and extract Vulkan_OpenCL.zip.
- Run the OpenCL application `HelloOpenCL` in OpenCLTest/Build/ (or build it with CMake at OpenCLTest/).
- Run the Vulkan application `VulkanTest` in VulkanTest/ (or build it by executing ./compile.sh at VulkanTest/).

You can see the Vulkan application runs 3x slower than the OpenCL one.
Platform information:
- OS: Ubuntu 23.04.1 (Kernel version: 6.2.0-32-generic)
- GPU: Intel(R) Xe Graphics (TGL GT2) (device ID: 0x9A49)
- Mesa driver version: 23.0.4
After comparing the ISA generated by the OpenCL driver with the ISA generated by the Mesa driver, we found that the Mesa driver generates much worse ISA than the OpenCL one. The Mesa compiler reports that it is using the `lifo` register scheduler, which is the worst of the 4 options (`top-down`, `non-lifo`, `none`, `lifo`), so the resulting ISA is not that efficient.
Below is part of the ISA of the Vulkan compute shader on the Mesa driver:
// There is a write-read dependency between the `send` instruction and the following `mad` instructions on registers g9-g16.
send(16) g9UD g69UD nullUD 0x048050fe 0x00000000
dp data 1 MsgDesc: (untyped surface read, Surface = 254, SIMD16, Mask = 0x0)
mlen 2 ex_mlen 0 rlen 8 { align1 1H @1 $9 };
sync nop(1) null<0,1,0>UB { align1 WE_all 1N $9.dst };
mad(16) g69<1>F g123<8,8,1>F g3<8,8,1>F g9<1,1,1>F { align1 1H @4 $7.dst compacted };
mad(16) g71<1>F g125<8,8,1>F g3<8,8,1>F g11<1,1,1>F { align1 1H @4 $9.dst compacted };
mad(16) g73<1>F g65<8,8,1>F g3<8,8,1>F g13<1,1,1>F { align1 1H @4 $9.dst compacted };
mad(16) g75<1>F g67<8,8,1>F g3<8,8,1>F g15<1,1,1>F { align1 1H @4 $9.dst compacted };
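To make the dependency noted in the comment above concrete, here is a minimal Python sketch that checks which `mad` source operands overlap the registers written by the `send`. The operands are copied from the listing; the helper names (`grf_numbers`, `dependent_sources`) are hypothetical and exist only for this illustration, and the sketch assumes that a SIMD16 float operand spans two 32-byte GRFs.

```python
import re

# Values copied from the Mesa listing above: the send returns `rlen 8`
# registers starting at its destination g9, i.e. g9..g16.
SEND_DST, SEND_RLEN = 9, 8

# Operand part of the four mad instructions from the same listing.
MAD_LINES = [
    "mad(16) g69<1>F g123<8,8,1>F g3<8,8,1>F g9<1,1,1>F",
    "mad(16) g71<1>F g125<8,8,1>F g3<8,8,1>F g11<1,1,1>F",
    "mad(16) g73<1>F g65<8,8,1>F g3<8,8,1>F g13<1,1,1>F",
    "mad(16) g75<1>F g67<8,8,1>F g3<8,8,1>F g15<1,1,1>F",
]

# Assumption: 16 lanes * 4 bytes per float / 32 bytes per GRF = 2 GRFs.
GRFS_PER_SIMD16_FLOAT = 2


def grf_numbers(line):
    """Return the GRF numbers of all operands, destination first."""
    return [int(n) for n in re.findall(r"\bg(\d+)<", line)]


def dependent_sources(line, first, count):
    """Return the source GRFs of `line` that overlap g[first..first+count)."""
    written = set(range(first, first + count))
    hits = []
    for src in grf_numbers(line)[1:]:  # skip the destination operand
        span = set(range(src, src + GRFS_PER_SIMD16_FLOAT))
        if span & written:
            hits.append(src)
    return hits


if __name__ == "__main__":
    for line in MAD_LINES:
        print(line.split()[1], "reads send result at",
              dependent_sources(line, SEND_DST, SEND_RLEN))
```

Under these assumptions every one of the four `mad` instructions has exactly one source inside g9-g16, so none of them can issue before the `send` result has arrived.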
Below is part of the ISA of the OpenCL kernel on the Linux Intel OpenCL driver:
// load two vec4s into one register
(W&f0.0.any16h) send.dc0 (8|M0) r8 r2 null 0x0 0x021802FE {@7,$13} // wr:1h+0, rd:1; oword block read x2
...
// then, later down, because the initial load is a preload
...
// then use 1 float from r8 in each of 8 `mad` instructions
mad (16|M0) acc0.0<1>:f r72.0<8;1>:f r12.0<8;1>:f r8.0<0>:f {Compacted,$11.dst}
mad (16|M0) r2.0<1>:f r58.0<8;1>:f r12.0<8;1>:f r8.1<0>:f {Compacted,$14.src}
mad (16|M0) r46.0<1>:f r56.0<8;1>:f r12.0<8;1>:f r8.2<0>:f {Compacted}
mad (16|M0) r32.0<1>:f r54.0<8;1>:f r12.0<8;1>:f r8.3<0>:f {Compacted,$15.src}
...
mad (16|M0) acc0.0<1>:f acc0.0<8;1>:f r14.0<8;1>:f r8.4<0>:f
mad (16|M0) r72.0<1>:f r2.0<8;1>:f r14.0<8;1>:f r8.5<0>:f {Compacted}
mad (16|M0) r74.0<1>:f r46.0<8;1>:f r14.0<8;1>:f r8.6<0>:f {Compacted}
mad (16|M0) r76.0<1>:f r32.0<8;1>:f r14.0<8;1>:f r8.7<0>:f {Compacted}
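As a rough way to compare the two load patterns, the Python sketch below works out how many bytes each `send` returns, using only numbers visible in the two listings (`rlen 8` for Mesa, `oword block read x2` for OpenCL). It assumes 32-byte GRFs on Gen12 and 16-byte OWords; the function names are hypothetical. It is only arithmetic on the listings and does not by itself establish the cause of the slowdown.

```python
GRF_BYTES = 32     # assumed size of one general register on Gen12
OWORD_BYTES = 16   # one OWord is 16 bytes
FLOAT_BYTES = 4
SIMD_WIDTH = 16


def mesa_send_bytes(rlen):
    """Bytes returned by a send whose response length is `rlen` GRFs."""
    return rlen * GRF_BYTES


def ocl_block_read_bytes(owords):
    """Bytes returned by an OWord block read of `owords` OWords."""
    return owords * OWORD_BYTES


# Mesa: untyped surface read, SIMD16, rlen 8.
mesa_bytes = mesa_send_bytes(8)
mesa_floats_per_lane = mesa_bytes // (SIMD_WIDTH * FLOAT_BYTES)

# OpenCL: oword block read x2 into the single register r8.
ocl_bytes = ocl_block_read_bytes(2)
ocl_floats_total = ocl_bytes // FLOAT_BYTES

if __name__ == "__main__":
    print("Mesa send payload:  ", mesa_bytes, "bytes,",
          mesa_floats_per_lane, "floats per lane")
    print("OpenCL send payload:", ocl_bytes, "bytes,",
          ocl_floats_total, "floats shared by all lanes")
```

With these numbers the Mesa `send` returns one vec4 per lane, consumed by four `mad` instructions, while the OpenCL `send` returns two vec4s in total, which eight `mad` instructions read through the scalar `r8.N<0>` regions.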
The ISA for `HelloOpenCL` is in OpenCLTest/HelloOpenCL.isa and the ISA for `VulkanTest` is in `VulkanTest/VulkanTest.isa`.