intel/brw: Blockify convergent load_shared on Gfx11-12 as well
We were blockifying shared memory loads on LSC platforms (Alchemist/Meteorlake), but not on HDC platforms (Icelake/Tigerlake/Alderlake/etc.). Using block loads substantially improves the performance of the Vulkan compute shader demo linked in #9960. Prior to this patch, the Vulkan demo on anv took around 2-3x as long as the equivalent OpenCL demo on NEO. Now the Vulkan demo lags behind the OpenCL one by only ~20%. Still work to do, obviously, but substantially better now!
This should presumably improve a number of other compute shaders with convergent shared memory loads on Gfx11-12.
intel/brw: Blockify convergent load_shared on Gfx11-12 as well
Gfx11-12 can support SLM block loads via OWord Block Load messages
(notably, the aligned version, not the unaligned version).
A while back we deleted the SHADER_OPCODE_OWORD_BLOCK_READ opcode.
Rather than bring it back, we continue using UNALIGNED_OWORD_BLOCK_READ
for SLM block access (like we do for SSBOs) but switch it over to the
aligned variant when lowering logical sends. We do ensure the alignment
is at least 16B, however. This is ugly, but it's probably not worth
bringing back a whole extra opcode for a legacy HDC block load quirk.
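
As a rough illustration of the lowering decision described above, here is a
minimal, self-contained sketch. It is not the actual brw code: BlockMsg,
BlockLoad, can_blockify_slm and select_block_msg are hypothetical names made
up for this example, and the 16B threshold is taken from the alignment
requirement mentioned above (one OWord).

```cpp
#include <cstdint>

// Hypothetical stand-ins for illustration only; none of these names exist in Mesa.
enum class BlockMsg { AlignedOWord, UnalignedOWord };

struct BlockLoad {
    bool is_slm;          // true if the load targets shared local memory (SLM)
    uint32_t align_bytes; // known alignment of the load address, in bytes
};

// An SLM load can only become a block load if it is at least OWord (16B) aligned,
// because only the aligned message variant is used for SLM on HDC platforms.
inline bool can_blockify_slm(const BlockLoad &ld)
{
    return ld.is_slm && ld.align_bytes >= 16;
}

// Chooses the message variant at lowering time. The IR keeps a single
// "unaligned" block read opcode; SLM accesses are switched to the aligned
// message here, while other surfaces (e.g. SSBOs) keep the unaligned one.
inline BlockMsg select_block_msg(const BlockLoad &ld)
{
    return ld.is_slm ? BlockMsg::AlignedOWord : BlockMsg::UnalignedOWord;
}
```

In this sketch a caller would check can_blockify_slm() before emitting the
block read, so that select_block_msg() never picks the aligned variant for an
under-aligned SLM address.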
References: BSpec 47652 and 1689
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/9960