v3dv: implement VK_KHR_16bit_storage and VK_KHR_8bit_storage (!14648)

Iago Toral requested to merge itoral/mesa:v3dv_khr_8bit_storage into main Jan 21, 2022

This series implements VK_KHR_16bit_storage and VK_KHR_8bit_storage. There are a few hw limitations to consider here:

TMU general vector access is restricted to 32-bit, so we can only do scalar 16-bit/8-bit load/store. To handle this, we lower general load/store to scalar when the bit-size is not 32-bit and rely on the NIR vectorization pass to reconstruct these into equivalent vector or scalar 32-bit (or 16-bit) load/store that we can support. For example, we can re-interpret an f16 vec2 load as a 32-bit float load, with some additional ALU instructions to extract the 16-bit data elements from the 32-bit result.
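The re-interpretation of an f16 vec2 load as a single 32-bit load can be sketched on the CPU side as follows. This is only an illustration, not driver code: `load32`, `unpack_2x16` and `load_f16vec2` are hypothetical names, the byte buffer stands in for GPU memory, and a little-endian element layout is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit load: assembles one little-endian
 * 32-bit word from a byte buffer. The offset must be 4-byte aligned. */
static uint32_t load32(const uint8_t *buf, size_t offset)
{
    assert(offset % 4 == 0);
    return (uint32_t)buf[offset] |
           ((uint32_t)buf[offset + 1] << 8) |
           ((uint32_t)buf[offset + 2] << 16) |
           ((uint32_t)buf[offset + 3] << 24);
}

/* The "additional ALU instructions": a mask and a shift that split the
 * 32-bit result into its two 16-bit elements. */
static void unpack_2x16(uint32_t word, uint16_t *x, uint16_t *y)
{
    *x = (uint16_t)(word & 0xffffu); /* low half: first element */
    *y = (uint16_t)(word >> 16);     /* high half: second element */
}

/* An f16 vec2 load expressed as one 32-bit load plus extraction. The
 * results are raw f16 bit patterns, not converted floats. */
static void load_f16vec2(const uint8_t *buf, size_t offset,
                         uint16_t *x, uint16_t *y)
{
    unpack_2x16(load32(buf, offset), x, y);
}
```

With the bytes `00 3c 00 40` at offset 0, `load_f16vec2` yields the f16 bit patterns `0x3c00` (1.0) and `0x4000` (2.0).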
The driver has some optimized paths for 32-bit loads using ldunif and ldunifa instructions, which we use for UBO and push constant loads. In some scenarios we can also use these for 16-bit and 8-bit loads (specifically, when the 16-bit/8-bit value is at a 32-bit aligned address); otherwise we demote to general TMU access, which is slower. Note that because these instructions read 32 bits from memory, this optimization can leave garbage in the upper bits of 16-bit and 8-bit registers. We therefore need extra ALU instructions when manipulating these values to limit operations to the bit-size of interest, since we don't have native 16-bit and 8-bit instructions and usually have to implement width conversions using 32-bit MOV instructions.
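A minimal sketch of the two ideas in this point: the alignment test that decides whether the 32-bit fast path is usable, and the masking needed before an operation that is sensitive to the upper bits. The function names are hypothetical and the code models the behavior on the CPU, not the actual compiler logic in the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The fast path reads a full 32-bit word, so in this sketch it is only
 * considered usable when the small value starts at a 32-bit aligned
 * byte offset. */
static bool can_use_32bit_fast_path(uint32_t byte_offset)
{
    return (byte_offset & 3u) == 0;
}

/* Clear whatever the 32-bit read left above the value's bit-size. */
static uint32_t clamp_to_bit_size(uint32_t reg, unsigned bit_size)
{
    assert(bit_size == 8 || bit_size == 16 || bit_size == 32);
    if (bit_size == 32)
        return reg;
    return reg & ((1u << bit_size) - 1u);
}

/* Unsigned 16-bit less-than built from 32-bit operations: both operands
 * are masked first, otherwise garbage in the upper bits would decide
 * the comparison. */
static bool ult16(uint32_t a, uint32_t b)
{
    return clamp_to_bit_size(a, 16) < clamp_to_bit_size(b, 16);
}
```

For example, a register holding `0xDEAD0001` represents the 16-bit value 1. A plain 32-bit compare against 2 would report it as larger, while `ult16` masks both operands first and reports it as smaller.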
The extension requires that drivers support RTZ and RTE rounding modes on f32 to f16 conversions. Our hw seems to be doing RTE and doesn't support RTZ, so I implemented RTZ in software, which won't be optimal.
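To illustrate what an RTZ conversion involves, here is a CPU-side sketch of an f32 to f16 conversion that rounds toward zero by truncating the mantissa. It is not the code from this series: `f32_to_f16_rtz` is a hypothetical name, and the handling of NaN payloads and f32 denormals is a simplifying assumption.

```c
#include <stdint.h>
#include <string.h>

/* Convert an f32 to an f16 bit pattern, rounding toward zero (RTZ). */
static uint16_t f32_to_f16_rtz(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits)); /* well-defined type pun */

    uint16_t sign = (uint16_t)((bits >> 16) & 0x8000u);
    uint32_t exp = (bits >> 23) & 0xffu;
    uint32_t mant = bits & 0x7fffffu;

    if (exp == 0xffu) {
        if (mant == 0)
            return (uint16_t)(sign | 0x7c00u); /* infinity */
        /* NaN: force the quiet bit, keep the top payload bits */
        return (uint16_t)(sign | 0x7e00u | (mant >> 13));
    }

    /* Rebias the exponent from f32 (127) to f16 (15) */
    int32_t e = (int32_t)exp - 127 + 15;

    if (e >= 0x1f) {
        /* Too large for f16: RTZ clamps to the largest finite value
         * (65504) instead of overflowing to infinity */
        return (uint16_t)(sign | 0x7bffu);
    }

    if (e <= 0) {
        /* Result is an f16 denormal or zero */
        if (e < -10)
            return sign; /* magnitude below 2^-25: truncates to zero */
        mant |= 0x800000u; /* make the implicit leading 1 explicit */
        return (uint16_t)(sign | (mant >> (14 - e)));
    }

    /* Normal case: dropping the low 13 mantissa bits rounds toward zero */
    return (uint16_t)(sign | ((uint32_t)e << 10) | (mant >> 13));
}
```

The difference from RTE shows up on inputs that lie between two representable f16 values: `65520.0f` converts to `0x7bff` (65504) here, whereas round-to-nearest-even would produce infinity.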

Edited Jan 25, 2022 by Iago Toral

