radv/NIR: Suboptimal codegen for AccessChain on buffer device addresses
Given HLSL like this:
cbuffer UBO : register(b0)
{
    float4 v[1024];
};
RWStructuredBuffer<float4> RW : register(u0);
[numthreads(64, 1, 1)]
void main_constant(uint thr : SV_DispatchThreadID)
{
    RW[thr] = v[40];
}
[numthreads(64, 1, 1)]
void main_uniform(uint wg : SV_GroupID, uint thr : SV_DispatchThreadID)
{
    RW[thr] = v[wg];
}
[numthreads(64, 1, 1)]
void main_dynamic(uint thr : SV_DispatchThreadID)
{
    RW[thr] = v[thr];
}
This covers 3 different kinds of codegen. vkd3d-proton recently changed its codegen from manual pointer arithmetic to access chains, since that allows the compiler to generate optimal code. For the constant address case we now get perfect codegen:
s_load_dwordx4 s[8:11], s[0:1], 0x280 ; f4080200 fa000280
where we used to get manual pointer arithmetic opcodes. This is by far the most common scenario, so this is good, but the dynamic indexing cases can be improved.
Wave uniform index:
s_ashr_i32 s6, s5, 31 ; 91069f05
s_lshl_b64 s[0:1], s[0:1], 4 ; 8f808400
s_add_u32 s6, s3, s0 ; 80060003
s_addc_u32 s7, s4, s1 ; 82070104
s_load_dwordx4 s[4:7], s[6:7], 0x0 ; f4080103 fa000000
Dynamic index:
v_lshl_add_u32 v0, s5, 6, v0 ; d7460000 04010c05
v_ashrrev_i32_e32 v1, 31, v0 ; 3002009f
v_mov_b32_e32 v2, v0 ; 7e040300
v_lshlrev_b64 v[0:1], 4, v[0:1] ; d6ff0000 00020084
v_add_co_u32 v4, vcc, s0, v0 ; d70f6a04 00020000
v_add_co_ci_u32_e32 v5, vcc, s4, v1, vcc ; 500a0204
global_load_dwordx4 v[4:7], v[4:5], off ; dc388000 047d0004
The issue seems to be that NIR does an i32 -> i64 sign extension of the index, presumably because it must assume that the multiplication by 16 can overflow into the 64-bit range. This also means a full 64-bit shift, a full 64-bit addition, etc.
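To make the overflow concern concrete, here is a minimal C model of the address math, not actual NIR or ACO code. The function names `addr_wide` and `addr_narrow` are made up for illustration, and the 16-byte stride is taken from the float4 array above. It shows that doing the shift in 32 bits only diverges from the 64-bit version once the index is far outside any plausible buffer:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of what the disassembly above computes:
 * sign-extend the 32-bit index to 64 bits, then scale by 16. */
uint64_t addr_wide(uint64_t base, int32_t index)
{
    int64_t wide = (int64_t)index;        /* the ashr-by-31 builds the high dword */
    return base + ((uint64_t)wide << 4);  /* 64-bit shift, then 64-bit add */
}

/* Narrow variant: compute the byte offset in 32 bits, then zero-extend. */
uint64_t addr_narrow(uint64_t base, uint32_t index)
{
    uint32_t offset = index << 4;         /* wraps modulo 2^32 */
    return base + (uint64_t)offset;
}

/* Small self-check of the two cases discussed in the text. */
void self_check(void)
{
    /* A small index gives identical addresses either way. */
    assert(addr_wide(0x1000, 40) == addr_narrow(0x1000, 40));
    /* A huge index overflows the 32-bit shift, so the results differ. */
    assert(addr_wide(0, 0x10000000) != addr_narrow(0, 0x10000000));
}
```

Without range information on the index, the compiler has to preserve the `addr_wide` behavior, which is where the extra instructions come from.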
First thing to note is that the SPIR-V here emits OpInBoundsAccessChain, which should guarantee that the index stays within the 0 - 64 KiB range, so we should be able to get away with far fewer instructions. s_load_dword also seems to support taking an offset from another SGPR, so I doubt we actually need to do this pointer arithmetic in 64-bit to begin with.
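As a rough sketch of why the in-bounds guarantee should be enough, the C model below assumes the index is known to be less than the declared array length (1024 float4 elements, from the HLSL above). Under that assumption the byte offset fits comfortably in 32 bits, and adding it zero-extended to the base matches the sign-extended 64-bit computation for every valid index. All names here are hypothetical, and whether NIR can actually derive this range from OpInBoundsAccessChain is exactly the open question:

```c
#include <assert.h>
#include <stdint.h>

/* Values taken from `float4 v[1024]`: 16-byte stride, 1024 elements. */
enum { STRIDE_SHIFT = 4, NUM_ELEMENTS = 1024 };

/* 32-bit byte offset, valid only under the in-bounds assumption. */
uint32_t in_bounds_offset(uint32_t index)
{
    assert(index < NUM_ELEMENTS);   /* assumption: index is in bounds */
    return index << STRIDE_SHIFT;   /* at most 1023 * 16 = 16368, cannot wrap */
}

/* Base plus a zero-extended 32-bit offset. */
uint64_t addr_in_bounds(uint64_t base, uint32_t index)
{
    return base + in_bounds_offset(index);
}

/* Reference: sign-extend to 64 bits first, as the current codegen does. */
uint64_t addr_reference(uint64_t base, int32_t index)
{
    return base + ((uint64_t)(int64_t)index << STRIDE_SHIFT);
}

/* Returns 1 if both computations agree for every in-bounds index. */
int all_in_bounds_match(uint64_t base)
{
    for (uint32_t i = 0; i < NUM_ELEMENTS; i++)
        if (addr_in_bounds(base, i) != addr_reference(base, (int32_t)i))
            return 0;
    return 1;
}
```

The same argument holds for any buffer up to 64 KiB, since the largest offset still fits in 16 bits.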