intel: Perform load_constant address math in 32-bit rather than 64-bit
We lower NIR's load_constant to load_global_constant, which uses A64 bindless messages. As such, we do the following math to produce the address for each load:
base_lo@32 <- BRW_SHADER_RELOC_CONST_DATA_ADDR_LOW
base_hi@32 <- BRW_SHADER_RELOC_CONST_DATA_ADDR_HIGH
base@64 <- pack_64_2x32_split(base_lo, base_hi)
addr@64 <- iadd(base@64, u2u64(offset@32))
On platforms that emulate 64-bit math, we have to emit additional code for the 64-bit iadd to handle the possibility of a carry happening and affecting the top bits.
However, NIR constant data is always uploaded adjacent to the shader assembly, in the same buffer. These buffers are required to live in a 4GB region of memory starting at Instruction State Base Address. We always place the base address at a 4GB address. So the constant data always lives in a buffer entirely contained within a 4GB region, which means any offsets from the start of the buffer cannot possibly affect the high bits.
So instead, we can simply do a 32-bit addition between the low bits of the base and the offset, then pack that with the unchanged high bits.
On iris, IRIS_MEMZONE_SHADER
is at [0, 4GB)
so the high bits are always
zero. We don't even need to patch that portion of the address and can
simply use u2u64
to promote the 32-bit add result to a 64-bit value
where the top bits are 0.
On anv, INSTRUCTION_STATE_POOL_MIN_ADDRESS
is 8GB, so the high bits are
always 0x2
. We don't even need to patch that portion of the address and
can just use an immediate value. We do still need to pack, however.
shader-db on Icelake indicates that this:
- Helps instructions: -1.13% in 135 affected programs
- Helps spills/fills: -4.08% / -4.18% in 4 affected programs
- Gains us 1 SIMD16 compute shader instead of SIMD8
fossil-db on Icelake indicates the following for affected shaders:
Instrs: 10830023 -> 10750080 (-0.74%)
Cycles: 1048521282 -> 1046770379 (-0.17%); split: -0.33%, +0.16%
Subgroup size: 103104 -> 103112 (+0.01%)
Send messages: 570886 -> 570760 (-0.02%)
Loop count: 14428 -> 14429 (+0.01%)
Spill count: 14246 -> 14244 (-0.01%); split: -0.06%, +0.04%
Fill count: 22802 -> 22794 (-0.04%); split: -0.04%, +0.01%
Scratch Memory Size: 654336 -> 662528 (+1.25%)