intel: Perform load_constant address math in 32-bit rather than 64-bit (!20999)

Kenneth Graunke requested to merge kwg/mesa:constant-carries into main Jan 30, 2023

We lower NIR's load_constant to load_global_constant, which uses A64 bindless messages. As such, we do the following math to produce the address for each load:

base_lo@32 <- BRW_SHADER_RELOC_CONST_DATA_ADDR_LOW
base_hi@32 <- BRW_SHADER_RELOC_CONST_DATA_ADDR_HIGH
base@64 <- pack_64_2x32_split(base_lo, base_hi)
addr@64 <- iadd(base@64, u2u64(offset@32))

On platforms that emulate 64-bit math, we have to emit additional code for the 64-bit iadd to handle the possibility of a carry happening and affecting the top bits.
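To illustrate where that extra code comes from, here is a minimal Python sketch of a 64-bit add built from 32-bit halves. The helper name `emulated_iadd64` is hypothetical, and this is not the lowering the backend actually emits; it only shows the carry detection and propagation that a plain 32-bit add would not need.

```python
MASK32 = 0xFFFFFFFF

def emulated_iadd64(base_lo, base_hi, offset):
    """Add a 32-bit offset to a 64-bit base given as two 32-bit halves.

    Hypothetical helper for illustration only.
    """
    lo = (base_lo + offset) & MASK32   # 32-bit add, wraps on overflow
    carry = 1 if lo < base_lo else 0   # unsigned overflow means a carry out
    hi = (base_hi + carry) & MASK32    # extra add to propagate the carry
    return lo, hi

# No carry: only the low half changes.
print(emulated_iadd64(0x00001000, 0x2, 0x20))
# Carry: the low half wraps and the high half is incremented.
print(emulated_iadd64(0xFFFFFFF0, 0x2, 0x20))
```

The compare and the second add are the overhead in question: they exist only to handle a carry into the high half.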

However, NIR constant data is always uploaded adjacent to the shader assembly, in the same buffer. These buffers are required to live in a 4GB region of memory starting at Instruction State Base Address. We always place the base address at a 4GB-aligned address. So the constant data always lives in a buffer entirely contained within a single aligned 4GB region, which means adding an offset from the start of the buffer can never carry into the high 32 bits.

So instead, we can simply do a 32-bit addition between the low bits of the base and the offset, then pack that with the unchanged high bits.
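A small Python sketch of that equivalence, under the assumption stated above that the buffer sits entirely inside one 4GB-aligned window. The function names and the example addresses are made up for illustration and do not come from the Mesa sources.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF
FOUR_GB = 1 << 32

def addr_64bit(base, offset):
    """Reference: full 64-bit add of base and a zero-extended 32-bit offset."""
    return (base + offset) & MASK64

def addr_32bit(base, offset):
    """32-bit add on the low half, packed with the unchanged high half."""
    base_lo = base & MASK32
    base_hi = base >> 32
    lo = (base_lo + offset) & MASK32
    return (base_hi << 32) | lo

# Hypothetical layout: a window starting at a 4GB-aligned address, with
# the constant data somewhere inside it.
window_start = 2 * FOUR_GB
const_data = window_start + 0x00123000

for offset in (0, 64, 0xFFFF):
    assert addr_32bit(const_data, offset) == addr_64bit(const_data, offset)
```

The two functions only disagree when `base_lo + offset` overflows 32 bits, that is, when the access would cross out of the 4GB window, which the placement rules exclude.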

On iris, IRIS_MEMZONE_SHADER is at [0, 4GB) so the high bits are always zero. We don't even need to patch that portion of the address and can simply use u2u64 to promote the 32-bit add result to a 64-bit value where the top bits are 0.

On anv, INSTRUCTION_STATE_POOL_MIN_ADDRESS is 8GB, so the high bits are always 0x2. Here too, that portion of the address needs no patching and can simply be an immediate value. We do still need to pack the two halves, however.
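The two driver cases can be sketched as follows. This is an illustrative Python model, not driver code: the function names are hypothetical, and the only values taken from the description are the [0, 4GB) shader memzone on iris and the 8GB pool minimum address on anv.

```python
MASK32 = 0xFFFFFFFF

# anv: the pool starts at 8GB, so the high 32 bits of any address in
# that 4GB window are (8GB >> 32) == 0x2.
ANV_POOL_MIN_ADDRESS = 8 << 30
ANV_HIGH_BITS = ANV_POOL_MIN_ADDRESS >> 32

def iris_const_addr(base_lo, offset):
    """iris: high bits are zero, so zero-extending the 32-bit sum
    (the u2u64 step) already yields the full address."""
    return (base_lo + offset) & MASK32

def anv_const_addr(base_lo, offset):
    """anv: pack the 32-bit sum with an immediate high half of 0x2."""
    lo = (base_lo + offset) & MASK32
    return (ANV_HIGH_BITS << 32) | lo

print(hex(iris_const_addr(0x1000, 0x40)))
print(hex(anv_const_addr(0x1000, 0x40)))
```

In both cases only the low 32 bits come from a relocation; the high half is a compile-time constant.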

shader-db on Icelake indicates that this:

Helps instructions: -1.13% in 135 affected programs
Helps spills/fills: -4.08% / -4.18% in 4 affected programs
Gains us 1 SIMD16 compute shader instead of SIMD8

fossil-db on Icelake indicates the following for affected shaders:

Instrs: 10830023 -> 10750080 (-0.74%)
Cycles: 1048521282 -> 1046770379 (-0.17%); split: -0.33%, +0.16%
Subgroup size: 103104 -> 103112 (+0.01%)
Send messages: 570886 -> 570760 (-0.02%)
Loop count: 14428 -> 14429 (+0.01%)
Spill count: 14246 -> 14244 (-0.01%); split: -0.06%, +0.04%
Fill count: 22802 -> 22794 (-0.04%); split: -0.04%, +0.01%
Scratch Memory Size: 654336 -> 662528 (+1.25%)

+@idr +@llandwerlin +@cmarcelo
