mesa merge requests
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests
2021-05-05T19:38:28Z

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5029
WIP: freedreno/ir3: vector(ish) prep
2021-05-05T19:38:28Z
Rob Clark

adreno a3xx+ (ir3) has a sorta vector mode, using a `(rptN)` (for alu instructions) to repeat an instruction 1-3 times. The destination register increments (to the next successive scalar register) for each repeat, and src registers with the `(r)` flag increment. Meaning that srcs can have either `.xxx` or `.xyz` swizzles. The notable benefit of this mode is that src registers without `(r)` get loaded a single time, rather than once per instruction. (And I *think* it helps the shader core pipeline to better prefetch src registers.) This turns out to help a lot in certain cases (i.e. anything other than a bunch of 2src alu ops) where GPR read bandwidth can become the bottleneck.
This MR doesn't turn on vectorizing yet. RA still needs some more work, plus whatever other bugs I've not found yet. But this is the part of my vectorish patch stack that I think is ready(ish) to get some eyes on.

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8961
WIP: ir3: add shader instrumentation to dump registers content
2021-05-05T19:43:17Z
Danylo Piliaiev

The goal of this is to provide a way to inspect what's in the registers of a shader. While we don't have a way to debug interactively, we can instrument a shader to dump the contents of its registers.
It is not in a final state - I'm looking for feedback on the decisions being made, and suggestions on how to proceed further.
Note, this MR contains https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8717 in order to use global load/store/atomics.
(Works on a6xx + Turnip at the moment)
Here is an example of current output of the instrumentation of every instruction for a simple shader with a loop:
```
IR3_SHADER_DEBUG=fs IR3_SHADER_INSTRUMENT=72715a1ef3eb336b914c18f608682c325d860c2c IR3_SHADER_INSTRUMENT_INSTR_REGEX= vkrunner loop.shader_test
```
```
// len = 4
color = vec4(0);
for (int i = 0; i < len; i++)
{
color += vec4(arr[i] / 50.f);
}
```
<details>
<summary>IR3 (click me)</summary>
```
Native code for unnamed FRAG shader (null) with sha1 72715a1ef3eb336b914c18f608682c325d860c2c:
SIMD0
@out(r0.w) out0 (wrmask=0xf)
@const(c133.x) 0x3ca3d70a, 0xd0d0d0d0, 0xd0d0d0d0, 0xd0d0d0d0
mov.u32u32 r1.x, 0
(rpt2)nop
mov.u32u32 r0.w, r1.x
(rpt2)nop
mov.u32u32 r0.z, r0.w
(rpt2)nop
mov.u32u32 r0.y, r0.z
(rpt2)nop
mov.u32u32 r0.x, r0.y
(jp)(nop3) cmps.s.ge p0.x, r1.x, c0.x
(rpt2)nop
br p0.x, #l25
(jp)(nop3) shl.b r1.y, r1.x, 2
cov.u32s16 hr2.w, r1.y
(rpt2)nop
mova a0.x, hr2.w
(rpt5)nop
(ul)cov.s32f32 r1.y, c<a0.x + 4>
(rpt2)nop
mad.f32 r0.w, c133.x, r1.y, r0.w
mad.f32 r0.z, c133.x, r1.y, r0.z
mad.f32 r0.y, c133.x, r1.y, r0.y
mad.f32 r0.x, c133.x, r1.y, r0.x
add.u r1.x, r1.x, 1
jump #l9
l25:
(jp)mov.u32u32 r1.x, r0.z
mov.u32u32 r1.y, r0.y
mov.u32u32 r1.z, r0.x
end
nop
nop
nop
nop
; FRAG: outputs: r0.w (FRAG_RESULT_DATA0)
; FRAG: inputs: r63.x (SYSTEM_VALUE_BARYCENTRIC_PERSP_PIXEL slot=50 cm=3,il=0,b=0)
; FRAG prog 2/1: 54 instr, 33 nops, 21 non-nops, 9 mov, 2 cov, 66 dwords
; FRAG prog 2/1: 0 last-baryf, 0 half, 2 full, 136 constlen
; FRAG prog 2/1: 36 cat0, 11 cat1, 9 cat2, 4 cat3, 0 cat4, 0 cat5, 0 cat6, 0 cat7,
; FRAG prog 2/1: 0 sstall, 0 (ss), 0 (sy), 0 max_sun, 1 loops
; data0: r0.w
```
</details>
<details>
<summary>Instrumented IR3 (click me)</summary>
```
Native code for INTRUMENTED FRAG shader (null) with sha1 72715a1ef3eb336b914c18f608682c325d860c2c:
@out(r0.w) out0 (wrmask=0xf)
@const(c133.x) 0x3ca3d70a, 0xd0d0d0d0, 0xd0d0d0d0, 0xd0d0d0d0
nop
nop
mov.u32u32 r6.y, 0
mov.u32u32 r6.x, 0x6a88004
(rpt3)nop
atomic.g.inc.untyped.1d.u32.1.g r7.x, r6.x, r6.x
(ss)nop
mov.u32u32 r6.x, 0x6a88000
mov.u32u32 r6.w, 0
(ss)nop
mov.u32u32 r6.z, 12
(rpt3)nop
atomic.g.add.untyped.1d.u32.1.g r6.z, r6.x, r6.z
mov.u32u32 r7.y, 0
(rpt3)nop
(sy)stg.u32 g[r6.z], r7.x, 2
mov.u32u32 r1.x, 0
(ss)mov.u32u32 r7.y, 2
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r1.x, 1
(rpt2)nop
(ss)nop
mov.u32u32 r6.z, 16
(rpt3)nop
atomic.g.add.untyped.1d.u32.1.g r6.z, r6.x, r6.z
mov.u32u32 r7.y, 1
(rpt3)nop
(sy)stg.u32 g[r6.z], r7.x, 2
(ss)mov.u32u32 r7.y, 3
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r1.x, 1
mov.u32u32 r0.w, r1.x
(ss)mov.u32u32 r7.y, 2
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r0.w, 1
(rpt2)nop
(ss)nop
mov.u32u32 r6.z, 16
(rpt3)nop
atomic.g.add.untyped.1d.u32.1.g r6.z, r6.x, r6.z
mov.u32u32 r7.y, 2
(rpt3)nop
(sy)stg.u32 g[r6.z], r7.x, 2
(ss)mov.u32u32 r7.y, 3
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r0.w, 1
mov.u32u32 r0.z, r0.w
(ss)mov.u32u32 r7.y, 2
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r0.z, 1
(rpt2)nop
(ss)nop
mov.u32u32 r6.z, 16
(rpt3)nop
atomic.g.add.untyped.1d.u32.1.g r6.z, r6.x, r6.z
mov.u32u32 r7.y, 3
(rpt3)nop
(sy)stg.u32 g[r6.z], r7.x, 2
(ss)mov.u32u32 r7.y, 3
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r0.z, 1
mov.u32u32 r0.y, r0.z
(ss)mov.u32u32 r7.y, 2
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r0.y, 1
(rpt2)nop
(ss)nop
mov.u32u32 r6.z, 16
(rpt3)nop
atomic.g.add.untyped.1d.u32.1.g r6.z, r6.x, r6.z
mov.u32u32 r7.y, 4
(rpt3)nop
(sy)stg.u32 g[r6.z], r7.x, 2
(ss)mov.u32u32 r7.y, 3
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r0.y, 1
mov.u32u32 r0.x, r0.y
(ss)mov.u32u32 r7.y, 2
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r0.x, 1
(ss)nop
mov.u32u32 r6.z, 12
(rpt3)nop
atomic.g.add.untyped.1d.u32.1.g r6.z, r6.x, r6.z
mov.u32u32 r7.y, 5
(rpt3)nop
(sy)stg.u32 g[r6.z], r7.x, 2
(ss)mov.u32u32 r7.y, 2
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r1.x, 1
(jp)(nop3) cmps.s.ge p0.x, r1.x, c0.x
(rpt2)nop
(ss)nop
mov.u32u32 r6.z, 16
(rpt3)nop
atomic.g.add.untyped.1d.u32.1.g r6.z, r6.x, r6.z
mov.u32u32 r7.y, 6
(rpt3)nop
(sy)stg.u32 g[r6.z], r7.x, 2
(ss)mov.u32u32 r7.y, 2
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r0.x, 1
(ss)mov.u32u32 r7.y, 3
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r0.x, 1
br p0.x, #l242
(ss)nop
mov.u32u32 r6.z, 16
(rpt3)nop
atomic.g.add.untyped.1d.u32.1.g r6.z, r6.x, r6.z
mov.u32u32 r7.y, 7
(rpt3)nop
(sy)stg.u32 g[r6.z], r7.x, 2
(ss)mov.u32u32 r7.y, 3
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r1.x, 1
(jp)(nop3) shl.b r1.y, r1.x, 2
(ss)mov.u32u32 r7.y, 2
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r1.y, 1
(ss)nop
mov.u32u32 r6.z, 16
(rpt3)nop
atomic.g.add.untyped.1d.u32.1.g r6.z, r6.x, r6.z
mov.u32u32 r7.y, 8
(rpt3)nop
(sy)stg.u32 g[r6.z], r7.x, 2
(ss)mov.u32u32 r7.y, 3
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r1.y, 1
cov.u32s16 hr2.w, r1.y
(ss)mov.u32u32 r7.y, 4
(rpt3)nop
(sy)stg.u16 g[r6.z+r7.y], r2.w, 1
(rpt2)nop
(ss)nop
mov.u32u32 r6.z, 12
(rpt3)nop
atomic.g.add.untyped.1d.u32.1.g r6.z, r6.x, r6.z
mov.u32u32 r7.y, 9
(rpt3)nop
(sy)stg.u32 g[r6.z], r7.x, 2
(ss)mov.u32u32 r7.y, 4
(rpt3)nop
(sy)stg.u16 g[r6.z+r7.y], r2.w, 1
mova a0.x, hr2.w
(rpt5)nop
(ss)nop
mov.u32u32 r6.z, 12
(rpt3)nop
atomic.g.add.untyped.1d.u32.1.g r6.z, r6.x, r6.z
mov.u32u32 r7.y, 10
(rpt3)nop
(sy)stg.u32 g[r6.z], r7.x, 2
(ul)cov.s32f32 r1.y, c<a0.x + 4>
(ss)mov.u32u32 r7.y, 2
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r1.y, 1
(rpt2)nop
(ss)nop
mov.u32u32 r6.z, 20
(rpt3)nop
atomic.g.add.untyped.1d.u32.1.g r6.z, r6.x, r6.z
mov.u32u32 r7.y, 11
(rpt3)nop
(sy)stg.u32 g[r6.z], r7.x, 2
(ss)mov.u32u32 r7.y, 3
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r1.y, 1
(ss)mov.u32u32 r7.y, 4
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r0.w, 1
mad.f32 r0.w, c133.x, r1.y, r0.w
(ss)mov.u32u32 r7.y, 2
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r0.w, 1
(ss)nop
mov.u32u32 r6.z, 20
(rpt3)nop
atomic.g.add.untyped.1d.u32.1.g r6.z, r6.x, r6.z
mov.u32u32 r7.y, 12
(rpt3)nop
(sy)stg.u32 g[r6.z], r7.x, 2
(ss)mov.u32u32 r7.y, 3
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r1.y, 1
(ss)mov.u32u32 r7.y, 4
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r0.z, 1
mad.f32 r0.z, c133.x, r1.y, r0.z
(ss)mov.u32u32 r7.y, 2
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r0.z, 1
(ss)nop
mov.u32u32 r6.z, 20
(rpt3)nop
atomic.g.add.untyped.1d.u32.1.g r6.z, r6.x, r6.z
mov.u32u32 r7.y, 13
(rpt3)nop
(sy)stg.u32 g[r6.z], r7.x, 2
(ss)mov.u32u32 r7.y, 3
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r1.y, 1
(ss)mov.u32u32 r7.y, 4
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r0.y, 1
mad.f32 r0.y, c133.x, r1.y, r0.y
(ss)mov.u32u32 r7.y, 2
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r0.y, 1
(ss)nop
mov.u32u32 r6.z, 20
(rpt3)nop
atomic.g.add.untyped.1d.u32.1.g r6.z, r6.x, r6.z
mov.u32u32 r7.y, 14
(rpt3)nop
(sy)stg.u32 g[r6.z], r7.x, 2
(ss)mov.u32u32 r7.y, 3
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r1.y, 1
(ss)mov.u32u32 r7.y, 4
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r0.x, 1
mad.f32 r0.x, c133.x, r1.y, r0.x
(ss)mov.u32u32 r7.y, 2
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r0.x, 1
(ss)nop
mov.u32u32 r6.z, 16
(rpt3)nop
atomic.g.add.untyped.1d.u32.1.g r6.z, r6.x, r6.z
mov.u32u32 r7.y, 15
(rpt3)nop
(sy)stg.u32 g[r6.z], r7.x, 2
(ss)mov.u32u32 r7.y, 3
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r1.x, 1
add.u r1.x, r1.x, 1
(ss)mov.u32u32 r7.y, 2
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r1.x, 1
jump #l77
l242:
(ss)nop
mov.u32u32 r6.z, 16
(rpt3)nop
atomic.g.add.untyped.1d.u32.1.g r6.z, r6.x, r6.z
mov.u32u32 r7.y, 16
(rpt3)nop
(sy)stg.u32 g[r6.z], r7.x, 2
(ss)mov.u32u32 r7.y, 3
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r0.z, 1
(jp)mov.u32u32 r1.x, r0.z
(ss)mov.u32u32 r7.y, 2
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r1.x, 1
(ss)nop
mov.u32u32 r6.z, 16
(rpt3)nop
atomic.g.add.untyped.1d.u32.1.g r6.z, r6.x, r6.z
mov.u32u32 r7.y, 17
(rpt3)nop
(sy)stg.u32 g[r6.z], r7.x, 2
(ss)mov.u32u32 r7.y, 3
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r0.y, 1
mov.u32u32 r1.y, r0.y
(ss)mov.u32u32 r7.y, 2
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r1.y, 1
(ss)nop
mov.u32u32 r6.z, 16
(rpt3)nop
atomic.g.add.untyped.1d.u32.1.g r6.z, r6.x, r6.z
mov.u32u32 r7.y, 18
(rpt3)nop
(sy)stg.u32 g[r6.z], r7.x, 2
(ss)mov.u32u32 r7.y, 3
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r0.x, 1
mov.u32u32 r1.z, r0.x
(ss)mov.u32u32 r7.y, 2
(rpt3)nop
(sy)stg.u32 g[r6.z+r7.y], r1.z, 1
end
nop
nop
nop
nop
; FRAG: outputs: r0.w (FRAG_RESULT_DATA0)
; FRAG: inputs: r63.x (SYSTEM_VALUE_BARYCENTRIC_PERSP_PIXEL slot=50 cm=3,il=0,b=0)
; FRAG prog 2/1: 541 instr, 363 nops, 178 non-nops, 89 mov, 2 cov, 578 dwords
; FRAG prog 2/1: 0 last-baryf, 0 half, 8 full, 136 constlen
; FRAG prog 2/1: 366 cat0, 91 cat1, 9 cat2, 4 cat3, 0 cat4, 0 cat5, 77 cat6, 0 cat7,
; FRAG prog 2/1: 0 sstall, 58 (ss), 57 (sy), 0 max_sun, 1 loops
; data0: r0.w
```
</details>
<details>
<summary>Output (click me)</summary>
```
Data written 13625002
Invocations 62500
[0/0]: mov.u32u32 r1.x, 0
dst(r1.x)=00000000 /* 0.000000 */
[0/1]: mov.u32u32 r0.w, r1.x
dst(r0.w)=00000000 /* 0.000000 */ src(r1.x)=00000000 /* 0.000000 */
[0/2]: mov.u32u32 r0.z, r0.w
dst(r0.z)=00000000 /* 0.000000 */ src(r0.w)=00000000 /* 0.000000 */
[0/3]: mov.u32u32 r0.y, r0.z
dst(r0.y)=00000000 /* 0.000000 */ src(r0.z)=00000000 /* 0.000000 */
[0/4]: mov.u32u32 r0.x, r0.y
dst(r0.x)=00000000 /* 0.000000 */ src(r0.y)=00000000 /* 0.000000 */
[0/5]: (jp)(nop3) cmps.s.ge p0.x, r1.x, c0.x
src(r1.x)=00000000 /* 0.000000 */
[0/6]: br p0.x, #14
src(r0.x)=00000000 /* 0.000000 */ src(r0.x)=00000000 /* 0.000000 */
[0/7]: (jp)(nop3) shl.b r1.y, r1.x, 2
dst(r1.y)=00000000 /* 0.000000 */ src(r1.x)=00000000 /* 0.000000 */
[0/8]: cov.u32s16 hr2.w, r1.y
dst(hr2.w)=00000000 /* 0.000000 */ src(r1.y)=00000000 /* 0.000000 */
[0/9]: mova a0.x, hr2.w
src(hr2.w)=00000000 /* 0.000000 */
[0/10]: (ul)cov.s32f32 r1.y, c<a0.x + 4>
dst(r1.y)=0x40a00000 /* 5.000000 */
[0/11]: mad.f32 r0.w, c133.x, r1.y, r0.w
dst(r0.w)=0x3dcccccc /* 0.100000 */ src(r1.y)=0x40a00000 /* 5.000000 */ src(r0.w)=0x3dcccccc /* 0.100000 */
[0/12]: mad.f32 r0.z, c133.x, r1.y, r0.z
dst(r0.z)=0x3dcccccc /* 0.100000 */ src(r1.y)=0x40a00000 /* 5.000000 */ src(r0.z)=0x3dcccccc /* 0.100000 */
[0/13]: mad.f32 r0.y, c133.x, r1.y, r0.y
dst(r0.y)=0x3dcccccc /* 0.100000 */ src(r1.y)=0x40a00000 /* 5.000000 */ src(r0.y)=0x3dcccccc /* 0.100000 */
[0/14]: mad.f32 r0.x, c133.x, r1.y, r0.x
dst(r0.x)=0x3dcccccc /* 0.100000 */ src(r1.y)=0x40a00000 /* 5.000000 */ src(r0.x)=0x3dcccccc /* 0.100000 */
[0/15]: add.u r1.x, r1.x, 1
dst(r1.x)=0x3dcccccc /* 0.100000 */ src(r1.x)=0x000001 /* 0.000000 */
[0/5]: (jp)(nop3) cmps.s.ge p0.x, r1.x, c0.x
src(r1.x)=0x000001 /* 0.000000 */
[0/6]: br p0.x, #14
src(r0.x)=0x3dcccccc /* 0.100000 */ src(r0.x)=0x3dcccccc /* 0.100000 */
[0/7]: (jp)(nop3) shl.b r1.y, r1.x, 2
dst(r1.y)=0x000004 /* 0.000000 */ src(r1.x)=0x000001 /* 0.000000 */
[0/8]: cov.u32s16 hr2.w, r1.y
dst(hr2.w)=0x000004 /* 0.000000 */ src(r1.y)=0x040004 /* 0.000000 */
[0/9]: mova a0.x, hr2.w
src(hr2.w)=0x000004 /* 0.000000 */
[0/10]: (ul)cov.s32f32 r1.y, c<a0.x + 4>
dst(r1.y)=0x41200000 /* 10.000000 */
[0/11]: mad.f32 r0.w, c133.x, r1.y, r0.w
dst(r0.w)=0x3e999999 /* 0.300000 */ src(r1.y)=0x41200000 /* 10.000000 */ src(r0.w)=0x3e999999 /* 0.300000 */
[0/12]: mad.f32 r0.z, c133.x, r1.y, r0.z
dst(r0.z)=0x3e999999 /* 0.300000 */ src(r1.y)=0x41200000 /* 10.000000 */ src(r0.z)=0x3e999999 /* 0.300000 */
[0/13]: mad.f32 r0.y, c133.x, r1.y, r0.y
dst(r0.y)=0x3e999999 /* 0.300000 */ src(r1.y)=0x41200000 /* 10.000000 */ src(r0.y)=0x3e999999 /* 0.300000 */
[0/14]: mad.f32 r0.x, c133.x, r1.y, r0.x
dst(r0.x)=0x3e999999 /* 0.300000 */ src(r1.y)=0x41200000 /* 10.000000 */ src(r0.x)=0x3e999999 /* 0.300000 */
[0/15]: add.u r1.x, r1.x, 1
dst(r1.x)=0x3e999999 /* 0.300000 */ src(r1.x)=0x000002 /* 0.000000 */
[0/5]: (jp)(nop3) cmps.s.ge p0.x, r1.x, c0.x
src(r1.x)=0x000002 /* 0.000000 */
[0/6]: br p0.x, #14
src(r0.x)=0x3e999999 /* 0.300000 */ src(r0.x)=0x3e999999 /* 0.300000 */
[0/7]: (jp)(nop3) shl.b r1.y, r1.x, 2
dst(r1.y)=0x000008 /* 0.000000 */ src(r1.x)=0x000002 /* 0.000000 */
[0/8]: cov.u32s16 hr2.w, r1.y
dst(hr2.w)=0x000008 /* 0.000000 */ src(r1.y)=0x080008 /* 0.000000 */
[0/9]: mova a0.x, hr2.w
src(hr2.w)=0x000008 /* 0.000000 */
[0/10]: (ul)cov.s32f32 r1.y, c<a0.x + 4>
dst(r1.y)=0x41700000 /* 15.000000 */
[0/11]: mad.f32 r0.w, c133.x, r1.y, r0.w
dst(r0.w)=0x3f199999 /* 0.600000 */ src(r1.y)=0x41700000 /* 15.000000 */ src(r0.w)=0x3f199999 /* 0.600000 */
[0/12]: mad.f32 r0.z, c133.x, r1.y, r0.z
dst(r0.z)=0x3f199999 /* 0.600000 */ src(r1.y)=0x41700000 /* 15.000000 */ src(r0.z)=0x3f199999 /* 0.600000 */
[0/13]: mad.f32 r0.y, c133.x, r1.y, r0.y
dst(r0.y)=0x3f199999 /* 0.600000 */ src(r1.y)=0x41700000 /* 15.000000 */ src(r0.y)=0x3f199999 /* 0.600000 */
[0/14]: mad.f32 r0.x, c133.x, r1.y, r0.x
dst(r0.x)=0x3f199999 /* 0.600000 */ src(r1.y)=0x41700000 /* 15.000000 */ src(r0.x)=0x3f199999 /* 0.600000 */
[0/15]: add.u r1.x, r1.x, 1
dst(r1.x)=0x3f199999 /* 0.600000 */ src(r1.x)=0x000003 /* 0.000000 */
[0/5]: (jp)(nop3) cmps.s.ge p0.x, r1.x, c0.x
src(r1.x)=0x000003 /* 0.000000 */
[0/6]: br p0.x, #14
src(r0.x)=0x3f199999 /* 0.600000 */ src(r0.x)=0x3f199999 /* 0.600000 */
[0/7]: (jp)(nop3) shl.b r1.y, r1.x, 2
dst(r1.y)=0x00000c /* 0.000000 */ src(r1.x)=0x000003 /* 0.000000 */
[0/8]: cov.u32s16 hr2.w, r1.y
dst(hr2.w)=0x00000c /* 0.000000 */ src(r1.y)=0x0c000c /* 0.000000 */
[0/9]: mova a0.x, hr2.w
src(hr2.w)=0x00000c /* 0.000000 */
[0/10]: (ul)cov.s32f32 r1.y, c<a0.x + 4>
dst(r1.y)=0x41a00000 /* 20.000000 */
[0/11]: mad.f32 r0.w, c133.x, r1.y, r0.w
dst(r0.w)=0x3f7fffff /* 1.000000 */ src(r1.y)=0x41a00000 /* 20.000000 */ src(r0.w)=0x3f7fffff /* 1.000000 */
[0/12]: mad.f32 r0.z, c133.x, r1.y, r0.z
dst(r0.z)=0x3f7fffff /* 1.000000 */ src(r1.y)=0x41a00000 /* 20.000000 */ src(r0.z)=0x3f7fffff /* 1.000000 */
[0/13]: mad.f32 r0.y, c133.x, r1.y, r0.y
dst(r0.y)=0x3f7fffff /* 1.000000 */ src(r1.y)=0x41a00000 /* 20.000000 */ src(r0.y)=0x3f7fffff /* 1.000000 */
[0/14]: mad.f32 r0.x, c133.x, r1.y, r0.x
dst(r0.x)=0x3f7fffff /* 1.000000 */ src(r1.y)=0x41a00000 /* 20.000000 */ src(r0.x)=0x3f7fffff /* 1.000000 */
[0/15]: add.u r1.x, r1.x, 1
dst(r1.x)=0x3f7fffff /* 1.000000 */ src(r1.x)=0x000004 /* 0.000000 */
[0/5]: (jp)(nop3) cmps.s.ge p0.x, r1.x, c0.x
src(r1.x)=0x000004 /* 0.000000 */
[0/6]: br p0.x, #14
src(r0.x)=0x3f7fffff /* 1.000000 */ src(r0.x)=0x3f7fffff /* 1.000000 */
[0/16]: (jp)mov.u32u32 r1.x, r0.z
dst(r1.x)=0x3f7fffff /* 1.000000 */ src(r0.z)=0x3f7fffff /* 1.000000 */
[0/17]: mov.u32u32 r1.y, r0.y
dst(r1.y)=0x3f7fffff /* 1.000000 */ src(r0.y)=0x3f7fffff /* 1.000000 */
[0/18]: mov.u32u32 r1.z, r0.x
dst(r1.z)=0x3f7fffff /* 1.000000 */ src(r0.x)=0x3f7fffff /* 1.000000 */
```
</details>
Current decisions:
- Shader is instrumented after all compilation is done in order to be able to work with overridden shaders (via `IR3_SHADER_OVERRIDE_PATH`) and not to change the register allocation (which may be undesirable?).
- So it is done after RA, meaning we need some free regs (since there is no spilling in RA at the moment - it's not that big of an issue I think);
- Running RA after instrumentation may be useful once spilling is supported. However, the other way would be to reduce the upper limit of registers in RA for the shader we want to instrument in order to always have some free regs;
- Without an RA pass, the jump offsets are manually retargeted;
- Shader to instrument is targeted by its hash (the very same that is used for override) via `IR3_SHADER_INSTRUMENT`. The instructions for which registers should be dumped can be filtered via `IR3_SHADER_INSTRUMENT_INSTR_REGEX`, e.g. `IR3_SHADER_INSTRUMENT_INSTR_REGEX="\(sy\)stg\.u32"`;
- The space for registers is allocated with per-instruction granularity, meaning the dumps of different invocations are interleaved in the global buffer.
- Pros: we can dump an arbitrary number of instructions from one shader invocation;
- Cons: it is slower because it requires doing an `atomic.add` for every instruction and waiting for its result. An alternative is to pre-allocate memory for each invocation and calculate the offset without global atomics.
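The trade-off between the two allocation schemes can be illustrated with a small host-side model. This is only a sketch in Python, not the shader code from this MR: `BumpAllocator`, `preallocated_offset` and the fixed per-invocation budget are hypothetical names introduced for illustration. The record sizes (12 and 16 bytes) are the values loaded into `r6.z` before `atomic.g.add` in the instrumented listing above.

```python
class BumpAllocator:
    """Models the current scheme: one shared cursor advanced by an atomic add."""

    def __init__(self):
        self.cursor = 0

    def alloc(self, size_bytes):
        # Stands in for `atomic.g.add`: return the old cursor, then advance it.
        offset = self.cursor
        self.cursor += size_bytes
        return offset


def preallocated_offset(invocation_id, used_bytes, size_bytes, budget_bytes):
    """Models the alternative: every invocation owns a fixed slice (assumed budget)."""
    if used_bytes + size_bytes > budget_bytes:
        raise OverflowError("invocation exceeded its pre-allocated budget")
    # No global atomic needed: the offset depends only on per-invocation state.
    return invocation_id * budget_bytes + used_bytes


# Two invocations each dump a 12-byte record and then a 16-byte record.
# With the shared cursor the records of both invocations end up interleaved.
alloc = BumpAllocator()
interleaved = [alloc.alloc(size) for size in (12, 12, 16, 16)]
print(interleaved)  # [0, 12, 24, 40]

# With a (hypothetical) 64-byte budget, invocation 1 writes its second record here:
print(preallocated_offset(1, 12, 16, 64))  # 76
```

The shared cursor never runs out of per-invocation space, which is the "arbitrary number of instructions" advantage; the pre-allocated variant trades that for not having to wait on an atomic.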
In the global buffer, the dump of each instruction has the following structure:
```
invocation_id: u32
instruction_id: u32
dst_values: u32[num_of_dest_regs]
src_values: u32[num_of_src_regs]
```
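A host-side decoder for this layout could look roughly like the sketch below. It is an illustration, not the parser from this MR: `decode_records` and `reg_counts` are hypothetical names, and it assumes little-endian u32 words, tightly packed records, and a buffer slice that starts at the first record (any counters stored before the records are not handled).

```python
import struct


def decode_records(buf, reg_counts):
    """Split a dump buffer into per-instruction records.

    `reg_counts` maps instruction_id -> (num_dst, num_src). The host is assumed
    to know these counts from the shader it instrumented.
    """
    records = []
    offset = 0
    while offset + 8 <= len(buf):
        invocation_id, instruction_id = struct.unpack_from("<II", buf, offset)
        offset += 8
        num_dst, num_src = reg_counts[instruction_id]
        values = struct.unpack_from(f"<{num_dst + num_src}I", buf, offset)
        offset += 4 * (num_dst + num_src)
        records.append({
            "invocation_id": invocation_id,
            "instruction_id": instruction_id,
            "dst": list(values[:num_dst]),
            "src": list(values[num_dst:]),
        })
    return records


# Instruction 0 has one dst and no register src; instruction 11 has one dst and two srcs.
reg_counts = {0: (1, 0), 11: (1, 2)}
buf = struct.pack("<III", 0, 0, 0)
buf += struct.pack("<IIIII", 0, 11, 0x3DCCCCCC, 0x40A00000, 0x3DCCCCCC)

records = decode_records(buf, reg_counts)
# Keep only one invocation, similar to printing just the first invocation.
first = [r for r in records if r["invocation_id"] == 0]
```

Because records of different invocations are interleaved, filtering by `invocation_id` has to happen after decoding rather than by slicing the buffer.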
Unavoidable limitations:
- Requires a6xx+ GPU due to usage of global load/store/atomics;
- The instrumentation **will** interfere with cases where the bug is due to improper synchronization between instructions or shader invocations;
- Dumping all registers for all invocations at once may be too much for medium/large shaders with many invocations, both due to memory and time constraints.
Current limitations/issues:
- Registers are written one by one instead of writing up to four at once. Writing one by one is slower but requires fewer registers;
- The registers are dumped for all invocations. This could be infeasible if we want to dump registers for each instruction in a huge shader which has too many invocations. I think the solution would be to have a shader binary with non-instrumented and instrumented code together, then decide which one to run based on some condition;
- Some instructions write several registers or consume several registers from one source, most of them aren't handled properly at the moment.
- Currently only the result of first invocation is printed
- There is no check for clashing with input registers?
- Constant registers aren't printed
My current plan:
- Keep writing registers one by one (doesn't affect the end result);
- Do not pre-allocate space in global memory (doesn't affect the end result);
- Handle the instructions which write more than one register and read several registers from one src;
- Get feedback on the output format and improve it based on that;
- Add a way to control which invocation(s) are printed;
In any case, I'd like to get feedback before proceeding further.

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9717
ir3/validate: allow a1 as source only for cat5 with A1EN
2021-05-05T19:43:11Z
Danylo Piliaiev

And ignore a1 size while validating cat5 - it is always half-reg.
Without this, validation fails on:
```
sam.base0 (f32)(xyzw)r0.x, r0.z, s#1, a1.x
```
(when validating ir3 deserialized from assembly)

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12655
Draft: freedreno/ir3: pre-RA scheduler tuning
2021-11-03T17:20:59Z
Rob Clark

Looking at #5307 I noticed that pre-RA sched was making a proper mess of the attached shader. The problem was that it preferred picking an instruction that wouldn't defer but increased register pressure over an instruction that would defer but decreased register pressure (or was neutral). We probably want to prioritize reducing register pressure harder in pre-RA scheduling.
Marked as draft because I've not had time to go thru all of shader-db and perf traces tests yet, and a bit undecided on the last two patches. But probably the current limit of 8 outstanding tex fetches is too high.
[skia-34.shader_test](/uploads/8a9c00b3b0cafb7ebf27fdc1fbe0cdb1/skia-34.shader_test)

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15788
WIP: ir3: invsr instruction
2022-04-07T10:00:36Z
Danylo Piliaiev

TODO: Find what `invsr` does.
The src has to be a shared reg. Only produced for fragment shaders, dst has to be a dummy.
Accidentally found it by looking again at the GTA V pipeline for cables where I had trouble with tessellation in the past.
Is it **INV**alidate **S**ha**R**ed? Or something else... I was not able to understand its function with computerator.
Here is the context (shader_test for vkrunner, cmdstream, and blob's disassembly log): [invsr.tar.gz](/uploads/a55d909350107210f2484b63a9cd84fb/invsr.tar.gz)
This instruction happens a few more times throughout the shader dump I made by running fossilize of many d3d11 games on blob, but surrounding context is always similar. And the shared reg is never used after `invsr`.
```
0[203100f4_00000000] mova a0, hc0.x;
1[201560fc_000000c0] (ul)invsr dr63.x, sr48.x;
2[03820000_0000000f] shps #15;
3[02820000_0000000e] getone #14;
4[204880f5_00000000] mova1 a1, 0;
5[00000500_00000000] (rpt5)nop ;
6[c0360e03_0cc78100] ldc.4.k.mode4.base0.x c[a1], 12, 7;
7[204890f5_00000010] (ss)mova1 a1, 16;
8[00000500_00000000] (rpt5)nop ;
9[c0361002_0cc78100] ldc.3.k.mode4.base0.x c[a1], 12, 8;
10[204890f5_00000020] (ss)mova1 a1, 32;
11[00000500_00000000] (rpt5)nop ;
12[c0361212_28c78100] ldc.19.k.mode4.base0.x c[a1], 40, 9;
13[204890f5_00000070] (ss)mova1 a1, 112;
14[00000500_00000000] (rpt5)nop ;
15[c0361401_0cc78100] ldc.2.k.mode4.base0.x c[a1], 12, 10;
16[14021000_00000000] (sy)(ss)shpe ;
17[4f300002_00002007] (jp)bary.f r0.z, 7, r0.x;
18[47300904_00002004] (rpt1)bary.f r1.x, (r)4, r0.x;
...
```

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24376
ir3: Add EOLM and EOGM a7xx flags to NOP
2023-09-18T11:41:44Z
Danylo Piliaiev

### What does this MR do and why?
ir3: Add EOLM and EOGM a7xx flags to NOP
Apparently the ignored bits have meaning.
- EOLM - Is set on a NOP after the last cat6 instruction. Must be set outside of control flow including preambles. Doesn't seem to affect correctness.
- EOGM - Is set on a NOP after the last cat5/cat6 instruction.
----
Impact on perf is not tested.

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25077
ir3/a7xx: Add post-RA pass to track liveness and insert (last)
2024-01-01T06:24:56Z
Mark Collins

Introduces a backwards dataflow analysis pass that determines when a register is always written to before being read again. It works in a similar manner to SSA liveness but is performed after RA, and we use it to determine when we can insert `(last)` on src regs on A7XX.
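The core idea of such a backwards walk can be sketched for a single basic block. This is a simplified Python model, not the pass from this MR: `mark_last_reads` is a hypothetical helper, instructions are reduced to `(dst, [srcs])` tuples of register names, and the set of registers still needed after the block (`live_out`) is assumed to be given. The real pass has to propagate this information across control flow.

```python
def mark_last_reads(instrs, live_out):
    """Return, per instruction, the set of src registers read there for the last time.

    instrs:   list of (dst, [srcs]) in program order.
    live_out: registers that may still be read after the block.
    """
    live = set(live_out)
    last = [set() for _ in instrs]
    for i in range(len(instrs) - 1, -1, -1):
        dst, srcs = instrs[i]
        # The write kills the previous value: nothing later depends on it.
        live.discard(dst)
        for src in srcs:
            if src not in live:
                # No later read before the next write: this is the last read.
                last[i].add(src)
        live.update(srcs)
    return last


instrs = [
    ("r1.y", ["r1.x"]),          # shl.b   r1.y, r1.x, 2
    ("r0.w", ["r1.y", "r0.w"]),  # mad.f32 r0.w, c133.x, r1.y, r0.w
    ("r0.z", ["r1.y", "r0.z"]),  # mad.f32 r0.z, c133.x, r1.y, r0.z
    ("r1.x", ["r1.x"]),          # add.u   r1.x, r1.x, 1
]
last = mark_last_reads(instrs, live_out={"r0.w", "r0.z", "r1.x"})
```

In this example `r1.y` is only marked on the second `mad`, since the first one is followed by another read of the same value.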
Observations:
- **Conformance:** These changes pass all mustpass VK-CTS tests under `dEQP-VK.pipeline.*`
- **Performance:** There was next to no difference in the performance of 3DMark Wild Life (Normal/Extreme); it's possible that any performance advantage provided by this is being entirely overshadowed by the costs of sysmem rendering.

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25916
nir: I/O vector access improvements
2024-04-03T17:39:53Z
Faith Ekstrand

This MR does two things:
1. On NVIDIA, we can indirect access anything, including within a vector and I'd like to avoid lowering to if-ladders if we can. The first 4 commits of this MR make it so that we can indirect on compact variables such as tess levels and clip/cull distances. The annoying bit is that this means changing the interface of the `type_size` callback to `nir_lower_io()` which involves touching a lot of drivers.
2. We also need to get SPIR-V doing the right thing on TCS outputs. Right now, if the SPIR-V has a write to a single component of a vector, `spirv_to_nir` emits a load/insert/store pattern which is potentially racy. On NVIDIA, there are CTS tests which actually hit this race so I need this for passing CTS. We have a NIR pass which lowers writes of this form to an if-ladder with write-masks which is what we use for most drivers.

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27462
freedreno, turnip, ir3: Early preamble
2024-04-29T09:42:34Z
Connor Abbott

In addition to introducing the scalar ALU, in a650 a copy of the scalar ALU and some other units were added to the HLSQ, which dispatches work to the uSPTPs (shader cores), and it can now execute the preamble part of shaders "early," i.e. before work is dispatched, rather than as part of the first wave dispatched to each uSPTP. This can help hide the latency of executing the preamble. Traditionally, the HLSQ also prefetched various state via the `CP_LOAD_STATE` packet, but recently more and more of this functionality has been moving to the preamble, with the implicit expectation that it is executed in an early preamble:
- Since a730 shared consts (Vulkan push constants) are set up in the preamble.
- Since a730 descriptors are prefetched in the preamble.
- Since a750 `CP_LOAD_STATE` to set up constants is now deprecated and severely limited, so most driver params come from UBOs that are pushed to the constant file in a preamble.
As more and more things are being executed in the preamble, hiding the latency becomes more important.
We can't always execute a preamble early. Early preambles cannot have "normal" (i.e. not shared) registers or predicate registers (so they cannot have control flow). If the preamble contains these, then we have to fall back to using it as a normal "late" preamble.
This MR implements early preamble, based on !22075 which implements the scalar ALU. While this doesn't actually depend on that series, without scalar ALU the cases where we can use early preamble are severely limited.

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27776
WIP: tu, ir3: VK_KHR_shader_atomic_int64 for >a740
2024-05-22T07:30:25Z
Amber Harmonia

Passes CTS on a740 + custom tests.
I have not been able to test this on real applications (UE5, etc) yet, so just mr-ing for review.
Passing all of `dEQP-VK.glsl.atomic_operations.*64bit*`

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28254
Draft: tu: KHR_8bit_storage support
2024-05-13T11:00:51Z
Zan Dobersek

Support for KHR_8bit_storage in Turnip. Addresses #9979.

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28341
ir3: add support for repeated instructions
2024-05-14T14:57:20Z
Job Noorman

ir3 is a scalar architecture and as such most instructions cannot be vectorized. However, many instructions support the `(rptN)` modifier that allows us to mimic vector instructions. Whenever an instruction has the `(rptN)` modifier set, it will execute N more times, incrementing its destination register for each repetition. Additionally, source registers with the `(r)` flag set will also be incremented.
For example:
```
(rpt1)add.f r0.x, (r)r1.x, r2.x
```
is the same as:
```
add.f r0.x, r1.x, r2.x
add.f r0.y, r1.y, r2.x
```
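The expansion rule can also be written down as a few lines of Python. This is an illustrative model only: `reg_add` and `expand_rpt` are hypothetical helpers, registers are plain strings such as `r0.x`, and stepping past `.w` is assumed to continue with `.x` of the next register.

```python
COMPONENTS = "xyzw"


def reg_add(reg, n):
    """Return the scalar register `n` slots after `reg`, e.g. r0.w + 1 -> r1.x."""
    name, comp = reg.split(".")
    prefix = name.rstrip("0123456789")
    flat = int(name[len(prefix):]) * 4 + COMPONENTS.index(comp) + n
    return f"{prefix}{flat // 4}.{COMPONENTS[flat % 4]}"


def expand_rpt(rpt, opcode, dst, srcs):
    """Expand `(rptN)opcode dst, srcs...` into N + 1 scalar instructions.

    srcs is a list of (register, has_r_flag) pairs.
    """
    expanded = []
    for i in range(rpt + 1):
        # The dst always steps; a src steps only when it carries the (r) flag.
        operands = [reg_add(dst, i)]
        operands += [reg_add(reg, i) if has_r else reg for reg, has_r in srcs]
        expanded.append(f"{opcode} " + ", ".join(operands))
    return expanded


# (rpt1)add.f r0.x, (r)r1.x, r2.x
for line in expand_rpt(1, "add.f", "r0.x", [("r1.x", True), ("r2.x", False)]):
    print(line)
# add.f r0.x, r1.x, r2.x
# add.f r0.y, r1.y, r2.x
```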
The main benefit of using repeated instructions is a reduction in code size. Since every iteration is still executed as a scalar instruction, there's no direct benefit in terms of runtime. The only exception seems to be for 3-source instructions pre-a7xx: if one of the sources is constant (i.e., without the `(r)` flag), a repeated instruction executes faster than the equivalent expanded sequence. Presumably, this is because the ALU only has 2 register read ports. I have not been able to measure this difference on a7xx though.
Support for repeated instructions consists of two parts. First, we need to make sure NIR is (mostly) vectorized when translating to ir3. I have not been able to find a way to keep NIR vectorized all the way and still generate decent code. Therefore, I have taken the approach of vectorizing the (scalarized) NIR right before translating it to ir3.
Secondly, ir3 needs to be adapted to ingest vectorized NIR and translate it to repeated instructions. To this end, I have introduced the concept of "repeat groups" to ir3. A repeat group is a group of instructions that were produced from a vectorized NIR operation and linked together. They are, however, still separate scalar instructions until quite late.
More concretely:
1. Instruction emission: for every vectorized NIR operation, emit separate scalar instructions for its components and link them together in a repeat group. For every instruction builder `ir3_X`, a new repeat builder `ir3_X_rpt` has been added to facilitate this.
2. Optimization passes: for now, repeat groups are completely ignored by optimizations.
3. Pre-RA: clean up repeat groups that can never be merged into an actual `rptN` instruction (e.g., because their instructions are not consecutive anymore). This ensures no useless merge sets will be created in the next step.
4. RA: create merge sets for the sources and defs of the instructions in repeat groups. This way, RA will try to allocate consecutive registers for them. This is not forced, though, because we prefer splitting up repeat groups over creating movs to reorder registers.
5. Post-RA: create actual `rptN` instructions for repeat groups where the allocated registers allow it.
The idea for step 2 is that we prefer that any potential optimizations take precedence over creating `rptN` instructions as the latter will only yield a code size benefit. However, it might be interesting to investigate if we could make some optimizations repeat aware. For example, the scheduler could try to schedule instructions of a repeat group together.
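As a rough illustration of the post-RA check in step 5, the sketch below decides whether a repeat group can be folded into a single `rptN` instruction. It is not the MR's implementation: `merge_repeat_group` and the flat register numbering are assumptions made for the example, the group size limit reflects `(rptN)` repeating an instruction 1-3 times, and real code would have to consider more than register numbers (for example, whether the instructions are still consecutive).

```python
def merge_repeat_group(instrs):
    """Decide whether a repeat group can become one (rptN) instruction.

    `instrs` is a list of (dst, srcs) tuples holding flat scalar register
    numbers as allocated by RA (hypothetical representation). Returns
    (rpt, r_flags) on success, or None when the group must stay split.
    """
    if not 2 <= len(instrs) <= 4:
        return None
    dst0, srcs0 = instrs[0]
    if any(len(srcs) != len(srcs0) for _, srcs in instrs):
        return None
    # The destination must increment by one register per repetition.
    if any(dst != dst0 + i for i, (dst, _) in enumerate(instrs)):
        return None
    r_flags = []
    for n, base in enumerate(srcs0):
        column = [srcs[n] for _, srcs in instrs]
        if all(reg == base + i for i, reg in enumerate(column)):
            r_flags.append(True)   # consecutive registers: needs the (r) flag
        elif all(reg == base for reg in column):
            r_flags.append(False)  # same register every time: no (r) flag
        else:
            return None            # neither pattern: cannot be merged
    return len(instrs) - 1, r_flags


# add.f r0.x, r1.x, r2.x ; add.f r0.y, r1.y, r2.x
print(merge_repeat_group([(0, (4, 8)), (1, (5, 8))]))
```

For the two `add.f` instructions from the earlier example this yields a repeat count of 1 with the `(r)` flag on the first source only.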
# Results
The total code size reduction on shader-db is 10.14%.
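The headline figure corresponds to the `dwords` row of the report below. A small sketch of the arithmetic (the helper name `pct_change` is made up for the example):

```python
def pct_change(before: int, after: int) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# total dwords in shared programs: 9029932 -> 8113958
print(f"{pct_change(9029932, 8113958):.2f}%")  # prints -10.14%
```

The same formula applied to the total instruction count (4179917 -> 4151747) gives -0.67%, which shows that the gain comes from code size rather than from the number of instructions executed.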
<details>
<summary>Details</summary>
<pre>
total instructions in shared programs: 4179917 -> 4151747 (-0.67%)
instructions in affected programs: 3509729 -> 3481559 (-0.80%)
helped: 12025
HURT: 8185
helped stats (abs) min: 1 max: 1051 x̄: 6.16 x̃: 2
helped stats (rel) min: 0.05% max: 36.99% x̄: 4.08% x̃: 2.78%
HURT stats (abs) min: 1 max: 681 x̄: 5.62 x̃: 3
HURT stats (rel) min: 0.06% max: 50.45% x̄: 4.26% x̃: 3.06%
95% mean confidence interval for instructions value: -1.72 -1.07
95% mean confidence interval for instructions %-change: -0.78% -0.62%
Instructions are helped.
total nops in shared programs: 933098 -> 901164 (-3.42%)
nops in affected programs: 806898 -> 774964 (-3.96%)
helped: 11827
HURT: 7808
helped stats (abs) min: 1 max: 423 x̄: 6.14 x̃: 2
helped stats (rel) min: 0.23% max: 100.00% x̄: 38.09% x̃: 33.33%
HURT stats (abs) min: 1 max: 588 x̄: 5.21 x̃: 3
HURT stats (rel) min: 0.00% max: 2300.00% x̄: 47.45% x̃: 16.05%
95% mean confidence interval for nops value: -1.93 -1.32
95% mean confidence interval for nops %-change: -5.24% -2.91%
Nops are helped.
total non-nops in shared programs: 3246819 -> 3250583 (0.12%)
non-nops in affected programs: 1309486 -> 1313250 (0.29%)
helped: 2240
HURT: 2425
helped stats (abs) min: 1 max: 802 x̄: 4.18 x̃: 3
helped stats (rel) min: 0.05% max: 29.55% x̄: 2.45% x̃: 1.65%
HURT stats (abs) min: 1 max: 155 x̄: 5.42 x̃: 3
HURT stats (rel) min: 0.03% max: 50.45% x̄: 3.16% x̃: 1.44%
95% mean confidence interval for non-nops value: 0.26 1.35
95% mean confidence interval for non-nops %-change: 0.34% 0.60%
Non-nops are HURT.
total mov in shared programs: 162669 -> 166980 (2.65%)
mov in affected programs: 81329 -> 85640 (5.30%)
helped: 2184
HURT: 2394
helped stats (abs) min: 1 max: 77 x̄: 3.35 x̃: 2
helped stats (rel) min: 0.69% max: 100.00% x̄: 44.05% x̃: 37.50%
HURT stats (abs) min: 1 max: 155 x̄: 4.86 x̃: 3
HURT stats (rel) min: 0.00% max: 2500.00% x̄: 64.78% x̃: 25.00%
95% mean confidence interval for mov value: 0.71 1.17
95% mean confidence interval for mov %-change: 9.65% 16.06%
Mov are HURT.
total cov in shared programs: 89791 -> 89810 (0.02%)
cov in affected programs: 898 -> 917 (2.12%)
helped: 1
HURT: 7
helped stats (abs) min: 3 max: 3 x̄: 3.00 x̃: 3
helped stats (rel) min: 1.22% max: 1.22% x̄: 1.22% x̃: 1.22%
HURT stats (abs) min: 1 max: 16 x̄: 3.14 x̃: 1
HURT stats (rel) min: 0.66% max: 1600.00% x̄: 229.49% x̃: 0.88%
95% mean confidence interval for cov value: -2.37 7.12
95% mean confidence interval for cov %-change: -272.05% 673.36%
Inconclusive result (value mean confidence interval includes 0).
total dwords in shared programs: 9029932 -> 8113958 (-10.14%)
dwords in affected programs: 7871742 -> 6955768 (-11.64%)
helped: 25697
HURT: 153
helped stats (abs) min: 2 max: 3226 x̄: 35.83 x̃: 32
helped stats (rel) min: 0.09% max: 62.50% x̄: 20.26% x̃: 16.67%
HURT stats (abs) min: 2 max: 192 x̄: 30.44 x̃: 30
HURT stats (rel) min: 0.35% max: 26.09% x̄: 5.70% x̃: 5.21%
95% mean confidence interval for dwords value: -36.18 -34.69
95% mean confidence interval for dwords %-change: -20.29% -19.93%
Dwords are helped.
total last-baryf in shared programs: 138838 -> 145081 (4.50%)
last-baryf in affected programs: 76911 -> 83154 (8.12%)
helped: 471
HURT: 811
helped stats (abs) min: 1 max: 118 x̄: 8.75 x̃: 4
helped stats (rel) min: 0.41% max: 100.00% x̄: 18.64% x̃: 11.11%
HURT stats (abs) min: 1 max: 181 x̄: 12.78 x̃: 7
HURT stats (rel) min: 0.42% max: 2400.00% x̄: 99.13% x̃: 18.97%
95% mean confidence interval for last-baryf value: 3.81 5.93
95% mean confidence interval for last-baryf %-change: 45.91% 65.81%
Last-baryf are HURT.
total last-helper in shared programs: 1208139 -> 1191785 (-1.35%)
last-helper in affected programs: 1080159 -> 1063805 (-1.51%)
helped: 2818
HURT: 2235
helped stats (abs) min: 1 max: 370 x̄: 22.34 x̃: 8
helped stats (rel) min: 0.08% max: 100.00% x̄: 18.27% x̃: 8.89%
HURT stats (abs) min: 1 max: 229 x̄: 20.85 x̃: 8
HURT stats (rel) min: 0.00% max: 2170.00% x̄: 25.60% x̃: 5.56%
95% mean confidence interval for last-helper value: -4.41 -2.06
95% mean confidence interval for last-helper %-change: -0.75% 3.01%
Inconclusive result (%-change mean confidence interval includes 0).
total half in shared programs: 0 -> 0
half in affected programs: 0 -> 0
helped: 0
HURT: 0
total full in shared programs: 217263 -> 228709 (5.27%)
full in affected programs: 44215 -> 55661 (25.89%)
helped: 23
HURT: 9815
helped stats (abs) min: 1 max: 16 x̄: 2.74 x̃: 2
helped stats (rel) min: 14.29% max: 50.00% x̄: 23.01% x̃: 20.00%
HURT stats (abs) min: 1 max: 16 x̄: 1.17 x̃: 1
HURT stats (rel) min: 3.45% max: 100.00% x̄: 28.31% x̃: 25.00%
95% mean confidence interval for full value: 1.15 1.18
95% mean confidence interval for full %-change: 27.94% 28.43%
Full are HURT.
total constlen in shared programs: 622684 -> 622684 (0.00%)
constlen in affected programs: 0 -> 0
helped: 0
HURT: 0
total cat0 in shared programs: 1031899 -> 1000258 (-3.07%)
cat0 in affected programs: 880633 -> 848992 (-3.59%)
helped: 11828
HURT: 7808
helped stats (abs) min: 1 max: 424 x̄: 6.12 x̃: 2
helped stats (rel) min: 0.21% max: 90.00% x̄: 27.20% x̃: 25.00%
HURT stats (abs) min: 1 max: 588 x̄: 5.22 x̃: 3
HURT stats (rel) min: 0.15% max: 3600.00% x̄: 60.50% x̃: 20.00%
95% mean confidence interval for cat0 value: -1.91 -1.31
95% mean confidence interval for cat0 %-change: 6.38% 8.96%
Inconclusive result (value mean confidence interval and %-change mean confidence interval disagree).
total cat1 in shared programs: 256026 -> 259430 (1.33%)
cat1 in affected programs: 122730 -> 126134 (2.77%)
helped: 2196
HURT: 2408
helped stats (abs) min: 1 max: 794 x̄: 4.18 x̃: 2
helped stats (rel) min: 0.54% max: 100.00% x̄: 30.91% x̃: 20.00%
HURT stats (abs) min: 1 max: 155 x̄: 5.22 x̃: 3
HURT stats (rel) min: 0.00% max: 2500.00% x̄: 55.93% x̃: 16.67%
95% mean confidence interval for cat1 value: 0.20 1.28
95% mean confidence interval for cat1 %-change: 11.49% 17.54%
Cat1 are HURT.
total cat2 in shared programs: 1512198 -> 1512327 (<.01%)
cat2 in affected programs: 36177 -> 36306 (0.36%)
helped: 51
HURT: 38
helped stats (abs) min: 1 max: 2 x̄: 1.84 x̃: 2
helped stats (rel) min: 0.28% max: 28.57% x̄: 10.21% x̃: 9.52%
HURT stats (abs) min: 1 max: 70 x̄: 5.87 x̃: 2
HURT stats (rel) min: 0.05% max: 13.19% x̄: 3.18% x̃: 1.53%
95% mean confidence interval for cat2 value: -0.29 3.19
95% mean confidence interval for cat2 %-change: -6.27% -2.71%
Inconclusive result (value mean confidence interval includes 0).
total cat3 in shared programs: 1194302 -> 1194302 (0.00%)
cat3 in affected programs: 0 -> 0
helped: 0
HURT: 0
total cat4 in shared programs: 84081 -> 84081 (0.00%)
cat4 in affected programs: 0 -> 0
helped: 0
HURT: 0
total cat5 in shared programs: 48109 -> 48058 (-0.11%)
cat5 in affected programs: 105 -> 54 (-48.57%)
helped: 51
HURT: 0
helped stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1
helped stats (rel) min: 14.29% max: 100.00% x̄: 67.78% x̃: 50.00%
95% mean confidence interval for cat5 value: -1.00 -1.00
95% mean confidence interval for cat5 %-change: -76.79% -58.78%
Cat5 are helped.
total cat6 in shared programs: 50156 -> 50145 (-0.02%)
cat6 in affected programs: 1002 -> 991 (-1.10%)
helped: 2
HURT: 1
helped stats (abs) min: 8 max: 8 x̄: 8.00 x̃: 8
helped stats (rel) min: 2.75% max: 2.75% x̄: 2.75% x̃: 2.75%
HURT stats (abs) min: 5 max: 5 x̄: 5.00 x̃: 5
HURT stats (rel) min: 1.19% max: 1.19% x̄: 1.19% x̃: 1.19%
total cat7 in shared programs: 3146 -> 3146 (0.00%)
cat7 in affected programs: 0 -> 0
helped: 0
HURT: 0
total stp in shared programs: 2448 -> 2432 (-0.65%)
stp in affected programs: 1200 -> 1184 (-1.33%)
helped: 2
HURT: 0
total ldp in shared programs: 568 -> 557 (-1.94%)
ldp in affected programs: 496 -> 485 (-2.22%)
helped: 2
HURT: 1
helped stats (abs) min: 8 max: 8 x̄: 8.00 x̃: 8
helped stats (rel) min: 19.05% max: 19.05% x̄: 19.05% x̃: 19.05%
HURT stats (abs) min: 5 max: 5 x̄: 5.00 x̃: 5
HURT stats (rel) min: 1.21% max: 1.21% x̄: 1.21% x̃: 1.21%
total sstall in shared programs: 415472 -> 417708 (0.54%)
sstall in affected programs: 320154 -> 322390 (0.70%)
helped: 2898
HURT: 3584
helped stats (abs) min: 1 max: 219 x̄: 6.60 x̃: 5
helped stats (rel) min: 0.09% max: 100.00% x̄: 28.52% x̃: 19.15%
HURT stats (abs) min: 1 max: 93 x̄: 5.96 x̃: 4
HURT stats (rel) min: 0.00% max: 1800.00% x̄: 44.40% x̃: 14.29%
95% mean confidence interval for sstall value: 0.10 0.59
95% mean confidence interval for sstall %-change: 9.69% 13.91%
Sstall are HURT.
total (ss) in shared programs: 102307 -> 102046 (-0.26%)
(ss) in affected programs: 60114 -> 59853 (-0.43%)
helped: 2591
HURT: 2605
helped stats (abs) min: 1 max: 40 x̄: 1.49 x̃: 1
helped stats (rel) min: 0.51% max: 100.00% x̄: 21.04% x̃: 16.67%
HURT stats (abs) min: 1 max: 11 x̄: 1.38 x̃: 1
HURT stats (rel) min: 0.00% max: 400.00% x̄: 35.29% x̃: 25.00%
95% mean confidence interval for (ss) value: -0.10 0.00
95% mean confidence interval for (ss) %-change: 6.16% 8.24%
Inconclusive result (value mean confidence interval includes 0).
total systall in shared programs: 765325 -> 764564 (-0.10%)
systall in affected programs: 503928 -> 503167 (-0.15%)
helped: 2717
HURT: 2446
helped stats (abs) min: 1 max: 806 x̄: 13.11 x̃: 6
helped stats (rel) min: 0.04% max: 100.00% x̄: 21.19% x̃: 11.90%
HURT stats (abs) min: 1 max: 200 x̄: 14.25 x̃: 8
HURT stats (rel) min: 0.00% max: 3600.00% x̄: 35.88% x̃: 12.31%
95% mean confidence interval for systall value: -0.82 0.53
95% mean confidence interval for systall %-change: 3.50% 8.19%
Inconclusive result (value mean confidence interval includes 0).
total (sy) in shared programs: 38451 -> 38584 (0.35%)
(sy) in affected programs: 9152 -> 9285 (1.45%)
helped: 714
HURT: 764
helped stats (abs) min: 1 max: 6 x̄: 1.18 x̃: 1
helped stats (rel) min: 1.75% max: 75.00% x̄: 30.29% x̃: 33.33%
HURT stats (abs) min: 1 max: 5 x̄: 1.28 x̃: 1
HURT stats (rel) min: 0.41% max: 400.00% x̄: 47.79% x̃: 50.00%
95% mean confidence interval for (sy) value: 0.02 0.16
95% mean confidence interval for (sy) %-change: 7.64% 12.50%
(sy) are HURT.
total waves in shared programs: 608612 -> 608298 (-0.05%)
waves in affected programs: 1662 -> 1348 (-18.89%)
helped: 17
HURT: 122
helped stats (abs) min: 2 max: 4 x̄: 2.35 x̃: 2
helped stats (rel) min: 20.00% max: 100.00% x̄: 38.82% x̃: 33.33%
HURT stats (abs) min: 2 max: 6 x̄: 2.90 x̃: 2
HURT stats (rel) min: 12.50% max: 50.00% x̄: 24.01% x̃: 25.00%
95% mean confidence interval for waves value: -2.61 -1.91
95% mean confidence interval for waves %-change: -20.30% -12.35%
Waves are HURT.
total loops in shared programs: 1088 -> 1088 (0.00%)
loops in affected programs: 0 -> 0
helped: 0
HURT: 0
</pre>
</details>https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28664ir3: optimize SSBO accesses using isam.v and immediate offsets2024-05-20T19:55:36ZJob Noormanir3: optimize SSBO accesses using isam.v and immediate offsetsThis series improves SSBO accesses on a6xx gen4 and a7xx in two ways: by using `isam.v` for multi-component SSBO loads and by using the new immediate offset fields in `isam.v`, `ldib.b`, and `stib.b`.
Besides adding new encodings for `isam.v`, supporting it is trivial as the scalarization step simply needs to be skipped whenever `isam.v` can be used.
To support the immediate offsets, a few steps needed to be taken:
- Add a `BASE` index to `load/store_ssbo_ir3` to store the offset.
- Make `nir_opt_offsets` compatible with ir3:
  - Since `load_ssbo_ir3` can be emitted as either `isam.v` (8-bit immediate offset) or `ldib.b` (7-bit), and `nir_opt_offsets` currently uses a single max offset per storage type, a new callback was added that can set the max offset on a per-instruction basis.
  - `nir_opt_offsets` currently bails out on potentially wrapping additions. However, on ir3 the immediate offset addition wraps the same way as normal unsigned additions, so this isn't necessary. An option was added to skip the check for wrapping additions.https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29203ir3: always run optimize when creating shader variants2024-05-16T17:00:54ZMike Blumenkrantzir3: always run optimize when creating shader variantsThis has some weird results:
```
Totals:
Instrs: 40912883 -> 40912893 (+0.00%)
CodeSize: 82998994 -> 82998990 (-0.00%)
NOPs: 8402185 -> 8402197 (+0.00%)
(ss): 1122875 -> 1122873 (-0.00%)
(ss)-stall: 3496388 -> 3496376 (-0.00%)
Cat0: 8993780 -> 8993792 (+0.00%)
Cat2: 15081662 -> 15081660 (-0.00%)
Totals from 2 (0.00% of 126999) affected shaders:
Instrs: 40 -> 50 (+25.00%)
CodeSize: 68 -> 64 (-5.88%)
NOPs: 18 -> 30 (+66.67%)
(ss): 6 -> 4 (-33.33%)
(ss)-stall: 12 -> 0 (-inf%)
Cat0: 24 -> 36 (+50.00%)
Cat2: 6 -> 4 (-33.33%)
```
In one of the pipelines I examined (I can send on demand) which has increased instruction counts, it looks like a few extra instructions get added in `nir_lower_vars_to_ssa` and then this never gets reduced again. Probably some missing optimization handling somewhere.https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29277tu: Expose reconverge related extensions2024-05-21T21:47:18ZValentine Burleytu: Expose reconverge related extensions### What does this MR do and why?
I tried exposing `VK_KHR_shader_subgroup_uniform_control_flow` and `VK_KHR_shader_maximal_reconvergence`, and on newer CTS versions the tests all pass for me on a619 with kgsl. However, with an older CTS such as 1.3.8.0, the `dEQP-VK.reconvergence.*` tests all fail with `Fail (Subgroup size greater than 64 not handled. at vktReconvergenceTests.cpp:1632)`, so I'm not sure if we can just advertise them as is.
I saw a TODO for moving `threadsize_base` and `max_waves` to `fd_dev_info`, so I decided to tackle it. Nevertheless, I am uncertain about the practicality of moving `max_waves`, as it currently appears that we would simply be setting it to 16 a couple dozen times.
`dEQP-VK.subgroups.subgroup_uniform_control_flow.*` on either 1.3.8.3 or 1.3.8.0
```
Test run totals:
Passed: 84/169 (49.7%)
Failed: 0/169 (0.0%)
Not supported: 85/169 (50.3%)
Warnings: 0/169 (0.0%)
Waived: 0/169 (0.0%)
```
`dEQP-VK.reconvergence.*` on 1.3.8.3
```
Test run totals:
Passed: 5983/6249 (95.7%)
Failed: 0/6249 (0.0%)
Not supported: 266/6249 (4.3%)
Warnings: 0/6249 (0.0%)
Waived: 0/6249 (0.0%)
```
`dEQP-VK.reconvergence.*` on 1.3.8.0
```
Test run totals:
Passed: 0/4800 (0.0%)
Failed: 4800/4800 (100.0%)
Not supported: 0/4800 (0.0%)
Warnings: 0/4800 (0.0%)
Waived: 0/4800 (0.0%)
```
`Fail (Subgroup size greater than 64 not handled. at vktReconvergenceTests.cpp:1632)`https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29357ir3/a7xx: Fix FS consts corruption when other FS has zero constlen2024-05-24T16:45:37ZDanylo Piliaievir3/a7xx: Fix FS consts corruption when other FS has zero constlen### What does this MR do and why?
ir3/a7xx: Fix FS consts corruption when other FS has zero constlen
Having zero consts in one FS may corrupt consts in follow-up FSs;
on such GPUs the blob never has zero consts in FS. The mechanism of
corruption is unknown.
Fixes geometry flickering in a number of games, including:
- Baldur's Gate 3
- Assassin's Creed Rogue
---
I stared at Baldur's Gate 3 for a while and don't have a better explanation of what's going on. There are no draws around the affected draw that load more consts than they use, nothing out of place, but using `MAX(fs->constlen, 4)` fixes the issue. And given that the blob doesn't set zero constlen on a740, this suggests that it is indeed a hardware quirk.