intel/brw: scoreboarding regression

Commit ff89e831

Author: Caio Oliveira <caio.oliveira@intel.com>
Date:   Thu Apr 4 16:03:34 2024 -0700

    intel/brw: Lower VGRFs to FIXED_GRFs earlier
    
    Moves the lowering of VGRFs into FIXED_GRFs from the code generation
    to (almost) right after the register allocation.
    
    This will allow: (1) later passes not worry about VGRFs (and what they
    mean in a post reg alloc phase) and (2) make easier to add certain
    types of validation post reg alloc phase using the backend IR.
    
    Note that a couple of passes still take advantage of seeing "allocated
    VGRFs", so perform lowering after they run.
    
    Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
    Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
    Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28604>

The commit above is triggering a scoreboarding regression on a branch I'm working on.

Since all of my modifications run before the optimization passes and register allocation, the problem appears to have been introduced by the commit above.

I can reproduce it in simulation on DG2 with dEQP-VK.pipeline.monolithic.extended_dynamic_state.mesh_shader.cmd_buffer_start.stencil_state_face_front_lt_inc_clamp_clear_254_ref_253_pass

With the commit reverted:

Native code for unnamed mesh shader (null) (src_hash 0x48a71db2) (sha1 4e81188fb6d1bfb4e44efc46b8d6aac79a4f303e)
SIMD8 shader: 122 instructions. 0 loops. 1113 cycles. 0:0 spills:fills, 8 sends, scheduled with mode top-down. Promoted 0 constants. Compacted 1952 to 1776 bytes (9%)
   START B0 (90 cycles)
send(1)         g3UD            g2UD            nullUD          0x0228d580                0x00000000
                            ugm MsgDesc: ( load, a64, d32, V16, transpose, L1C_L3C dst_len = 2, src0_len = 1, src1_len = 0 flat )  base_offset 0  { align1 WE_all 1N $0 };
add(1)          g76<2>UD        g2<0,1,0>UD     0x00000080UD    { align1 WE_all 1N $0.src };
and(8)          g59<1>UD        g0.6<0,1,0>UD   0x0000ffffUD    { align1 1Q };
mov(8)          g56<1>UD        g0.13<0,1,0>UW                  { align1 1Q };
mov(8)          g57<1>UD        g0.8<0,1,0>UW                   { align1 1Q };
mov(8)          g6<1>UD         g0.9<0,1,0>UW                   { align1 1Q };
cmp.l.f0.0(1)   null<1>UD       g76<0,1,0>UD    g2<0,1,0>UD     { align1 WE_all 1N I@5 };
add(8)          g7<1>D          g57<1,1,0>D     g6<1,1,0>D      { align1 1Q I@2 compacted };
(+f0.0) add(1)  g76.1<2>UD      g2.1<0,1,0>UD   0x00000001UD    { align1 WE_all 1N };
(-f0.0) mov(1)  g76.1<2>UD      g2.1<0,1,0>UD                   { align1 WE_all 1N I@1 };
mov(8)          g2<1>UD         g0.1<0,1,0>UD                   { align1 1Q };
sync nop(1)                     null<0,1,0>UB                   { align1 WE_all 1N I@2 };
send(1)         g5UD            g76UD           nullUD          0x0218c580                0x00000000
                            ugm MsgDesc: ( load, a64, d32, V8, transpose, L1C_L3C dst_len = 1, src0_len = 1, src1_len = 0 flat )  base_offset 0  { align1 WE_all 1N $1 };
cmp.nz.f0.0(8)  null<1>D        g7<8,8,1>D      2D              { align1 1Q I@4 };
(+f0.0) if(8)   JIP:  LABEL0          UIP:  LABEL0              { align1 1Q };
   END B0 ->B1 ->B2
   START B1 <-B0 (448 cycles)

Without the revert:

Native code for unnamed mesh shader (null) (src_hash 0x48a71db2) (sha1 c34d03a6e2a9ce0049b62b466e0d03768657fecd)
SIMD8 shader: 122 instructions. 0 loops. 1111 cycles. 0:0 spills:fills, 8 sends, scheduled with mode top-down. Promoted 0 constants. Compacted 1952 to 1776 bytes (9%)
   START B0 (88 cycles)
send(1)         g3UD            g2UD            nullUD          0x0228d580                0x00000000
                            ugm MsgDesc: ( load, a64, d32, V16, transpose, L1C_L3C dst_len = 2, src0_len = 1, src1_len = 0 flat )  base_offset 0  { align1 WE_all 1N $0 };
add(1)          g76<2>UD        g2<0,1,0>UD     0x00000080UD    { align1 WE_all 1N $0.src };
and(8)          g59<1>UD        g0.6<0,1,0>UD   0x0000ffffUD    { align1 1Q };
mov(8)          g56<1>UD        g0.13<0,1,0>UW                  { align1 1Q };
mov(8)          g57<1>UD        g0.8<0,1,0>UW                   { align1 1Q };
mov(8)          g6<1>UD         g0.9<0,1,0>UW                   { align1 1Q };
cmp.l.f0.0(1)   null<1>UD       g76<0,1,0>UD    g2<0,1,0>UD     { align1 WE_all 1N };
add(8)          g7<1>D          g57<1,1,0>D     g6<1,1,0>D      { align1 1Q I@2 compacted };
(+f0.0) add(1)  g76.1<2>UD      g2.1<0,1,0>UD   0x00000001UD    { align1 WE_all 1N I@7 };
(-f0.0) mov(1)  g76.1<2>UD      g2.1<0,1,0>UD                   { align1 WE_all 1N I@1 };
mov(8)          g2<1>UD         g0.1<0,1,0>UD                   { align1 1Q };
sync nop(1)                     null<0,1,0>UB                   { align1 WE_all 1N I@2 };
send(1)         g5UD            g76UD           nullUD          0x0218c580                0x00000000
                            ugm MsgDesc: ( load, a64, d32, V8, transpose, L1C_L3C dst_len = 1, src0_len = 1, src1_len = 0 flat )  base_offset 0  { align1 WE_all 1N $1 };
cmp.nz.f0.0(8)  null<1>D        g7<8,8,1>D      2D              { align1 1Q I@4 };
(+f0.0) if(8)   JIP:  LABEL0          UIP:  LABEL0              { align1 1Q };
   END B0 ->B1 ->B2
   START B1 <-B0 (448 cycles)
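
In block B0 the two dumps have the same instructions and differ only in their scoreboard annotations: with the revert, the cmp.l.f0.0(1) carries I@5 and the (+f0.0) add(1) carries none; without the revert, the cmp has none and the (+f0.0) add(1) carries I@7. For longer dumps, a small script can pick out such differences. The following is only a rough sketch in Python, not part of Mesa: the names split_options and diff_swsb are hypothetical, and the regular expressions assume that only the annotation forms appearing in the dumps above (I@N, $N, $N.src, $N.dst) need to be recognized.

```python
import re

# Trailing "{ ... }" option group of a disassembled instruction,
# e.g. "{ align1 WE_all 1N I@5 };".
OPTIONS_RE = re.compile(r"\{([^{}]*)\}\s*;?\s*$")

# Scoreboard annotations as printed in the dumps in this report.
# Assumption: only these forms need to be recognized.
SWSB_RE = re.compile(r"^(?:[A-Z]@\d+|\$\d+(?:\.(?:src|dst))?)$")


def split_options(line):
    """Return (instruction without scoreboard tokens, scoreboard tokens),
    or None if the line has no "{ ... }" option group."""
    m = OPTIONS_RE.search(line)
    if m is None:
        return None
    tokens = m.group(1).split()
    swsb = tuple(t for t in tokens if SWSB_RE.match(t))
    rest = " ".join(t for t in tokens if not SWSB_RE.match(t))
    body = " ".join(line[:m.start()].split())  # normalize column padding
    return f"{body} {{ {rest} }}", swsb


def diff_swsb(good_dump, bad_dump):
    """List (index, instruction, good_tokens, bad_tokens) for instructions
    that match pairwise but carry different scoreboard annotations."""
    good = [p for p in map(split_options, good_dump.splitlines()) if p]
    bad = [p for p in map(split_options, bad_dump.splitlines()) if p]
    return [
        (i, g[0], g[1], b[1])
        for i, (g, b) in enumerate(zip(good, bad))
        if g[0] == b[0] and g[1] != b[1]
    ]


# Three instructions copied from each of the dumps above.
GOOD = """\
cmp.l.f0.0(1)   null<1>UD       g76<0,1,0>UD    g2<0,1,0>UD     { align1 WE_all 1N I@5 };
add(8)          g7<1>D          g57<1,1,0>D     g6<1,1,0>D      { align1 1Q I@2 compacted };
(+f0.0) add(1)  g76.1<2>UD      g2.1<0,1,0>UD   0x00000001UD    { align1 WE_all 1N };
"""

BAD = """\
cmp.l.f0.0(1)   null<1>UD       g76<0,1,0>UD    g2<0,1,0>UD     { align1 WE_all 1N };
add(8)          g7<1>D          g57<1,1,0>D     g6<1,1,0>D      { align1 1Q I@2 compacted };
(+f0.0) add(1)  g76.1<2>UD      g2.1<0,1,0>UD   0x00000001UD    { align1 WE_all 1N I@7 };
"""

if __name__ == "__main__":
    for index, insn, good_tokens, bad_tokens in diff_swsb(GOOD, BAD):
        print(index, insn, "reverted:", good_tokens, "not reverted:", bad_tokens)
```

The script pairs instructions by position, so it is only meaningful when both dumps contain the same instruction sequence, as is the case here (122 instructions in both).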
