 
Commits (15)
  • nir: propagate known constant values into the if-then branch · 38f07e0a
    Timothy Arceri authored
    Helps Max Waves / VGPR use in a bunch of Unigine Heaven
    shaders.
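    
    For illustration, a hypothetical NIR-style snippet (not taken from the
    pass or its tests): uses of the compared value that are dominated by
    the then-branch are rewritten to the known constant.
    
       ssa_2 = feq ssa_1, ssa_0   /* ssa_0 = load_const 0.0 */
       if ssa_2 {
          ssa_3 = fadd ssa_1, ssa_4   /* becomes: fadd ssa_0, ssa_4 */
          ...
       }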
    
    shader-db results radeonsi (VEGA):
    Totals from affected shaders:
    SGPRS: 5505440 -> 5505872 (0.01 %)
    VGPRS: 3077520 -> 3077296 (-0.01 %)
    Spilled SGPRs: 39032 -> 39030 (-0.01 %)
    Spilled VGPRs: 16326 -> 16326 (0.00 %)
    Private memory VGPRs: 0 -> 0 (0.00 %)
    Scratch size: 744 -> 744 (0.00 %) dwords per thread
    Code Size: 123755028 -> 123753316 (-0.00 %) bytes
    Compile Time: 2751028 -> 2560786 (-6.92 %) milliseconds
    LDS: 1415 -> 1415 (0.00 %) blocks
    Max Waves: 972192 -> 972240 (0.00 %)
    Wait states: 0 -> 0 (0.00 %)
    
    vkpipeline-db results RADV (VEGA):
    
    Totals from affected shaders:
    SGPRS: 160 -> 160 (0.00 %)
    VGPRS: 88 -> 88 (0.00 %)
    Spilled SGPRs: 0 -> 0 (0.00 %)
    Spilled VGPRs: 0 -> 0 (0.00 %)
    Private memory VGPRs: 0 -> 0 (0.00 %)
    Scratch size: 0 -> 0 (0.00 %) dwords per thread
    Code Size: 18268 -> 18152 (-0.63 %) bytes
    LDS: 0 -> 0 (0.00 %) blocks
    Max Waves: 26 -> 26 (0.00 %)
    Wait states: 0 -> 0 (0.00 %)
    
    v3: disable opt_for_known_values() for non-scalar constants
    
    Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com> (v2)
  • 521c1f58
    Jason Ekstrand authored
  • nir/gcm: Loop over blocks in pin_instructions · f9dd4967
    Jason Ekstrand authored
    Now that we have the new block iterators, we can simplify things a bit.
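    
    A minimal sketch of the simplified shape, assuming the pass marks
    pinned instructions via nir_instr::pass_flags (the flag value and the
    gcm_instr_is_pinned() helper are illustrative, not the pass's exact
    code):
    
       #define GCM_INSTR_PINNED (1 << 0)   /* assumed flag bit */
    
       static void
       gcm_pin_instructions(nir_function_impl *impl)
       {
          /* The new iterators let us visit every block in source order
           * instead of recursing through the CF tree by hand. */
          nir_foreach_block(block, impl) {
             nir_foreach_instr(instr, block) {
                instr->pass_flags =
                   gcm_instr_is_pinned(instr) ? GCM_INSTR_PINNED : 0;
             }
          }
       }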
  • nir/gcm: Use an array for storing the early block · b911b27d
    Jason Ekstrand authored
    We are about to adjust our instruction block assignment algorithm and we
    will want to know the current block that the instruction lives in.  In
    order to allow for this, we can't overwrite nir_instr::block in the
    early scheduling pass.
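    
    A sketch of the idea with assumed variable names (nir_index_instrs()
    is a real NIR helper; the rest is illustrative): record the early
    placement out-of-band so nir_instr::block keeps pointing at the
    instruction's current block.
    
       unsigned num_instrs = nir_index_instrs(impl);
       nir_block **early_block = calloc(num_instrs, sizeof(nir_block *));
    
       /* Early scheduling stores its result instead of moving the
        * instruction, leaving nir_instr::block intact for the late pass. */
       early_block[instr->index] = computed_early_block;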
  • 4672f723
    Jason Ekstrand authored
  • nir/gcm: Add a real concept of "progress" · e3ba1973
    Jason Ekstrand authored
    Now that the GCM pass is more conservative and only moves instructions
    to different blocks when it's advantageous to do so, we can have a
    proper notion of what it means to make progress.
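    
    A sketch of what "progress" means here; the helper name and the
    placement logic around it are assumptions, not the pass's exact code:
    
       /* Only count a real move as progress. */
       static bool
       gcm_place_instr(nir_instr *instr, nir_block *chosen_block)
       {
          if (instr->block == chosen_block)
             return false;   /* staying put is not progress */
    
          nir_instr_remove(instr);
          nir_instr_insert(nir_after_block_before_jump(chosen_block), instr);
          return true;
       }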
  • nir/gcm: Delete dead instructions · 741aa04e
    Jason Ekstrand authored
    Classically, global code motion is also a dead code pass.  However, in
    the initial implementation, the decision was made to place every
    instruction and let conventional DCE clean up the dead ones.  Because
    any uses of a dead instruction are unreachable, we have no late block
    and the dead instructions are always scheduled early.  The problem is
    that, because we place the dead instruction early, it pushes the
    placement of any dependencies of the dead instruction earlier than they
    may need to be placed.  In order to prevent dead instructions from
    affecting the placement of live ones, we need to delete them.
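    
    A minimal sketch of the check, assuming a single SSA destination for
    brevity (the state struct is also assumed):
    
       /* If the SSA destination has no uses left, drop the instruction
        * instead of letting it drag its dependencies earlier. */
       nir_ssa_def *def = &nir_instr_as_alu(instr)->dest.dest.ssa;
       if (list_is_empty(&def->uses) && list_is_empty(&def->if_uses)) {
          nir_instr_remove(instr);
          state->progress = true;   /* assumed pass-local state */
       }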
  • f5f4f709
    Jason Ekstrand authored
  • nir/gcm: allow derivative dependent intrinsics to be moved earlier · c562b86e
    Timothy Arceri authored
    We can't move them later as we could move them into non-uniform
    control flow, but moving them earlier should be fine.
    
    This helps avoid a bunch of spilling in Unigine shaders caused by
    moving the tex instruction's sources earlier (outside if branches)
    but not the instruction itself. A sketch of the direction marking
    follows.
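    
    The flag and both helpers below are assumed names rather than the
    pass's exact code:
    
       static void
       gcm_mark_movable(nir_instr *instr)
       {
          if (instr_is_derivative_dependent(instr)) {
             /* Safe to hoist, but never sink into non-uniform CF. */
             instr->pass_flags = GCM_INSTR_SCHEDULE_EARLIER_ONLY;
          } else if (instr_is_pinned(instr)) {
             instr->pass_flags = GCM_INSTR_PINNED;
          } else {
             instr->pass_flags = 0;   /* freely movable */
          }
       }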
  • nir/gcm: be more conservative about moving instructions from loops · f940ffa8
    Timothy Arceri authored
    Here we only pull instructions further up control flow if they are
    constant or texture instructions. See the code comment for more
    information.
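    
    A sketch of the filter, with an assumed helper name:
    
       static bool
       gcm_can_hoist_out_of_loop(const nir_instr *instr)
       {
          /* Hoisting extends live ranges across the whole loop body, so
           * only do it where it is known to pay off; see the pass's code
           * comment for the full rationale. */
          return instr->type == nir_instr_type_load_const ||
                 instr->type == nir_instr_type_tex;
       }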
  • nir/gcm: don't move movs unless we can replace them later with their src · 6136d67a
    Timothy Arceri authored
    This helps us avoid moving the movs outside if branches when their
    src can't be scalarized.
    
    For example it avoids:
    
       vec4 32 ssa_7 = tex ssa_6 (coord), 0 (texture), 0 (sampler),
       if ... {
          r0 = imov ssa_7.z
          r1 = imov ssa_7.y
          r2 = imov ssa_7.x
          r3 = imov ssa_7.w
          ...
       } else {
          ...
          if ... {
             r0 = imov ssa_7.x
             r1 = imov ssa_7.w
             ...
        } else {
             r0 = imov ssa_7.z
             r1 = imov ssa_7.y
             ...
          }
          r2 = imov ssa_7.x
          r3 = imov ssa_7.w
       }
       ...
       vec4 32 ssa_36 = vec4 r0, r1, r2, r3
    
    Becoming something like:
    
       vec4 32 ssa_7 = tex ssa_6 (coord), 0 (texture), 0 (sampler),
       r0 = imov ssa_7.z
       r1 = imov ssa_7.y
       r2 = imov ssa_7.x
       r3 = imov ssa_7.w
    
       if ... {
          ...
       } else {
          if ... {
             r0 = imov r2
             r1 = imov r3
             ...
        } else {
             ...
          }
          ...
       }
    
    While this has a smaller instruction count, it requires more work
    for the same result. With more complex examples we can also end up
    shuffling the registers around in a way that requires more registers
    to use as temps so that we don't overwrite our original values along
    the way.
  • nir/gcm: be less destructive with pinned instruction order · 2c00fcc1
    Timothy Arceri authored
    This changes the pass to extract pinned instructions, not just unpinned
    ones, when rescheduling. This stops pinned instructions from being
    bunched together, which could regress cycle and instruction counts on
    i965 and register use / Max Waves on AMD hardware.
  • anv/i965: call nir_opt_dead_cf() after we have finished all opts · d49ef2a3
    Timothy Arceri authored
    This will avoid a regression with the following patch.
  • anv/i965: Use GCM and GVN in the first run of nir_optimize · 915415d8
    Timothy Arceri authored
    A large amount of the cycle hurt in shader-db appears to be due to
    calling discard earlier in fragment shaders.
    
    Shader-db results (SKL):
    
    total instructions in shared programs: 15422340 -> 15015725 (-2.64%)
    instructions in affected programs: 2744968 -> 2338353 (-14.81%)
    helped: 4402
    HURT: 465
    
    total cycles in shared programs: 358373791 -> 243554152 (-32.04%)
    cycles in affected programs: 213519287 -> 98699648 (-53.77%)
    helped: 3930
    HURT: 1034
    
    total loops in shared programs: 4366 -> 4362 (-0.09%)
    loops in affected programs: 8 -> 4 (-50.00%)
    helped: 4
    HURT: 0
    
    total spills in shared programs: 23673 -> 16143 (-31.81%)
    spills in affected programs: 18904 -> 11374 (-39.83%)
    helped: 343
    HURT: 65
    
    total fills in shared programs: 32036 -> 29845 (-6.84%)
    fills in affected programs: 18381 -> 16190 (-11.92%)
    helped: 345
    HURT: 65
    
    LOST:   10
    GAINED: 6
    
    Shader-db results (SKL) - Dolphin Uber shader results removed:
    
    total instructions in shared programs: 15239852 -> 14811490 (-2.81%)
    instructions in affected programs: 2562485 -> 2134123 (-16.72%)
    helped: 4387
    HURT: 425
    
    total cycles in shared programs: 291098452 -> 203050594 (-30.25%)
    cycles in affected programs: 146243972 -> 58196114 (-60.21%)
    helped: 3875
    HURT: 1034
    
    total loops in shared programs: 3666 -> 3662 (-0.11%)
    loops in affected programs: 8 -> 4 (-50.00%)
    helped: 4
    HURT: 0
    
    total spills in shared programs: 21099 -> 5755 (-72.72%)
    spills in affected programs: 16330 -> 986 (-93.96%)
    helped: 343
    HURT: 10
    
    total fills in shared programs: 28481 -> 15614 (-45.18%)
    fills in affected programs: 14826 -> 1959 (-86.79%)
    helped: 345
    HURT: 10
    
    LOST:   11
    GAINED: 6
    
    Shader-db results (SKL) -
    Dolphin Uber and Deus-Ex shader results removed:
    
    total instructions in shared programs: 13596456 -> 13545174 (-0.38%)
    instructions in affected programs: 1538081 -> 1486799 (-3.33%)
    helped: 3736
    HURT: 422
    
    total cycles in shared programs: 154429067 -> 153949773 (-0.31%)
    cycles in affected programs: 18702937 -> 18223643 (-2.56%)
    helped: 3251
    HURT: 985
    
    total loops in shared programs: 2381 -> 2377 (-0.17%)
    loops in affected programs: 8 -> 4 (-50.00%)
    helped: 4
    HURT: 0
    
    total spills in shared programs: 3914 -> 4163 (6.36%)
    spills in affected programs: 597 -> 846 (41.71%)
    helped: 1
    HURT: 10
    
    total fills in shared programs: 13618 -> 13745 (0.93%)
    fills in affected programs: 1421 -> 1548 (8.94%)
    helped: 3
    HURT: 8
    
    LOST:   11
    GAINED: 5
  • nir: remove restrictions on opt_if_loop_last_continue() · e0a24b66
    Timothy Arceri authored
    When I implemented opt_if_loop_last_continue() I had restricted
    this pass from moving other if-statements inside the branch opposite
    the continue. At the time it was causing extra register pressure
    in some shaders, however that no longer seems to be an issue.
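    
    The shape of the transformation, paraphrasing the code comment in
    nir_opt_if.c (do_work_1()/do_work_2() are placeholders):
    
       loop {
          ...
          if (cond) {
             continue;
          } else {
             do_work_1();
          }
          do_work_2();
       }
    
    becomes:
    
       loop {
          ...
          if (cond) {
             continue;
          } else {
             do_work_1();
             do_work_2();
          }
       }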
    
    Samuel Pitoiset noticed that making this pass more aggressive
    significantly improved the performance of Doom on RADV. Below are
    the statistics he gathered.
    
    28717 shaders in 14931 tests
    Totals:
    SGPRS: 1267317 -> 1267549 (0.02 %)
    VGPRS: 896876 -> 895920 (-0.11 %)
    Spilled SGPRs: 24701 -> 26367 (6.74 %)
    Code Size: 48379452 -> 48507880 (0.27 %) bytes
    Max Waves: 241159 -> 241190 (0.01 %)
    
    Totals from affected shaders:
    SGPRS: 23584 -> 23816 (0.98 %)
    VGPRS: 25908 -> 24952 (-3.69 %)
    Spilled SGPRs: 503 -> 2169 (331.21 %)
    Code Size: 2471392 -> 2599820 (5.20 %) bytes
    Max Waves: 586 -> 617 (5.29 %)
    
    The code size increase is related to Wolfenstein II.
    
    This gives +10% FPS with Doom on my Vega56.
    
    Rhys Perry also benchmarked Doom on his VEGA64:
    
    Before: 72.53 FPS
    After:  80.77 FPS
    
    shader-db results i965 (SKL):
    
    total instructions in shared programs: 15029076 -> 15029877 (<.01%)
    instructions in affected programs: 493251 -> 494052 (0.16%)
    helped: 3
    HURT: 374
    
    total cycles in shared programs: 263387688 -> 263401720 (<.01%)
    cycles in affected programs: 30658226 -> 30672258 (0.05%)
    helped: 3
    HURT: 374
    
    total spills in shared programs: 9691 -> 9748 (0.59%)
    spills in affected programs: 88 -> 145 (64.77%)
    helped: 0
    HURT: 4
    
    total fills in shared programs: 22076 -> 22133 (0.26%)
    fills in affected programs: 128 -> 185 (44.53%)
    helped: 0
    HURT: 4
    
    LOST:   0
    GAINED: 2
    
    Both the gain and the spill hurt are in Doom shaders, which is similar
    to what we see on radeonsi; ironically, the Doom Vulkan shaders are
    the ones most helped by this change.
    
    Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
@@ -1875,13 +1875,20 @@ typedef struct nir_block {
    * dom_pre_index and dom_post_index for this block, which makes testing if
    * a given block is dominated by another block an O(1) operation.
    */
-   unsigned dom_pre_index, dom_post_index;
+   int16_t dom_pre_index, dom_post_index;
 
    /* live in and out for this block; used for liveness analysis */
    BITSET_WORD *live_in;
    BITSET_WORD *live_out;
 } nir_block;
 
+static inline bool
+nir_block_is_reachable(nir_block *b)
+{
+   /* See also nir_block_dominates */
+   return b->dom_post_index != -1;
+}
+
 static inline nir_instr *
 nir_block_first_instr(nir_block *block)
 {
......
@@ -42,6 +42,10 @@ init_block(nir_block *block, nir_function_impl *impl)
    block->imm_dom = NULL;
    block->num_dom_children = 0;
 
+   /* See nir_block_dominates */
+   block->dom_pre_index = INT16_MAX;
+   block->dom_post_index = -1;
+
    set_foreach(block->dom_frontier, entry) {
       _mesa_set_remove(block->dom_frontier, entry);
    }
@@ -201,18 +205,25 @@ nir_calc_dominance(nir_shader *shader)
    }
 }
 
+static nir_block *
+block_return_if_reachable(nir_block *b)
+{
+   return (b && nir_block_is_reachable(b)) ? b : NULL;
+}
+
 /**
- * Computes the least common anscestor of two blocks. If one of the blocks
- * is null, the other block is returned.
+ * Computes the least common ancestor of two blocks. If one of the blocks
+ * is null or unreachable, the other block is returned or NULL if it's
+ * unreachable.
  */
 nir_block *
 nir_dominance_lca(nir_block *b1, nir_block *b2)
 {
-   if (b1 == NULL)
-      return b2;
+   if (b1 == NULL || !nir_block_is_reachable(b1))
+      return block_return_if_reachable(b2);
 
-   if (b2 == NULL)
-      return b1;
+   if (b2 == NULL || !nir_block_is_reachable(b2))
+      return block_return_if_reachable(b1);
 
    assert(nir_cf_node_get_function(&b1->cf_node) ==
           nir_cf_node_get_function(&b2->cf_node));
@@ -224,7 +235,15 @@ nir_dominance_lca(nir_block *b1, nir_block *b2)
 }
 
 /**
- * Returns true if parent dominates child
+ * Returns true if parent dominates child according to the following
+ * definition:
+ *
+ *    "The block A dominates the block B if every path from the start block
+ *     to block B passes through A."
+ *
+ * This means, in particular, that any unreachable block is dominated by
+ * every other block and an unreachable block does not dominate anything
+ * except another unreachable block.
  */
 bool
 nir_block_dominates(nir_block *parent, nir_block *child)
@@ -235,6 +254,10 @@ nir_block_dominates(nir_block *parent, nir_block *child)
    assert(nir_cf_node_get_function(&parent->cf_node)->valid_metadata &
           nir_metadata_dominance);
 
+   /* If a block is unreachable, then nir_block::dom_pre_index == INT16_MAX
+    * and nir_block::dom_post_index == -1. This allows us to trivially handle
+    * unreachable blocks here with zero extra work.
+    */
    return child->dom_pre_index >= parent->dom_pre_index &&
           child->dom_post_index <= parent->dom_post_index;
 }
......
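To see why these sentinel values make the dominance test handle unreachable
blocks for free, here is a self-contained sketch of the comparison (plain C,
not code from the tree):

   #include <assert.h>
   #include <stdbool.h>
   #include <stdint.h>

   /* Stand-in for the (dom_pre_index, dom_post_index) pair. */
   typedef struct { int16_t pre, post; } blk;

   static bool dominates(blk parent, blk child)
   {
      return child.pre >= parent.pre && child.post <= parent.post;
   }

   int main(void)
   {
      blk reachable   = { .pre = 5, .post = 9 };          /* any DFS numbering */
      blk unreachable = { .pre = INT16_MAX, .post = -1 }; /* init_block() values */

      assert(dominates(reachable, unreachable));   /* dominated by everything */
      assert(!dominates(unreachable, reachable));  /* dominates nothing reachable */
      assert(dominates(unreachable, unreachable)); /* except other unreachable blocks */
      return 0;
   }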
This diff is collapsed.
@@ -823,48 +823,48 @@ nir_block_ends_in_continue(nir_block)
  *
  * The continue should then be removed by nir_opt_trivial_continues() and the
  * loop can potentially be unrolled.
- *
- * Note: do_work_2() is only ever blocks and nested loops. We could also nest
- * other if-statments in the branch which would allow further continues to
- * be removed. However in practice this can result in increased register
- * pressure.
  */
 static bool
 opt_if_loop_last_continue(nir_loop *loop)
 {
-   /* Get the last if-stament in the loop */
+   nir_if *nif;
+   bool then_ends_in_continue;
+   bool else_ends_in_continue;
+
+   /* Scan the control flow of the loop from the last to the first node
+    * looking for an if-statement we can optimise.
+    */
    nir_block *last_block = nir_loop_last_block(loop);
    nir_cf_node *if_node = nir_cf_node_prev(&last_block->cf_node);
    while (if_node) {
-      if (if_node->type == nir_cf_node_if)
-         break;
-      if_node = nir_cf_node_prev(if_node);
-   }
+      if (if_node->type == nir_cf_node_if) {
+         nif = nir_cf_node_as_if(if_node);
+         nir_block *then_block = nir_if_last_then_block(nif);
+         nir_block *else_block = nir_if_last_else_block(nif);
 
-   if (!if_node || if_node->type != nir_cf_node_if)
-      return false;
+         then_ends_in_continue = nir_block_ends_in_continue(then_block);
+         else_ends_in_continue = nir_block_ends_in_continue(else_block);
 
-   nir_if *nif = nir_cf_node_as_if(if_node);
-   nir_block *then_block = nir_if_last_then_block(nif);
-   nir_block *else_block = nir_if_last_else_block(nif);
+         /* If both branches end in a jump do nothing, this should be handled
+          * by nir_opt_dead_cf().
+          */
+         if ((then_ends_in_continue || nir_block_ends_in_break(then_block)) &&
+             (else_ends_in_continue || nir_block_ends_in_break(else_block)))
+            return false;
 
-   bool then_ends_in_continue = nir_block_ends_in_continue(then_block);
-   bool else_ends_in_continue = nir_block_ends_in_continue(else_block);
+         /* If continue found stop scanning and attempt optimisation */
+         if (then_ends_in_continue || else_ends_in_continue)
+            break;
+      }
 
-   /* If both branches end in a continue do nothing, this should be handled
-    * by nir_opt_dead_cf().
-    */
-   if ((then_ends_in_continue || nir_block_ends_in_break(then_block)) &&
-       (else_ends_in_continue || nir_block_ends_in_break(else_block)))
-      return false;
+      if_node = nir_cf_node_prev(if_node);
+   }
 
+   /* If we didn't find an if to optimise return */
    if (!then_ends_in_continue && !else_ends_in_continue)
       return false;
 
-   /* if the block after the if/else is empty we bail, otherwise we might end
-    * up looping forever
-    */
+   /* If there is nothing after the if-statement we bail */
    if (&nif->cf_node == nir_cf_node_prev(&last_block->cf_node) &&
       exec_list_is_empty(&last_block->instr_list))
      return false;
@@ -1326,6 +1326,66 @@ opt_if_merge(nir_if *nif)
    return progress;
 }
 
+/* Perform optimisations based on the values we can derive from the evaluation
+ * of if-statement conditions.
+ */
+static bool
+opt_for_known_values(nir_builder *b, nir_if *nif)
+{
+   bool progress = false;
+
+   assert(nif->condition.is_ssa);
+   nir_ssa_def *if_cond = nif->condition.ssa;
+
+   if (if_cond->parent_instr->type != nir_instr_type_alu)
+      return false;
+
+   nir_alu_instr *alu = nir_instr_as_alu(if_cond->parent_instr);
+   switch (alu->op) {
+   case nir_op_feq:
+   case nir_op_ieq: {
+      nir_load_const_instr *load_const = NULL;
+      nir_ssa_def *unknown_val = NULL;
+
+      nir_ssa_def *src0 = alu->src[0].src.ssa;
+      nir_ssa_def *src1 = alu->src[1].src.ssa;
+      if (src0->parent_instr->type == nir_instr_type_load_const) {
+         load_const = nir_instr_as_load_const(src0->parent_instr);
+         unknown_val = src1;
+      } else if (src1->parent_instr->type == nir_instr_type_load_const) {
+         load_const = nir_instr_as_load_const(src1->parent_instr);
+         unknown_val = src0;
+      }
+
+      if (!load_const)
+         return false;
+
+      /* TODO: remove this and support swizzles? */
+      if (unknown_val->num_components != 1 ||
+          load_const->def.num_components != 1)
+         return false;
+
+      /* Replace unknown ssa uses with the known constant */
+      nir_foreach_use_safe(use_src, unknown_val) {
+         nir_cursor cursor = nir_before_src(use_src, false);
+         nir_block *use_block = nir_cursor_current_block(cursor);
+
+         if (nir_block_dominates(nir_if_first_then_block(nif), use_block)) {
+            nir_instr_rewrite_src(use_src->parent_instr, use_src,
+                                  nir_src_for_ssa(&load_const->def));
+            progress = true;
+         }
+      }
+      break;
+   }
+
+   default:
+      return false;
+   }
+
+   return progress;
+}
+
 static bool
 opt_if_cf_list(nir_builder *b, struct exec_list *cf_list)
 {
@@ -1380,6 +1440,7 @@ opt_if_safe_cf_list(nir_builder *b, struct exec_list *cf_list)
          progress |= opt_if_safe_cf_list(b, &nif->then_list);
          progress |= opt_if_safe_cf_list(b, &nif->else_list);
          progress |= opt_if_evaluate_condition_use(b, nif);
+         progress |= opt_for_known_values(b, nif);
          break;
       }
......
@@ -614,6 +614,7 @@ brw_nir_optimize(nir_shader *nir, const struct brw_compiler *compiler,
          OPT(nir_opt_loop_unroll, indirect_mask);
       }
 
       OPT(nir_opt_remove_phis);
+      OPT(nir_opt_gcm, true);
       OPT(nir_opt_undef);
       OPT(nir_lower_pack);
    } while (progress);
@@ -879,6 +880,7 @@ brw_postprocess_nir(nir_shader *nir, const struct brw_compiler *compiler,
    OPT(nir_copy_prop);
    OPT(nir_opt_dce);
    OPT(nir_opt_move_comparisons);
+   OPT(nir_opt_dead_cf);
 
    OPT(nir_lower_bool_to_int32);
......