  1. Jan 20, 2020
    • lima/ppir: handle write to dead registers in ppir · d01ffcd9
      Erico Nunes authored
      
      
      nir can output writes to dead registers when expanding vec4 operations
      to non-ssa registers. In that case, some components of the vec4 may be
      assigned but never read. The ppir scheduler reorders instructions and
      may place such an instruction writing to a dead register somewhere else
      in the program.
      In order to prevent regalloc from allocating a live register for this
      operation, an interference must be assigned to it during liveness
      analysis.
      
      This workaround may be removed in the future if the assignments to dead
      components can be removed earlier in ppir or nir.
      
      Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
    • lima/ppir: reserve register for undef operations · 621de4d8
      Erico Nunes authored
      
      
      Even though the value of undef operations doesn't really matter,
      regalloc must ensure that a live register won't be used.
      Otherwise, it may overwrite a live value and cause incorrect results.
      
      Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
    • lima/ppir: fix src read mask swizzling · cf730d18
      Erico Nunes authored
      
      
      The src mask can't be calculated from the dest write_mask.
      Instead, it must be calculated from the swizzle of the src operand.
      Otherwise, liveness calculation may report incorrect live components for
      non-ssa registers.
      
      Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
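The idea in the commit above can be sketched as follows. This is a minimal illustration, not the lima code: the function name `src_read_mask`, the 4-bit mask encoding (bit i stands for component i), and the choice to look only at enabled destination channels are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of ppir: return the set of source
 * components that are actually read. Bit i of a mask stands for
 * component i (x = bit 0 ... w = bit 3). swizzle[i] names the source
 * component that feeds destination channel i.
 */
static uint8_t src_read_mask(uint8_t dest_write_mask, const uint8_t swizzle[4])
{
    uint8_t mask = 0;

    for (int i = 0; i < 4; i++) {
        assert(swizzle[i] < 4);
        /* Only enabled destination channels read anything. */
        if (dest_write_mask & (1u << i))
            mask |= (uint8_t)(1u << swizzle[i]);
    }
    return mask;
}
```

For example, `dst.x = src.z` has write mask 0x1 but reads only component z (0x4), so reusing the write mask as the read mask would report x as live instead of z.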
  2. Jan 18, 2020
  3. Jan 17, 2020
    • freedreno/a6xx: add PROG_FB_RAST stateobj · 95187083
      Rob Clark authored
      
      
      For the handful of registers that depend on the union of program/
      framebuffer/rasterizer state.
      
      Signed-off-by: Rob Clark <robdclark@chromium.org>
      Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
      Tested-by: Marge Bot <!3435>
      Part-of: <!3435>
    • freedreno/a6xx: move dynamic program state to streaming stateobj · 6dc9b292
      Rob Clark authored
      
      
      Move the program state which we can't pre-bake to a streaming state
      object, rather than emitting directly in the draw cmdstream.
      
      Signed-off-by: Rob Clark <robdclark@chromium.org>
      Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
      Part-of: <!3435>
    • Rob Clark's avatar
      d2fd6469
    • freedreno/a6xx: separate rast stateobj for prim restart · 4d8f42c8
      Rob Clark authored
      
      
      This lets us move PC_PRIMITIVE_CNTL into the rasterizer stateobj, rather
      than unconditionally emitting it directly in the cmdstream on every
      draw.
      
      This also starts adding some tracking about previous draw state, so that
      following patches can limit some of the register writes we currently
      emit on every draw.
      
      Signed-off-by: Rob Clark <robdclark@chromium.org>
      Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
      Part-of: <!3435>
    • freedreno/a6xx: cleanup rasterizer state · 0e063b30
      Rob Clark authored
      
      
      All but one of the reg values is only used in the stateobj, so we can
      inline the register value setup and stateobj construction.  While we
      are at it, switch over to the new register builders.
      
      Prep work for next patch.
      
      Signed-off-by: Rob Clark <robdclark@chromium.org>
      Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
      Part-of: <!3435>
    • freedreno/a6xx: limit scratch/debug markers to debug builds · fba7e6f8
      Rob Clark authored
      
      
      The overhead does seem to matter when you have a high enough # of draw
      calls that affect few bins/pixels, because these writes would happen
      unconditionally (i.e. not part of a state-group).
      
      Possibly we could keep these if we moved them into a state-group so the
      register writes would be no-ops on bins with no geometry.  OTOH I
      usually end up adding in a WFI when using the scratch reg values to
      track down a crash.  (So add a WFI to mitigate the annoyance of needing
      to use a debug build to get scratch regs to locate the position of a
      crash/hang in the cmdstream.)
      
      Signed-off-by: Rob Clark <robdclark@chromium.org>
      Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
      Part-of: <!3435>
    • Jordan Justen's avatar
    • util/vector: Fix u_vector_foreach when head rolls over · c1104e4c
      Craig Stout authored
      Also add unit tests for u_vector.
      
      Tested-by: Marge Bot <!3453>
      Part-of: <!3453>
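The class of bug named in the commit title above can be sketched as follows. This is a minimal illustration under stated assumptions, not the u_vector code: `struct ring`, `RING_SIZE`, and the three functions are hypothetical, and the sketch assumes free-running 32-bit head/tail counters that are masked to index a power-of-two buffer.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8u /* must be a power of two so masking works */

/* Hypothetical ring buffer with free-running counters; not the actual
 * u_vector layout. */
struct ring {
    uint32_t head; /* total elements ever pushed (wraps at 2^32) */
    uint32_t tail; /* total elements ever popped (wraps at 2^32) */
    int data[RING_SIZE];
};

static void ring_push(struct ring *r, int value)
{
    assert(r->head - r->tail < RING_SIZE); /* not full; unsigned math wraps */
    r->data[r->head & (RING_SIZE - 1)] = value;
    r->head++;
}

/* Broken: "i < head" is false as soon as head has wrapped past tail. */
static int ring_sum_naive(const struct ring *r)
{
    int sum = 0;
    for (uint32_t i = r->tail; i < r->head; i++)
        sum += r->data[i & (RING_SIZE - 1)];
    return sum;
}

/* Wrap-safe: compare for inequality so rollover of i is harmless. */
static int ring_sum_wrap_safe(const struct ring *r)
{
    int sum = 0;
    for (uint32_t i = r->tail; i != r->head; i++)
        sum += r->data[i & (RING_SIZE - 1)];
    return sum;
}
```

If `tail` sits just below `UINT32_MAX` and a few pushes carry `head` past zero, the naive loop visits nothing while the wrap-safe loop still visits every live element.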
    • intel/fs: Switch to standard vector layout for barycentrics at optimization time. · b54b67e0
      Francisco Jerez authored
      
      
      This involves permuting the registers of barycentric vectors to have
      the standard X[0-n] Y[0-n] layout at NIR translation time.
      Barycentrics are converted to the format expected by the PLN
      instruction in the lower_barycentrics() pass run after the
      optimization loop.
      
      Main reason is correctness of SIMD32 fragment shaders.  The
      shuffle_from_pln_layout() and shuffle_to_pln_layout() helpers used
      during NIR translation are busted for SIMD32.  This leads to serious
      corruption at present with INTEL_DEBUG=do32, especially on Gen11+
      where these helpers are hit more frequently due to the lack of a
      hardware PLN instruction.
      
      Of course one could have chosen to fix those helpers instead, but
      there is another far more subtle issue that was reported during review
      of the SIMD32 fragment shader codegen changes: The SIMD splitting pass
      currently handles SIMD32 barycentric vectors as if they had the
      standard X[0-n] Y[0-n] layout, even though they are interleaved for
      the PLN instruction, which causes incorrect execution masks to be
      applied to the MOVs unzipping barycentric vectors in cases where a
      LINTERP instruction occurs under non-uniform control flow.
      
      I'm not aware of any conformance regressions due to the latter issue
      at present, but for our peace of mind let's move the conversion to the
      PLN layout into the lower_barycentrics() pass run after
      lower_simd_width().
      
      This leads to the following shader-db improvements (including SIMD32
      shaders) in combination with the previous back-end preparation changes
      -- Without them (especially the copy propagation changes) this would
      lead to a massive number of regressions.  On ICL:
      
         total instructions in shared programs: 20662316 -> 20466903 (-0.95%)
         instructions in affected programs: 10538474 -> 10343061 (-1.85%)
         helped: 68775
         HURT: 6
      
         total spills in shared programs: 8938 -> 8748 (-2.13%)
         spills in affected programs: 376 -> 186 (-50.53%)
         helped: 9
         HURT: 5
      
         total fills in shared programs: 8965 -> 8663 (-3.37%)
         fills in affected programs: 965 -> 663 (-31.30%)
         helped: 9
         HURT: 6
      
         LOST:   146
         GAINED: 43
      
      On SKL:
      
         total instructions in shared programs: 18725867 -> 18614912 (-0.59%)
         instructions in affected programs: 3876590 -> 3765635 (-2.86%)
         helped: 27492
         HURT: 2
      
         LOST:   191
         GAINED: 417
      
      On SNB:
      
         total instructions in shared programs: 14573613 -> 13980646 (-4.07%)
         instructions in affected programs: 5199074 -> 4606107 (-11.41%)
         helped: 29998
         HURT: 0
      
         LOST:   21
         GAINED: 30
      
      Results are somewhat less impressive but still significant without
      SIMD32 fragment shaders enabled.  On ICL:
      
         total instructions in shared programs: 16148728 -> 16061659 (-0.54%)
         instructions in affected programs: 6114788 -> 6027719 (-1.42%)
         helped: 42046
         HURT: 6
      
         total spills in shared programs: 8218 -> 8028 (-2.31%)
         spills in affected programs: 376 -> 186 (-50.53%)
         helped: 9
         HURT: 5
      
         total fills in shared programs: 8953 -> 8651 (-3.37%)
         fills in affected programs: 965 -> 663 (-31.30%)
         helped: 9
         HURT: 6
      
         LOST:   0
         GAINED: 3
      
      On SKL:
      
         total instructions in shared programs: 14927994 -> 14926738 (-0.01%)
         instructions in affected programs: 168850 -> 167594 (-0.74%)
         helped: 711
         HURT: 2
      
      On SNB:
      
         total instructions in shared programs: 10770538 -> 10734403 (-0.34%)
         instructions in affected programs: 2702172 -> 2666037 (-1.34%)
         helped: 17818
         HURT: 0
      
      All of the hurt shaders are either spilling slightly more or emitting
      additional NOP instructions due to the SIMD16 POW workaround for
      Gen8-9 combined with differences in scheduling.
      
      Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
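The difference between the two layouts discussed above can be sketched as a plain array permutation. This is an illustration only, not the lower_barycentrics() pass: the function name, the constant `CHANNELS_PER_REG`, and the assumption that the interleaved layout alternates one 8-channel register of X with one 8-channel register of Y are all hypothetical choices made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define CHANNELS_PER_REG 8 /* assumption: 8 float channels per register */

/*
 * Hypothetical illustration, not the Mesa pass: convert a two-component
 * vector from the standard layout (all X channels, then all Y channels)
 * to an interleaved layout that alternates one register of X with one
 * register of Y. Both arrays hold 2 * width floats.
 */
static void to_interleaved_layout(const float *std, float *out, size_t width)
{
    assert(width % CHANNELS_PER_REG == 0);
    size_t groups = width / CHANNELS_PER_REG;

    for (size_t g = 0; g < groups; g++) {
        for (size_t c = 0; c < 2; c++) { /* c = 0: X, c = 1: Y */
            for (size_t ch = 0; ch < CHANNELS_PER_REG; ch++) {
                out[(2 * g + c) * CHANNELS_PER_REG + ch] =
                    std[c * width + g * CHANNELS_PER_REG + ch];
            }
        }
    }
}
```

With a width of 16, the standard order X[0-15] Y[0-15] becomes X[0-7] Y[0-7] X[8-15] Y[8-15]. Code that splits the vector by channel group has to know which of the two orders it is looking at, which is why doing the conversion after SIMD splitting avoids the mismatch described in the commit message.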
    • intel/fs: Introduce barycentric layout lowering pass. · 79bd252d
      Francisco Jerez authored
      
      
      The goal is to represent barycentrics with the standard vector layout
      during optimization and particularly SIMD lowering.  Instead of
      emitting the barycentric layout conversions at NIR translation time,
      do it later as a lowering pass.  For the moment this is only applied
      to PI messages, but we'll give the same treatment to LINTERP
      instructions too.
      
      Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
    • intel/fs: Split fetch_payload_reg() into separate helper for barycentrics. · 44d7d66a
      Francisco Jerez authored
      
      
      We're about to change the layout of barycentric vectors, which will
      involve permuting the GRFs of barycentrics fetched from the thread
      payload.  Make room for this in a function separate from the generic
      fetch_payload_reg(), since the permutation will only be applicable to
      barycentric vectors.  This allows simplifying fetch_payload_reg(),
      since there was no need for handling multiple-component payload
      registers except for barycentrics.
      
      This causes some minor shader-db noise due to the new helper emitting
      a LOAD_PAYLOAD instruction unconditionally, but it will be cleaned up
      shortly.
      
      Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
    • intel/fs/gen6: Use SEL instead of bashing thread payload for unlit centroid workaround. · 9c9e8010
      Francisco Jerez authored
      
      
      This prevents regressions on SNB due to the redundant MOVs lying
      around in cases where fetch_payload_reg() returns a VGRF (currently
      only in SIMD32 but soon in pretty much all cases).  The MOVs can't be
      register-coalesced due to their source being a FIXED_GRF, and they
      can't be copy-propagated either due to the unlit centroid workaround
      partial writes.  They can be copy-propagated just fine into a SEL
      instruction though.
      
      On SNB this prevents the following shader-db regressions (including
      SIMD32 programs) in combination with the interpolation rework part of
      this series:
      
         total instructions in shared programs: 13996898 -> 14001982 (0.04%)
         instructions in affected programs: 197461 -> 202545 (2.57%)
         helped: 0
         HURT: 1251
      
      Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
    • intel/fs/gen6: Generalize aligned_pairs_class to SIMD16 aligned barycentrics. · 0dd18d70
      Francisco Jerez authored
      
      
      This is mainly meant to avoid shader-db regressions on SNB as we start
      using VGRFs for barycentrics more frequently.  Currently the
      aligned_pairs_class is only useful in SIMD8 mode, because in SIMD16
      mode barycentric vectors are typically 4 GRFs.  This is not a problem
      on Gen4-5, because on those platforms all VGRF allocations are
      pair-aligned in SIMD16 mode.  However on Gen6 we end up using either
      the fast or the slow path of LINTERP rather non-deterministically
      based on the behavior of the register allocator.
      
      Fix it by repurposing aligned_pairs_class to hold PLN-aligned
      registers of whatever the natural size of a barycentric vector is in
      the current dispatch width.
      
      On SNB this prevents the following shader-db regressions (including
      SIMD32 programs) in combination with the interpolation rework part of
      this series:
      
         total instructions in shared programs: 13983257 -> 14527274 (3.89%)
         instructions in affected programs: 1766255 -> 2310272 (30.80%)
         helped: 0
         HURT: 11608
      
         LOST:   26
         GAINED: 13
      
      Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
    • intel/fs/gen6: Constrain barycentric source of LINTERP during bank conflict mitigation. · 0db4455c
      Francisco Jerez authored
      
      
      This avoids regressions on SNB due to the bank conflict mitigation
      pass moving a VGRF-allocated barycentric vector to a misaligned
      location, which would prevent the PLN instruction from being used.
      
      Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
    • intel/fs/gen4-6: Allocate registers from aligned_pairs_class based on LINTERP use. · 369aef85
      Francisco Jerez authored
      
      
      Previously we would hardcode fs_visitor::delta_xy barycentrics to be
      allocated from aligned_pairs_class on hardware with PLN source
      alignment restrictions (pre-Gen7).  Instead allocate any registers
      consumed by LINTERP from aligned_pairs_class, even if some barycentric
      vector had ended up in a temporary.
      
      On SNB this prevents the following shader-db regressions (including
      SIMD32 programs) in combination with the interpolation rework part of
      this series:
      
         total instructions in shared programs: 13983257 -> 14527274 (3.89%)
         instructions in affected programs: 1766255 -> 2310272 (30.80%)
         helped: 0
         HURT: 11608
      
         LOST:   26
         GAINED: 13
      
      Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
    • intel/fs: Allow limited copy propagation of a LOAD_PAYLOAD into another. · 54b1b71e
      Francisco Jerez authored
      
      
      This is particularly useful in cases where register coalesce is
      unlikely to succeed because the LOAD_PAYLOAD isn't a plain copy --
      E.g. when a LOAD_PAYLOAD is shuffling the contents of a barycentric
      vector in order to transform it into the PLN layout.
      
      This prevents the following shader-db regressions (including SIMD32
      programs) in combination with the interpolation rework part of this
      series.  On SKL:
      
         total instructions in shared programs: 18596672 -> 18976097 (2.04%)
         instructions in affected programs: 7937041 -> 8316466 (4.78%)
         helped: 39
         HURT: 67427
      
         LOST:   466
         GAINED: 220
      
      On SNB:
      
         total instructions in shared programs: 13993866 -> 14202963 (1.49%)
         instructions in affected programs: 7611309 -> 7820406 (2.75%)
         helped: 624
         HURT: 52943
      
         LOST:   6
         GAINED: 18
      
      Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
    • intel/fs: Add support for copy-propagating a block of multiple FIXED_GRFs. · 8eb4f209
      Francisco Jerez authored
      
      
      In cases where a LOAD_PAYLOAD instruction copies a single block of
      sequential GRF registers into the destination (see
      is_identity_payload()), splitting the block copy into a number of ACP
      entries (one for each LOAD_PAYLOAD source) is undesirable, because
      that prevents copy propagation into any instructions which read
      multiple components at once with the same source (the barycentric
      source of the LINTERP instruction is going to be the overwhelmingly
      most common example).
      
      Technically it would also be possible to do this for VGRF sources, but
      there is little benefit from that since register coalesce already
      covers many of those cases -- There is no way for a block of
      FIXED_GRFs to be coalesced into a VGRF though.
      
      This prevents the following shader-db regressions (including SIMD32
      programs) in combination with the interpolation rework part of this
      series.  On SKL:
      
         total instructions in shared programs: 18595160 -> 18828562 (1.26%)
         instructions in affected programs: 13374946 -> 13608348 (1.75%)
         helped: 7
         HURT: 108977
      
         total spills in shared programs: 9116 -> 9106 (-0.11%)
         spills in affected programs: 404 -> 394 (-2.48%)
         helped: 7
         HURT: 9
      
         total fills in shared programs: 8994 -> 9176 (2.02%)
         fills in affected programs: 898 -> 1080 (20.27%)
         helped: 7
         HURT: 9
      
         LOST:   469
         GAINED: 220
      
      On SNB:
      
         total instructions in shared programs: 13996898 -> 14096222 (0.71%)
         instructions in affected programs: 8088546 -> 8187870 (1.23%)
         helped: 2
         HURT: 66520
      
         total spills in shared programs: 2985 -> 2961 (-0.80%)
         spills in affected programs: 632 -> 608 (-3.80%)
         helped: 2
         HURT: 0
      
         total fills in shared programs: 3144 -> 3128 (-0.51%)
         fills in affected programs: 1515 -> 1499 (-1.06%)
         helped: 2
         HURT: 0
      
         LOST:   0
         GAINED: 4
      
      Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
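The "single block of sequential GRF registers" condition described above can be sketched as a simple check. This is a hypothetical illustration, not Mesa's is_identity_payload(): the function name `is_sequential_block` and the representation of sources as plain register numbers are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical check, not Mesa's is_identity_payload(): do the n source
 * register numbers form a single ascending block
 * regs[0], regs[0] + 1, ..., regs[0] + n - 1?
 */
static bool is_sequential_block(const unsigned *regs, size_t n)
{
    assert(n > 0);
    for (size_t i = 1; i < n; i++) {
        if (regs[i] != regs[0] + (unsigned)i)
            return false;
    }
    return true;
}
```

When the check holds, a copy-propagation pass could record one entry covering the whole block rather than one entry per source, so that an instruction reading several components through the same source can still be matched.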