1. 01 Apr, 2020 1 commit
  2. 23 Mar, 2020 1 commit
  3. 16 Mar, 2020 1 commit
  4. 10 Mar, 2020 1 commit
  5. 14 Feb, 2020 1 commit
    • intel/fs: Set src0 alpha present bit in header when provided in message payload. · 57dee58c
      Francisco Jerez authored
      Currently the "Source0 Alpha Present to RenderTarget" bit of the RT
      write message header is derived from brw_wm_prog_data::replicate_alpha.
      However, the src0_alpha payload is provided whenever it is specified to
      the logical message.  This could theoretically lead to an
      inconsistency if somebody provided a src0_alpha value while
      brw_wm_prog_data::replicate_alpha was false, as I'm planning to do in
      a future commit in order to implement a hardware workaround.
      
      Instead calculate the header bit based on whether a src0_alpha value
      was provided to the logical message, which guarantees the same
      behavior on pre-ICL and ICL+ (the latter used an extended descriptor
      bit for this which didn't suffer from the same issue).  Remove the
      brw_wm_prog_data::replicate_alpha flag.
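      
      A minimal sketch of the new rule (the names and the bit position below
      are hypothetical, for illustration only, not the actual Mesa code): the
      header bit simply follows the presence of the src0_alpha source in the
      logical message.
      
        #include <stdbool.h>
        #include <stdint.h>
        
        /* Sketch only; SRC0_ALPHA_PRESENT is an assumed bit position, not
         * the real i965 definition.  The header bit tracks whether the
         * logical message actually carries src0_alpha, not a prog_data flag. */
        static inline uint32_t
        rt_write_header_bits(uint32_t header, bool src0_alpha_provided)
        {
           const uint32_t SRC0_ALPHA_PRESENT = 1u << 11; /* hypothetical */
           return src0_alpha_provided ? (header | SRC0_ALPHA_PRESENT) : header;
        }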
      Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
  6. 17 Jan, 2020 1 commit
    • intel/fs/gen6: Generalize aligned_pairs_class to SIMD16 aligned barycentrics. · 0dd18d70
      Francisco Jerez authored
      This is mainly meant to avoid shader-db regressions on SNB as we start
      using VGRFs for barycentrics more frequently.  Currently the
      aligned_pairs_class is only useful in SIMD8 mode, because in SIMD16
      mode barycentric vectors are typically 4 GRFs.  This is not a problem
      on Gen4-5, because on those platforms all VGRF allocations are
      pair-aligned in SIMD16 mode.  However on Gen6 we end up using either
      the fast or the slow path of LINTERP rather non-deterministically
      based on the behavior of the register allocator.
      
      Fix it by repurposing aligned_pairs_class to hold PLN-aligned
      registers of whatever the natural size of a barycentric vector is in
      the current dispatch width.
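      
      As a rough illustration of that natural size (a sketch of my own, not
      Mesa code): a barycentric vector carries two interpolation components
      per channel, so it occupies 2 GRFs in SIMD8 and 4 GRFs in SIMD16.
      
        /* Sketch only: GRFs occupied by one barycentric vector at a given
         * dispatch width.  Two 32-bit components per channel, 32-byte GRF. */
        static inline unsigned
        barycentric_vector_grfs(unsigned dispatch_width)
        {
           return 2 * dispatch_width * 4 /* bytes */ / 32 /* bytes per GRF */;
        }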
      
      On SNB this prevents the following shader-db regressions (including
      SIMD32 programs) in combination with the interpolation rework part of
      this series:
      
         total instructions in shared programs: 13983257 -> 14527274 (3.89%)
         instructions in affected programs: 1766255 -> 2310272 (30.80%)
         helped: 0
         HURT: 11608
      
         LOST:   26
         GAINED: 13
      Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
  7. 18 Nov, 2019 1 commit
  8. 11 Oct, 2019 1 commit
  9. 18 Sep, 2019 1 commit
    • intel/compiler: Record whether any pull constant loads occur · 0e4a75f9
      Kenneth Graunke authored
      I would like for iris to be able to avoid setting up SURFACE_STATE
      for UBOs in the common case where all constants are pushed.
      
      Unfortunately, we don't know up front whether everything will be
      pushed: the backend is allowed to demote pushed UBOs to pull loads
      fairly late in the process.  This is probably desirable though, as
      we'd like the backend to be able to re-pull pushed data to break up
      long live ranges in response to register pressure.
      
      Here we simply add an "are there any pull loads at all" boolean to
      prog_data, which is a bit crude but at least allows us to skip work
      in the common "everything pushed" case.  We could skip more work by
      tracking exactly which UBO surfaces are pulled in a bitmask, but I
      wanted to avoid bringing back the old mark_surface_used() mechanism.
      
      Finer-grained tracking could allow us to skip a bit more work when
      multiple UBOs are in use and /some/ are 100% pushed, but others are
      accessed via pulls.  However, I'm not sure how common this is and
      it would save at most 4 pull descriptors, so we defer that for now.
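      
      A minimal sketch of the idea (the field and function names below are
      assumptions for illustration, not the actual interface):
      
        #include <stdbool.h>
        
        /* Sketch only: the compiler sets a single flag when it emits any
         * pull constant load; the driver then skips UBO SURFACE_STATE
         * setup entirely in the all-pushed case. */
        struct prog_data_sketch {
           bool has_pull_constant_loads;
        };
        
        static void
        maybe_setup_ubo_surfaces(const struct prog_data_sketch *prog_data)
        {
           if (!prog_data->has_pull_constant_loads)
              return; /* every constant was pushed; no UBO surfaces needed */
           /* ...otherwise set up SURFACE_STATE for each bound UBO... */
        }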
      Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
  10. 25 Aug, 2019 1 commit
  11. 12 Aug, 2019 2 commits
  12. 01 Aug, 2019 2 commits
  13. 31 Jul, 2019 1 commit
  14. 24 Jul, 2019 3 commits
  15. 10 Jul, 2019 1 commit
  16. 21 May, 2019 1 commit
  17. 14 May, 2019 1 commit
    • intel/compiler: Implement TCS 8_PATCH mode and INTEL_DEBUG=tcs8 · 646924cf
      Kenneth Graunke authored
      Our tessellation control shaders can be dispatched in several modes.
      
      - SINGLE_PATCH (Gen7+) processes a single patch per thread, with each
        channel corresponding to a different patch vertex.  PATCHLIST_N will
        launch (N / 8) threads.  If N is less than 8, some channels will be
        disabled, leaving some hardware capacity untapped.  Conditionals
        based on gl_InvocationID are non-uniform, which means that they'll
        often have to execute both paths.  However, if there are fewer than
        8 vertices, all invocations will happen within a single thread, so
        barriers can become no-ops, which is nice.  We also burn a maximum
        of 4 registers for ICP handles, so we can compile without regard for
        the value of N.  It also works in all cases.
      
      - DUAL_PATCH mode processes up to two patches at a time, where the first
        four channels come from patch 1, and the second group of four come
        from patch 2.  This tries to provide better EU utilization for small
        patches (N <= 4).  It cannot be used in all cases.
      
      - 8_PATCH mode processes 8 patches at a time, with a thread launched per
        vertex in the patch.  Each channel corresponds to the same vertex, but
        in each of the 8 patches.  This utilizes all channels even for small
        patches.  It also makes conditions on gl_InvocationID uniform, leading
        to proper jumps.  Barriers, unfortunately, become real.  Worse, for
        PATCHLIST_N, the thread payload burns N registers for ICP handles
        (see the rough cost sketch after this list).  This can burn up to
        32 registers, or 1/4 of our register file, for
        URB handles.  For Vulkan (and DX), we know the number of vertices at
        compile time, so we can limit the amount of waste.  In GL, the patch
        dimension is dynamic state, so we either would have to waste all 32
        (not reasonable) or guess (badly) and recompile.  This is unfortunate.
        Because we can only spawn 16 thread instances, we can only use this
        mode for PATCHLIST_16 and smaller.  The rest must use SINGLE_PATCH.
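      
      The rough cost sketch below restates the ICP-handle register math from
      the SINGLE_PATCH and 8_PATCH descriptions above; it is illustrative
      only, not driver code.
      
        /* Sketch only: GRFs spent on ICP handles per TCS thread.
         * 8_PATCH burns one GRF per patch vertex (up to 32); SINGLE_PATCH
         * packs 8 handles per GRF and so never needs more than 4. */
        static unsigned
        tcs_icp_handle_grfs(bool eight_patch, unsigned patch_vertices)
        {
           if (eight_patch)
              return patch_vertices;            /* N GRFs, up to 32 */
        
           unsigned grfs = (patch_vertices + 7) / 8;
           return grfs > 4 ? 4 : grfs;          /* capped at 4 GRFs */
        }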
      
      This patch implements the new 8_PATCH TCS mode, but leaves us using
      SINGLE_PATCH by default.  A new INTEL_DEBUG=tcs8 flag will switch to
      using 8_PATCH mode for testing and benchmarking purposes.  We may
      want to consider using 8_PATCH mode in Vulkan in some cases.
      
      The data I've seen shows that 8_PATCH mode can be more efficient in
      some cases, but SINGLE_PATCH mode (the one we use today) is faster
      in other cases.  Ultimately, the TES matters much more than the TCS
      for performance, so the decision may not matter much.
      Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
  18. 16 Apr, 2019 1 commit
    • i965: Move program key debugging to the compiler. · fad7801a
      Kenneth Graunke authored
      The i965 driver has a bunch of code to compare two sets of program keys
      and print out the differences.  This can be useful for debugging why a
      shader needed to be recompiled on the fly due to non-orthogonal state
      dependencies.  anv doesn't do recompiles, so we didn't need to share
      this in the past - but I'd like to use it in iris.
      
      This moves the bulk of the code to the compiler where it can be reused.
      To make that possible, we need to decouple it from i965 - we can't get
      at the brw program cache directly, nor use brw_context to print things.
      Instead, we use compiler->shader_perf_log(), and simply pass in keys.
      
      We put all of this debugging code in brw_debug_recompile.c, and only
      export a single function, for simplicity.  I also tidied the code a
      bit while moving it, now that it all lives in one file.
      Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
  19. 25 Mar, 2019 1 commit
    • i965,iris,anv: Make alpha to coverage work with sample mask · c8abe03f
      Danylo Piliaiev authored
      From "Alpha Coverage" section of SKL PRM Volume 7:
       "If Pixel Shader outputs oMask, AlphaToCoverage is disabled in
        hardware, regardless of the state setting for this feature."
      
      From OpenGL spec 4.6, "15.2 Shader Execution":
       "The built-in integer array gl_SampleMask can be used to change
       the sample coverage for a fragment from within the shader."
      
      From OpenGL spec 4.6, "17.3.1 Alpha To Coverage":
       "If SAMPLE_ALPHA_TO_COVERAGE is enabled, a temporary coverage value
        is generated where each bit is determined by the alpha value at the
        corresponding sample location. The temporary coverage value is then
        ANDed with the fragment coverage value to generate a new fragment
        coverage value."
      
      Similar wording can be found in the Vulkan spec 1.1.100, section
      "25.6. Multisample Coverage".
      
      Thus we need to compute the alpha-to-coverage dither mask manually in
      the shader and replace the sample mask store with the bitwise AND of
      the sample mask and the alpha-to-coverage dither mask.
      
      The following formula is used to compute final sample mask:
        m = int(16.0 * clamp(src0_alpha, 0.0, 1.0))
        dither_mask = 0x1111 * ((0xfea80 >> (m & ~3)) & 0xf) |
           0x0808 * (m & 2) | 0x0100 * (m & 1)
        sample_mask = sample_mask & dither_mask
      Credits to Francisco Jerez <currojerez@riseup.net> for creating it.
      
      It yields a number of set bits proportional to the alpha value in the
      2, 4, 8 or 16 least significant bits of the result.
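      
      For illustration, the formula can be restated as a small standalone C
      helper (a sketch only; in the driver the computation is emitted as
      fragment shader IR, not executed on the CPU):
      
        #include <stdint.h>
        
        /* Dither mask from the formula above: clamp alpha to [0, 1],
         * quantize it to 0..16, then expand into a 16-bit coverage mask. */
        static uint32_t
        alpha_to_coverage_dither_mask(float src0_alpha)
        {
           float a = src0_alpha < 0.0f ? 0.0f :
                     src0_alpha > 1.0f ? 1.0f : src0_alpha;
           uint32_t m = (uint32_t)(16.0f * a);
        
           return 0x1111 * ((0xfea80 >> (m & ~3u)) & 0xf) |
                  0x0808 * (m & 2) |
                  0x0100 * (m & 1);
        }
        
        /* sample_mask &= alpha_to_coverage_dither_mask(src0_alpha); */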
      
      Gen6 hardware does not have an issue with simultaneous use of sample
      mask and alpha to coverage; however, due to the wrong sending order of
      oMask and src0_alpha, it is still affected by this.
      
      Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109743
      Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
      Reviewed-by: Francisco Jerez <currojerez@riseup.net>
  20. 06 Mar, 2019 1 commit
  21. 26 Feb, 2019 1 commit
  22. 21 Feb, 2019 1 commit
  23. 12 Feb, 2019 1 commit
  24. 13 Jan, 2019 1 commit
    • i965: Drop mark_surface_used mechanism. · 04c2f12a
      Kenneth Graunke authored
      The original idea was that the backend compiler could eliminate
      surfaces, so we would have it mark which ones are actually used,
      then shrink the binding table accordingly.  Unfortunately, it's a
      pretty blunt mechanism - it can only prune things from the end,
      not the middle - since we decide the layout before we even start
      the backend compiler, and only limit the size.  It also basically
      gives up if it sees indirect array access.
      
      Besides, we do the vast majority of our surface elimination in NIR
      anyway, not the backend - and I don't see that trend changing any
      time soon.  Vulkan abandoned this plan a long time ago, and I don't
      use it in Iris, but it's still been kicking around in i965.
      
      I hacked shader-db to print the binding table size in bytes, and
      observed no changes with this patch.  So, this code appears to do
      nothing useful.
      Acked-by: Jason Ekstrand <jason@jlekstrand.net>
  25. 20 Nov, 2018 1 commit
    • i965: Do NIR shader cloning in the caller. · 562448b7
      Kenneth Graunke authored
      This moves nir_shader_clone() to the driver-specific compile function,
      rather than the shared src/intel/compiler code.  This allows i965 to do
      key-specific passes before calling brw_compile_*.  Vulkan should not
      need this cloning as it doesn't compile multiple variants.
      
      We do need to continue cloning in the compute shader code because we
      lower various things in NIR based on the SIMD width.
      Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
  26. 12 Nov, 2018 1 commit
  27. 29 Aug, 2018 1 commit
  28. 02 Aug, 2018 2 commits
  29. 02 Jul, 2018 1 commit
  30. 28 Jun, 2018 4 commits
  31. 02 May, 2018 2 commits