1. 17 Jun, 2018 1 commit
  2. 16 Jun, 2018 24 commits
    • José Casanova Crespo's avatar
      intel/fs: Use shuffle_from_32bit_read for 64-bit FS load_input · 71b319a2
      José Casanova Crespo authored
      The previous use of shuffle_32bit_load_result_to_64bit_data had a
      source/destination overlap for 64-bit. A temporary destination is now
      used for the 64-bit cases, because shuffle_from_32bit_read doesn't
      handle src/dst overlaps.
      Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
      71b319a2
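To illustrate why a source/destination overlap matters, here is a minimal Python model, not the driver code. It assumes a layout in which a 32-bit load returns all low dwords followed by all high dwords, while a 64-bit value needs each channel's low and high dwords adjacent. The function name `shuffle_32bit_pairs`, the list-based "register" and the width of 4 channels are all hypothetical.

```python
def shuffle_32bit_pairs(dst, src, width):
    """Interleave `width` low dwords and `width` high dwords from src into dst.

    Modeled as two strided MOVs. Each MOV reads its whole source first and
    then writes every second dword of the destination.
    """
    for half in (0, 1):
        values = list(src[half * width:(half + 1) * width])  # read phase
        for channel, value in enumerate(values):              # write phase
            dst[2 * channel + half] = value
    return dst


width = 4
loaded = [10, 11, 12, 13, 20, 21, 22, 23]    # low dwords, then high dwords
expected = [10, 20, 11, 21, 12, 22, 13, 23]  # lo/hi pairs per channel

# Separate temporary destination: the source is never clobbered.
separate = shuffle_32bit_pairs([0] * (2 * width), list(loaded), width)

# Overlapping source and destination: the first MOV overwrites high dwords
# before the second MOV has read them.
reg = list(loaded)
in_place = shuffle_32bit_pairs(reg, reg, width)

print(separate == expected, in_place == expected)
```

In this model the run with a separate destination produces the interleaved layout, while the overlapping run does not, which is the situation the temporary destination avoids.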
    • José Casanova Crespo's avatar
      intel/fs: shuffle_from_32bit_read at load_per_vertex_input at TCS/TES · 8003ae87
      José Casanova Crespo authored
      Previously, the shuffle function had a source/destination overlap, which
      must be avoided in order to use shuffle_from_32bit_read. We can use the
      destination of the removed MOVs as the shuffle destination.
      
      This change also avoids the internal MOVs done by the previous shuffle
      to deal with possible overlaps.
      Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
      8003ae87
    • José Casanova Crespo's avatar
      intel/fs: Use shuffle_from_32bit_read at VS load_input · 5565630f
      José Casanova Crespo authored
      shuffle_from_32bit_read manages 32-bit reads to a 32-bit destination
      in the same way as the previous loop did, so now we just call the new
      function for all bit sizes, which also simplifies the 64-bit load_input.
      
      v2: Add comment about future 16-bit support (Jason Ekstrand)
      Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
      5565630f
    • José Casanova Crespo's avatar
      intel/fs: Use shuffle_from_32bit_read for 64-bit gs_input_load · 152bffb6
      José Casanova Crespo authored
      This implementation avoids two unneeded MOVs for each 64-bit
      component. One was done in the old shuffle to avoid cases of
      src/dst overlap, which do not occur here. The other removed MOV
      was already being done in the shuffle.
      
      Copy propagation wasn't able to remove them because shuffle
      destination values are defined with partial writes, as they
      have stride == 2.
      
      v2: Reword commit log summary (Jason Ekstrand)
      Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
      152bffb6
    • José Casanova Crespo's avatar
      intel/fs: shuffle_from_32bit_read for 64-bit do_untyped_vector_read · 8b26a2d9
      José Casanova Crespo authored
      do_untyped_vector_read is used at load_ssbo and load_shared.
      
      The previous MOVs are removed because shuffle_from_32bit_read
      can handle storing the shuffle results in the expected destination
      just using the proper offset.
      Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
      8b26a2d9
    • José Casanova Crespo's avatar
      intel/fs: Use shuffle_from_32bit_read to read 16-bit SSBO · 20e4732f
      José Casanova Crespo authored
      Using shuffle_from_32bit_read instead of the 16-bit shuffle functions
      avoids the need to retype. At the same time, the new functions are
      ready for 8-bit SSBO reads.
      Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
      20e4732f
    • José Casanova Crespo's avatar
      intel/fs: Use shuffle_from_32bit_read at VARYING_PULL_CONSTANT_LOAD · a0891eab
      José Casanova Crespo authored
      shuffle_from_32bit_read can manage the shuffle/unshuffle needed
      for the different 8/16/32/64 bit-sizes at VARYING PULL CONSTANT LOAD.
      The first_component parameter is used to get the specific component.
      
      In the case of the previous 16-bit shuffle, the shuffle operation was
      generating unneeded MOVs whose results were never used. This
      behaviour went unnoticed on SIMD16 because the dead_code_eliminate
      pass removed the generated instructions, but on SIMD8 they couldn't be
      removed because they were partial writes.
      Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
      a0891eab
    • José Casanova Crespo's avatar
      intel/fs: New shuffle_for_32bit_write and shuffle_from_32bit_read · 22c65494
      José Casanova Crespo authored
      These new shuffle functions deal with the shuffle/unshuffle operations
      needed for read/write operations using 32-bit components when the
      read/written components have a different bit-size (8, 16 or 64 bits).
      A shuffle from 32-bit to 32-bit becomes a simple MOV.
      
      shuffle_src_to_dst takes care of doing a shuffle when the source type is
      smaller than the destination type and an unshuffle when the source type
      is bigger than the destination. So these new read/write functions just
      need to call shuffle_src_to_dst, assuming that writes use a 32-bit
      destination and reads use a 32-bit source.
      
      As shuffle_for_32bit_write/from_32bit_read take components in units of
      the source/destination types and shuffle_src_to_dst takes units of the
      smallest type component, we adjust the components and first_component
      parameters.
      
      These new functions require that there is no source/destination overlap
      in the case of shuffle_from_32bit_read. An overlap never happens in
      shuffle_for_32bit_write, as it allocates a new destination register, as
      shuffle_64bit_data_for_32bit_write did.
      
      v2: Reword commit log and add comments to explain why first_component
          and components parameters are adjusted. (Jason Ekstrand)
      Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
      22c65494
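The unit adjustment described in this commit message can be sketched as plain arithmetic. This is an illustrative Python helper, not the Mesa implementation; it assumes the other side of the shuffle is always 32-bit, as the message states, and `to_smallest_units` is a hypothetical name.

```python
def to_smallest_units(first_component, components, type_bits, other_bits=32):
    """Convert counts expressed in units of a `type_bits`-wide type into
    units of the smaller of the two types involved in the shuffle.

    Assumption: `other_bits` is the 32-bit side of the read or write.
    """
    smallest = min(type_bits, other_bits)
    scale = type_bits // smallest
    return first_component * scale, components * scale


# A 64-bit destination read from 32-bit data: counts double, because the
# smallest type is the 32-bit one.
print(to_smallest_units(1, 2, 64))

# A 16-bit destination read from 32-bit data: counts are already expressed
# in units of the smallest (16-bit) type, so nothing changes.
print(to_smallest_units(1, 2, 16))
```

Only the case where the non-32-bit type is wider than 32 bits needs scaling in this model; for 8-bit and 16-bit types the caller's units already match the smallest type.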
    • José Casanova Crespo's avatar
      intel/fs: general 8/16/32/64-bit shuffle_src_to_dst function · a5665056
      José Casanova Crespo authored
      This new function takes care of shuffling/unshuffling components of a
      particular bit-size into components with a different bit-size.
      
      If the source type size is smaller than the destination type size, the
      operation needed is a component shuffle. The opposite case is an
      unshuffle.
      
      Component units are measured in terms of the smaller type between
      source and destination, as we are un/shuffling the smaller components
      from/into a bigger one.
      
      The operation allows skipping the first first_component components of
      the source.
      
      Shuffle MOVs are retyped using integer types, avoiding problems with
      denorms and float types if the source and destination bit sizes differ.
      This simplifies the callers of the shuffle functions, which currently
      handle these retypes individually.
      
      There is now a new restriction: source and destination can no longer
      overlap when calling this shuffle function. The following patches that
      migrate to this new function each take care of avoiding source and
      destination overlaps.
      
      v2: (Jason Ekstrand)
          - Rewrite overlap asserts.
          - Manage type_sz(src.type) == type_sz(dst.type) case using MOVs
            from source to dest. This works for 64-bit to 64-bit
            operations on Gen7, as it doesn't support Q registers.
          - Explain that component units are based on the smallest type.
      v3: - Fix unshuffle overlap assert (Jason Ekstrand)
      Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
      a5665056
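The shuffle/unshuffle behaviour described in this commit message can be modeled for a single channel with plain integers. The following Python sketch is an assumption-laden illustration, not the driver code: `model_shuffle` is a hypothetical name, components are Python integers, and smaller components are assumed to be packed into a bigger one starting from the least significant bits.

```python
def model_shuffle(src, src_bits, dst_bits, first_component, components):
    """Model of a shuffle/unshuffle for one channel.

    `first_component` and `components` are counted in units of the smaller
    of the two types, as described in the commit message.
    """
    if src_bits == dst_bits:
        # Same size: plain copies, skipping first_component entries.
        return list(src[first_component:first_component + components])

    mask = (1 << min(src_bits, dst_bits)) - 1

    if src_bits < dst_bits:
        # Shuffle: pack several small source components into each big one.
        ratio = dst_bits // src_bits
        dst = [0] * ((components + ratio - 1) // ratio)
        for i in range(components):
            value = src[first_component + i] & mask
            dst[i // ratio] |= value << (src_bits * (i % ratio))
        return dst

    # Unshuffle: split each big source component into small ones.
    ratio = src_bits // dst_bits
    dst = []
    for i in range(components):
        unit = first_component + i
        dst.append((src[unit // ratio] >> (dst_bits * (unit % ratio))) & mask)
    return dst


packed = model_shuffle([0x1111, 0x2222, 0x3333, 0x4444], 16, 32, 0, 4)
unpacked = model_shuffle(packed, 32, 16, 1, 2)
print([hex(v) for v in packed], [hex(v) for v in unpacked])
```

Here the unshuffle skips one 16-bit unit and returns the next two, which shows why the skip count is expressed in units of the smaller type.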
    • Bas Nieuwenhuizen's avatar
      ac: Clear meminfo to avoid valgrind warning. · c4714f69
      Bas Nieuwenhuizen authored
      Somehow valgrind misses that the value is initialized by the ioctl.
      Reviewed-by: Dave Airlie <airlied@redhat.com>
      Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
      c4714f69
    • Samuel Pitoiset's avatar
      radv: fix emitting the TCS regs on GFX9 · 5917761e
      Samuel Pitoiset authored
      The primitive ID is NULL and this generates an invalid
      select instruction which crashes because one operand is NULL.
      
      This fixes crashes in The Long Journey Home, Quantum Break
      and Just Cause 3 with DXVK.
      
      Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106756
      CC: <mesa-stable@lists.freedesktop.org>
      Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
      Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
      5917761e
    • Ian Romanick's avatar
      nir: Document a couple instances of parent_instr · 355868db
      Ian Romanick authored
      nir_ssa_def::parent_instr and nir_src::parent_instr have the same name,
      but they mean really different things.  I choose to save the next person
      the hour+ that I just spent figuring that out.  Even now that I know, I
      doubt I'd notice in code review that someone typed foo->parent_instr
      when they actually meant foo->ssa->parent_instr.
      
      v2: Minor wording tweak in nir_ssa_def::parent_instr.  Suggested by
      Jason.
      Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
      Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
      355868db
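The distinction drawn in this commit message can be pictured with a small Python model. It is only a sketch with hypothetical classes (`Instr`, `SsaDef`, `Src`) that mirror the field names from the message; it assumes that the field on the SSA definition points at the instruction producing the value, while the field on the source points at the instruction consuming it.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Instr:
    name: str


@dataclass(eq=False)
class SsaDef:
    parent_instr: Instr  # assumed meaning: instruction that produces the value


@dataclass(eq=False)
class Src:
    ssa: SsaDef
    parent_instr: Instr  # assumed meaning: instruction that consumes the value


producer = Instr("fadd")
consumer = Instr("fmul")

value = SsaDef(parent_instr=producer)
foo = Src(ssa=value, parent_instr=consumer)

# Same field name, different instruction.
print(foo.parent_instr.name, foo.ssa.parent_instr.name)
```

In this model `foo.parent_instr` and `foo.ssa.parent_instr` are different objects, which is the mix-up the message warns is easy to miss in code review.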
    • Ian Romanick's avatar
      i965/fs: Propagate conditional modifiers from not instructions · 4467040c
      Ian Romanick authored
      Skylake
      total instructions in shared programs: 14399081 -> 14399010 (<.01%)
      instructions in affected programs: 26961 -> 26890 (-0.26%)
      helped: 57
      HURT: 0
      helped stats (abs) min: 1 max: 6 x̄: 1.25 x̃: 1
      helped stats (rel) min: 0.16% max: 0.80% x̄: 0.30% x̃: 0.18%
      95% mean confidence interval for instructions value: -1.50 -0.99
      95% mean confidence interval for instructions %-change: -0.35% -0.25%
      Instructions are helped.
      
      total cycles in shared programs: 532978307 -> 532976050 (<.01%)
      cycles in affected programs: 468629 -> 466372 (-0.48%)
      helped: 33
      HURT: 20
      helped stats (abs) min: 3 max: 360 x̄: 116.52 x̃: 98
      helped stats (rel) min: 0.06% max: 3.63% x̄: 1.66% x̃: 1.27%
      HURT stats (abs)   min: 2 max: 172 x̄: 79.40 x̃: 43
      HURT stats (rel)   min: 0.04% max: 3.02% x̄: 1.48% x̃: 0.44%
      95% mean confidence interval for cycles value: -81.29 -3.88
      95% mean confidence interval for cycles %-change: -1.07% 0.12%
      Inconclusive result (%-change mean confidence interval includes 0).
      
      All Gen6+ platforms, except Ivy Bridge, had similar results. (Haswell shown)
      total instructions in shared programs: 12973897 -> 12973838 (<.01%)
      instructions in affected programs: 25970 -> 25911 (-0.23%)
      helped: 55
      HURT: 0
      helped stats (abs) min: 1 max: 2 x̄: 1.07 x̃: 1
      helped stats (rel) min: 0.16% max: 0.62% x̄: 0.28% x̃: 0.18%
      95% mean confidence interval for instructions value: -1.14 -1.00
      95% mean confidence interval for instructions %-change: -0.32% -0.24%
      Instructions are helped.
      
      total cycles in shared programs: 410355841 -> 410352067 (<.01%)
      cycles in affected programs: 578454 -> 574680 (-0.65%)
      helped: 47
      HURT: 5
      helped stats (abs) min: 3 max: 360 x̄: 85.74 x̃: 18
      helped stats (rel) min: 0.05% max: 3.68% x̄: 1.18% x̃: 0.38%
      HURT stats (abs)   min: 2 max: 242 x̄: 51.20 x̃: 4
      HURT stats (rel)   min: <.01% max: 0.45% x̄: 0.15% x̃: 0.11%
      95% mean confidence interval for cycles value: -104.89 -40.27
      95% mean confidence interval for cycles %-change: -1.45% -0.66%
      Cycles are helped.
      
      Ivy Bridge
      total instructions in shared programs: 11679351 -> 11679301 (<.01%)
      instructions in affected programs: 28208 -> 28158 (-0.18%)
      helped: 50
      HURT: 0
      helped stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1
      helped stats (rel) min: 0.12% max: 0.54% x̄: 0.23% x̃: 0.16%
      95% mean confidence interval for instructions value: -1.00 -1.00
      95% mean confidence interval for instructions %-change: -0.27% -0.19%
      Instructions are helped.
      
      total cycles in shared programs: 257445362 -> 257444662 (<.01%)
      cycles in affected programs: 419338 -> 418638 (-0.17%)
      helped: 40
      HURT: 3
      helped stats (abs) min: 1 max: 170 x̄: 65.05 x̃: 24
      helped stats (rel) min: 0.02% max: 3.51% x̄: 1.26% x̃: 0.41%
      HURT stats (abs)   min: 2 max: 1588 x̄: 634.00 x̃: 312
      HURT stats (rel)   min: 0.05% max: 2.97% x̄: 1.21% x̃: 0.62%
      95% mean confidence interval for cycles value: -97.96 65.41
      95% mean confidence interval for cycles %-change: -1.56% -0.62%
      Inconclusive result (value mean confidence interval includes 0).
      
      No changes on Iron Lake or GM45.
      
      v2: Move 'if (cond != BRW_CONDITIONAL_Z && cond != BRW_CONDITIONAL_NZ)'
      check outside the loop.  Suggested by Iago.
      Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
      4467040c
    • Ian Romanick's avatar
      i965/vec4: Optimize OR with 0 into a MOV · 22f9fbc0
      Ian Romanick authored
      All of the affected shaders are geometry shaders... the same ones from
      the similar fs changes.
      
      The "No changes on any other platforms" comment below is not quite
      right.  Without the previous change to register coalescing, this
      optimization caused quite a few regressions in tests that either used
      gl_ClipVertex or used different interpolation modes.  I observed that
      with both patches applied,
      glsl-1.10/execution/interpolation/interpolation-none-gl_BackSecondaryColor-smooth-vertex.shader_test
      was one instruction shorter.  I suspect other shaders would be similarly
      affected.  Since this is all based on NOS, shader-db does not reflect
      it.
      
      Haswell
      total instructions in shared programs: 12954955 -> 12954918 (<.01%)
      instructions in affected programs: 3603 -> 3566 (-1.03%)
      helped: 37
      HURT: 0
      helped stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1
      helped stats (rel) min: 0.21% max: 2.50% x̄: 1.99% x̃: 2.50%
      95% mean confidence interval for instructions value: -1.00 -1.00
      95% mean confidence interval for instructions %-change: -2.30% -1.69%
      Instructions are helped.
      
      total cycles in shared programs: 410012108 -> 410012098 (<.01%)
      cycles in affected programs: 3540 -> 3530 (-0.28%)
      helped: 5
      HURT: 0
      helped stats (abs) min: 2 max: 2 x̄: 2.00 x̃: 2
      helped stats (rel) min: 0.28% max: 0.28% x̄: 0.28% x̃: 0.28%
      95% mean confidence interval for cycles value: -2.00 -2.00
      95% mean confidence interval for cycles %-change: -0.28% -0.28%
      Cycles are helped.
      
      Ivy Bridge
      total instructions in shared programs: 11679387 -> 11679351 (<.01%)
      instructions in affected programs: 3292 -> 3256 (-1.09%)
      helped: 36
      HURT: 0
      helped stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1
      helped stats (rel) min: 0.21% max: 2.50% x̄: 2.04% x̃: 2.50%
      95% mean confidence interval for instructions value: -1.00 -1.00
      95% mean confidence interval for instructions %-change: -2.34% -1.74%
      Instructions are helped.
      
      No changes on any other platforms.
      Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
      22f9fbc0
    • Ian Romanick's avatar
      i965/vec4: Don't register coalesce into source of VS_OPCODE_UNPACK_FLAGS_SIMD4X2 · e6a9bd97
      Ian Romanick authored
      This prevents regressions in a bunch of clipping and interpolation tests
      caused by the next patch (i965/vec4: Optimize OR with 0 into a MOV).
      Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
      e6a9bd97
    • Ian Romanick's avatar
      i965/fs: Optimize OR with 0 into a MOV · 284b563f
      Ian Romanick authored
      fs_visitor::set_gs_stream_control_data_bits generates some code like
      "control_data_bits | stream_id << ((2 * (vertex_count - 1)) % 32)" as
      part of EmitVertex.  The first time this (dynamically) occurs in the
      shader, control_data_bits is zero.  Many times we can determine this
      statically and various optimizations will collaborate to make one of the
      OR operands literal zero.
      
      Converting the OR to a MOV usually allows it to be copy-propagated away.
      However, this does not happen in at least some shaders (in the assembly
      output of shaders/closed/UnrealEngine4/EffectsCaveDemo/301.shader_test,
      search for shl).
      
      All of the affected shaders are geometry shaders.
      
      Broadwell and Skylake had similar results. (Skylake shown)
      total instructions in shared programs: 14375452 -> 14375413 (<.01%)
      instructions in affected programs: 6422 -> 6383 (-0.61%)
      helped: 39
      HURT: 0
      helped stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1
      helped stats (rel) min: 0.14% max: 2.56% x̄: 1.91% x̃: 2.56%
      95% mean confidence interval for instructions value: -1.00 -1.00
      95% mean confidence interval for instructions %-change: -2.26% -1.57%
      Instructions are helped.
      
      total cycles in shared programs: 531981179 -> 531980555 (<.01%)
      cycles in affected programs: 27493 -> 26869 (-2.27%)
      helped: 39
      HURT: 0
      helped stats (abs) min: 16 max: 16 x̄: 16.00 x̃: 16
      helped stats (rel) min: 0.60% max: 7.92% x̄: 5.94% x̃: 7.92%
      95% mean confidence interval for cycles value: -16.00 -16.00
      95% mean confidence interval for cycles %-change: -6.98% -4.90%
      Cycles are helped.
      
      No changes on earlier platforms.
      Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
      284b563f
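The rewrite described in this commit message relies on the identity `0 | x == x`. The Python sketch below shows the idea on a toy instruction format; it is not the i965 optimizer. Instructions are assumed to be tuples of `(opcode, dst, src0, src1)`, a literal zero is the integer `0`, and `or_zero_to_mov` is a hypothetical name.

```python
def or_zero_to_mov(instr):
    """Rewrite an OR with a literal zero operand into a MOV of the other one."""
    op, dst, *srcs = instr
    if op == "or" and len(srcs) == 2:
        a, b = srcs
        if isinstance(a, int) and a == 0:
            return ("mov", dst, b)
        if isinstance(b, int) and b == 0:
            return ("mov", dst, a)
    return instr


# The expression quoted in the commit message, evaluated for the first
# emitted vertex, when control_data_bits is still zero.
control_data_bits = 0
stream_id = 3
vertex_count = 1
shifted = stream_id << ((2 * (vertex_count - 1)) % 32)
first_result = control_data_bits | shifted

print(or_zero_to_mov(("or", "r2", 0, "r1")))
print(or_zero_to_mov(("or", "r2", "r0", "r1")))
```

Once the OR has become a MOV, a later copy-propagation step has a chance to remove it, as the message notes.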
  3. 15 Jun, 2018 15 commits