1. 01 Mar, 2018 6 commits
    • intel/fs: Set up sampler message headers in the visitor on gen7+ · ff472607
      Jason Ekstrand authored
      This gives the scheduler visibility into the headers which should
      improve scheduling.  More importantly, however, it lets the scheduler
      know that the header gets written.  As-is, the scheduler thinks that a
      texture instruction only reads its payload and is unaware that it may
      write to the first register, so it may reorder it with respect to a read
      from that register.  This is causing issues in a couple of Dota 2 vertex
      shaders.
      
      Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104923
      Cc: mesa-stable@lists.freedesktop.org
      Reviewed-by: Francisco Jerez <currojerez@riseup.net>
    • spirv/i965/anv: Relax push constant offset assertions being 32-bit aligned · 02266f9b
      José Casanova Crespo authored
      The introduction of 16-bit types with VK_KHR_16bit_storage implies that
      push constant offsets can be multiples of 2 bytes. The assertions are
      updated so that offsets only need to be a multiple of the base type's
      size, although even that cannot always be assumed, as doubles are not
      8-byte aligned in some cases.
      
      For 16-bit types, the push constant offset takes into account the
      internal offset within the 32-bit uniform bucket, adding 2 bytes when we
      access elements that are not 32-bit aligned. In all 32-bit aligned cases
      it is simply 0.
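
      A minimal standalone sketch of that arithmetic (the helper name is
      illustrative, not the driver's own code):

          #include <assert.h>
          #include <stdint.h>

          /* Byte offset of a 16-bit push constant inside its 32-bit
           * uniform bucket: 0 when 32-bit aligned, 2 otherwise. */
          static uint32_t
          offset_in_bucket(uint32_t offset_bytes, uint32_t type_size)
          {
              /* Offsets need only be aligned to the base type size. */
              assert(offset_bytes % type_size == 0);
              return offset_bytes % 4;
          }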
      
      v2: Assert offsets to be aligned to the dest type size. (Jason Ekstrand)
      Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
    • i965/fs: Support 16-bit store_ssbo with VK_KHR_relaxed_block_layout · 69be3a82
      José Casanova Crespo authored
      Restrict the use of untyped_surface_write with 16-bit pairs in
      SSBOs to the cases where we can guarantee that the offset is a
      multiple of 4.

      Since VK_KHR_relaxed_block_layout is available in ANV, we can only
      guarantee that when we have a constant offset that is a multiple
      of 4. For non-constant offsets we always use byte_scattered_write.
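
      In pseudo-C, the selection logic described above is roughly (names
      hypothetical, not the patch's code):

          #include <stdbool.h>
          #include <stdint.h>

          enum write_message { UNTYPED_SURFACE_WRITE, BYTE_SCATTERED_WRITE };

          /* A pair of 16-bit values may go out as one 32-bit untyped
           * write only when the offset is provably a multiple of 4; with
           * relaxed block layout that is only known for constant offsets. */
          static enum write_message
          pick_16bit_ssbo_write(bool offset_is_const, uint32_t const_offset)
          {
              if (offset_is_const && const_offset % 4 == 0)
                  return UNTYPED_SURFACE_WRITE;
              return BYTE_SCATTERED_WRITE;
          }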
      
      v2: (Jason Ekstrand)
          - Assert that offset_reg is a multiple of 4 if it is immediate.
      Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
    • i965/fs: Support 16-bit do_read_vector with VK_KHR_relaxed_block_layout · 8dd8be03
      José Casanova Crespo authored
      16-bit load_ubo/ssbo operations that call do_untyped_read_vector don't
      guarantee that offsets are multiples of 4 bytes, as required by the
      untyped_read message. This happens, for example, with f16mat3x3 when
      VK_KHR_relaxed_block_layout is enabled.

      Vector reads with non-constant offsets are implemented with multiple
      byte_scattered_read messages, which do not require 32-bit aligned offsets.

      For all constant offsets we can now use the untyped_read_surface message.
      In the case of constant offsets not aligned to 32 bits, we calculate a
      32-bit aligned start offset and use the shuffle_32bit_load_result_to_16bit_data
      function with the first_component parameter to skip copying the unneeded
      components.
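
      A sketch of the constant-offset arithmetic described above (the helper
      name is hypothetical):

          #include <stdint.h>

          /* Round an unaligned constant offset down to a dword boundary,
           * note which 16-bit component to keep first, and size the read
           * as end - start dwords. */
          static void
          plan_16bit_read(uint32_t offset_bytes, uint32_t num_components,
                          uint32_t *start, uint32_t *first_component,
                          uint32_t *num_dwords)
          {
              *start = offset_bytes & ~3u;
              *first_component = (offset_bytes - *start) / 2;
              uint32_t end = (offset_bytes + num_components * 2 + 3) & ~3u;
              *num_dwords = (end - *start) / 4;
          }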
      
      v2: (Jason Ekstrand)
          Always use untyped_read_surface messages when we have constant offsets.
      
      v3: (Jason Ekstrand)
          Simplify the loop for reads with non-constant offsets.
          Use end - start to calculate the number of 32-bit components to read with
          constant offsets.
      Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
    • i965/fs: shuffle_32bit_load_result_to_16bit_data now skips components · 2dd94f46
      José Casanova Crespo authored
      This helper, used to load 16-bit components from 32-bit reads, now allows
      skipping components via the new parameter first_component. The semantics
      are to skip components until first_component is reached, and then read the
      number of components passed to the function.

      All previous uses of the helper are updated to pass 0 as first_component.
      This allows reading 16-bit components when the first one is not 32-bit
      aligned, enabling more uses of untyped_read with 16-bit types.
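
      A simplified scalar model of the new semantics (the real helper works
      on SIMD registers; this is only an illustration):

          #include <stdint.h>

          /* Copy num_components 16-bit values out of a 32-bit read
           * result, skipping the first first_component halves. */
          static void
          shuffle_32bit_to_16bit(uint16_t *dst, const uint32_t *src,
                                 uint32_t first_component,
                                 uint32_t num_components)
          {
              for (uint32_t i = 0; i < num_components; i++) {
                  uint32_t c = first_component + i;
                  uint32_t dword = src[c / 2];
                  dst[i] = (c & 1) ? dword >> 16 : dword & 0xffff;
              }
          }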
      
      v2: (Jason Ekstrand)
          Change parameter order to first_component, num_components
      Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
    • isl/i965/fs: SSBO/UBO buffers need size padding if not multiple of 32-bit · 67d7dd59
      José Casanova Crespo authored
      The surfaces that back the GPU buffers have a boundary check that
      treats accesses to partial dwords as out-of-bounds. For example,
      buffers with 1 or 3 16-bit elements have size 2 or 6, and the last
      two bytes would always be read as 0 or have their writes ignored.

      The introduction of 16-bit types implies that we need to pad the
      size to a multiple of 4 bytes so that partial dwords can be
      read/written. Adding an unconditional +2 to the size of buffers
      whose size is not a multiple of 4 solves this issue for the general
      UBO and SSBO cases.
      
      But when unsized arrays of 16-bit elements are used, it is not possible
      to know whether the size was padded or not. To solve this issue, the
      implementation calculates the needed size of the buffer surfaces,
      as suggested by Jason:

      surface_size = isl_align(buffer_size, 4) +
                     (isl_align(buffer_size, 4) - buffer_size)

      So when we calculate the buffer_size backwards in the backend, we
      recover it from the resinfo return value with:

      buffer_size = (surface_size & ~3) - (surface_size & 3)

      This buffer requirement is also exposed when robust buffer access
      is enabled, so these buffer sizes are recommended to be multiples of 4.
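
      The round trip is easy to check in standalone C (isl_align is modeled
      here by a local align4 helper):

          #include <assert.h>
          #include <stdint.h>

          static uint32_t align4(uint32_t v) { return (v + 3) & ~3u; }

          static uint32_t surface_size(uint32_t buffer_size)
          {
              return align4(buffer_size) + (align4(buffer_size) - buffer_size);
          }

          static uint32_t recovered_size(uint32_t surf_size)
          {
              return (surf_size & ~3u) - (surf_size & 3u);
          }

          int main(void)
          {
              /* e.g. one 16-bit element: size 2 -> surface 6 -> size 2 */
              for (uint32_t size = 0; size < 64; size++)
                  assert(recovered_size(surface_size(size)) == size);
              return 0;
          }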
      
      v2: (Jason Ekstrand)
          Move padding logic from anv to isl_surface_state.
          Move the calculation of the original size from spirv to the driver
          backend.
      v3: (Jason Ekstrand)
          Rename some variables and use a similar expression when calculating
          the padding as when obtaining the original buffer size.
          Avoid use of an unnecessary component call at brw_fs_nir.
      v4: (Jason Ekstrand)
          Complete the comment with an explanation of the buffer size
          calculation in brw_fs_nir.
      Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
  2. 28 Feb, 2018 16 commits
  3. 27 Feb, 2018 2 commits
  4. 21 Feb, 2018 1 commit
  5. 14 Feb, 2018 2 commits
  6. 10 Feb, 2018 1 commit
  7. 06 Feb, 2018 2 commits
    • i965: remove unused brw_nir_lower_cs_shared() · ffeebcfa
      Timothy Arceri authored
      This has been unused since 8761a04d.
      Reviewed-by: Elie Tournier <elie.tournier@collabora.com>
    • i965/nir: do int64 lowering before optimization · 1d20001d
      Iago Toral authored
      Otherwise loop unrolling will fail to see the actual cost of
      unrolling when the loop body contains 64-bit integer
      instructions, especially when the divmod64 lowering applies,
      since that lowering is quite expensive.
      
      Without this change, some in-development CTS tests for int64
      get stuck forever trying to register allocate a shader with
      over 50K SSA values. The large number of SSA values is the result
      of NIR first unrolling multiple seemingly simple loops that involve
      int64 instructions, only to then lower these instructions to produce
      a massive pile of code (due to the divmod64 lowering in the unrolled
      instructions).
      
      With this change, loop unrolling will see the loops with the int64
      code already lowered and will realize that it is too expensive to
      unroll.
      
      v2: Run nir_algebraic first so we can hopefully get rid of some of
          the int64 instructions before we even attempt to lower them.
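
      The resulting pass ordering, sketched against NIR's C API (the exact
      set of lowering flags is an assumption, not quoted from the patch):

          #include "nir.h"

          static void
          lower_int64_before_unrolling(nir_shader *nir)
          {
              /* Algebraic opts may delete some int64 ops outright... */
              nir_opt_algebraic(nir);
              /* ...then lower the rest (divmod64 expands to a lot of
               * code), so loop unrolling later sees the true cost of
               * each loop body. */
              nir_lower_int64(nir, nir_lower_imul64 | nir_lower_divmod64);
          }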
      Reviewed-by: Matt Turner <mattst88@gmail.com>
  8. 05 Feb, 2018 1 commit
    • i965: Move mistakenly placed line · e2b31e9a
      Matt Turner authored
      Ken called this out in review, but it seems I forgot to make the change.
      I noticed that the control flow annotations in the fragment shader
      disassembly of tests/shaders/glsl-fs-loop-continue.shader_test were not
      correct, and moving this line to the correct place fixes it.
  9. 29 Jan, 2018 2 commits
    • nir: add vs_inputs_dual_locations compiler option · 5b8de4bd
      Timothy Arceri authored
      Allows NIR drivers to use either single or dual locations for
      vs double inputs.

      i965 uses dual locations for both its OpenGL and Vulkan drivers;
      for now, gallium OpenGL drivers only use a single location.
      
      The following patch will also make use of this option when
      calling nir_shader_gather_info().
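
      A sketch of how a driver opts in (assuming Mesa's nir.h; only the new
      field is shown):

          #include "nir.h"

          /* gallium-style: one location per 64-bit vertex input */
          static const nir_shader_compiler_options gallium_like_options = {
              .vs_inputs_dual_locations = false,
          };

          /* i965-style: dvec3/dvec4 inputs occupy two locations */
          static const nir_shader_compiler_options i965_like_options = {
              .vs_inputs_dual_locations = true,
          };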
      Reviewed-by: Karol Herbst <kherbst@redhat.com>
    • compiler: tidy up double_inputs_read uses · f63e05ae
      Timothy Arceri authored
      First we move double_inputs_read into a vs struct in the union;
      double_inputs_read is only used for vs inputs, so this will
      save space and also allows us to add a new double_inputs field.
      
      We add the new field because c2acf97f changed the behaviour
      of double_inputs_read, and while it's no longer used to track
      actual reads in i965, we do still want to track this for gallium
      drivers.
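
      An abridged model of the resulting layout (the real struct is
      shader_info; unrelated fields are omitted):

          #include <stdint.h>

          struct shader_info_sketch {
              union {
                  struct {
                      /* double inputs read, as before this change */
                      uint64_t double_inputs_read;
                      /* new field, tracked for gallium drivers */
                      uint64_t double_inputs;
                  } vs;
              };
          };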
      Reviewed-by: Marek Olšák <marek.olsak@amd.com>
  10. 26 Jan, 2018 1 commit
  11. 25 Jan, 2018 1 commit
    • i965/fs: Reset the register file to VGRF in lower_integer_multiplication · db682b8f
      Jason Ekstrand authored
      18fde36c changed the way temporary
      registers were allocated in lower_integer_multiplication so that we
      allocate regs_written(inst) space and keep the stride of the original
      destination register.  This was to ensure that any MUL which originally
      followed the CHV/BXT integer multiply regioning restrictions would
      continue to follow those restrictions even after lowering.  This works
      fine except that I forgot to reset the register file to VGRF so, even
      though they were assigned a number from alloc.allocate(), they had the
      wrong register file.  This caused some GLES 3.0 CTS tests to start
      failing on Sandy Bridge due to attempted reads from the MRF:
      
          ES3-CTS.functional.shaders.precision.int.highp_mul_fragment.snbm64
          ES3-CTS.functional.shaders.precision.int.mediump_mul_fragment.snbm64
          ES3-CTS.functional.shaders.precision.int.lowp_mul_fragment.snbm64
          ES3-CTS.functional.shaders.precision.uint.highp_mul_fragment.snbm64
          ES3-CTS.functional.shaders.precision.uint.mediump_mul_fragment.snbm64
          ES3-CTS.functional.shaders.precision.uint.lowp_mul_fragment.snbm64
      
      This commit remedies the problem: instead of copying inst->dst and
      overwriting nr, we now make a new register and set the region to match
      inst->dst.
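
      A stand-in model of the bug and the fix (the real type is the C++
      fs_reg; this toy struct only mirrors the relevant fields):

          enum reg_file { MRF, VGRF };

          struct reg {
              enum reg_file file;
              unsigned nr;      /* register number */
              unsigned stride;  /* region stride */
          };

          /* Buggy: copying dst and overwriting nr keeps the old file,
           * which may not be VGRF. */
          static struct reg temp_buggy(struct reg dst, unsigned new_nr)
          {
              struct reg tmp = dst;
              tmp.nr = new_nr;
              return tmp;
          }

          /* Fixed: build a fresh VGRF and copy only the region. */
          static struct reg temp_fixed(struct reg dst, unsigned new_nr)
          {
              struct reg tmp = { VGRF, new_nr, dst.stride };
              return tmp;
          }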
      
      Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103626
      Fixes: 18fde36c
      Cc: "17.3" <mesa-stable@lists.freedesktop.org>
      Reviewed-by: Matt Turner <mattst88@gmail.com>
  12. 22 Jan, 2018 1 commit
  13. 17 Jan, 2018 1 commit
    • intel/fs: Optimize and simplify the copy propagation dataflow logic. · 11674dad
      Francisco Jerez authored
      Previously the dataflow propagation algorithm would calculate the ACP
      live-in and -out sets in a two-pass fixed-point algorithm.  The first
      pass would update the live-out sets of all basic blocks of the program
      based on their live-in sets, while the second pass would update the
      live-in sets based on the live-out sets.  This is incredibly
      inefficient in the typical case where the CFG of the program is
      approximately acyclic, because it can take up to 2*n passes for an ACP
      entry introduced at the top of the program to reach the bottom (where
      n is the number of basic blocks in the program), until which point the
      algorithm won't be able to reach a fixed point.
      
      The same effect can be achieved in a single pass by computing the
      live-in and -out sets in lock-step, because that makes sure that
      processing of any basic block will pick up the updated live-out sets
      of the lexically preceding blocks.  This gives the dataflow
      propagation algorithm effectively O(n) run-time instead of O(n^2) in
      the acyclic case.
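
      A toy model of the lock-step update (bitsets stand in for the ACP
      data structures; per v2, live-out sets start as the universal set):

          #include <stdbool.h>
          #include <stdint.h>

          struct block {
              uint64_t livein, liveout;  /* one bit per ACP entry */
              uint64_t copy, kill;       /* entries created/clobbered here */
              int num_preds, preds[4];   /* predecessor block indices */
          };

          static void
          propagate(struct block *b, int n)
          {
              bool progress;
              do {
                  progress = false;
                  /* A single pass in lexical order updates live-in and
                   * live-out together, so each block sees the already
                   * updated live-out sets of the blocks preceding it. */
                  for (int i = 0; i < n; i++) {
                      uint64_t in = i == 0 ? 0 : ~0ull;
                      for (int p = 0; p < b[i].num_preds; p++)
                          in &= b[b[i].preds[p]].liveout;
                      uint64_t out = b[i].copy | (in & ~b[i].kill);
                      if (in != b[i].livein || out != b[i].liveout) {
                          b[i].livein = in;
                          b[i].liveout = out;
                          progress = true;
                      }
                  }
              } while (progress);
          }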
      
      The time spent in dataflow propagation is reduced by 30x in the
      GLES31.functional.ssbo.layout.random.all_shared_buffer.5 dEQP
      test-case on my CHV system (the improvement is likely to be of the
      same order of magnitude on other platforms).  This more than reverses
      an apparent run-time regression in this test-case from my previous
      copy-propagation undefined-value handling patch, which was ultimately
      caused by the additional work introduced in that commit to account for
      undefined values being multiplied by a huge quadratic factor.
      
      According to Chad this test was failing on CHV due to a 30s time-out
      imposed by the Android CTS (this was the case regardless of my
      undefined-value handling patch, even though my patch substantially
      exacerbated the issue).  On my CHV system this patch reduces the
      overall run-time of the test by approximately 12x, getting us to
      around 13s, well below the time-out.
      
      v2: Initialize live-out set to the universal set to avoid rather
          pessimistic dataflow estimation in shaders with cycles (Addresses
          performance regression reported by Eero in GpuTest Piano).
          Performance numbers given above still apply.  No shader-db changes
          with respect to master.
      
      Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104271
      Reported-by: Chad Versace <chadversary@chromium.org>
      Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
  14. 11 Jan, 2018 3 commits