1. 13 Feb, 2020 1 commit
    • i965: enable INTEL_blackhole_render · 5d7e9edb
      Lionel Landwerlin authored
      
      
      v2: condition the extension on context isolation support from the
          kernel (Chris)
      
      v3: (Lionel)
      
        The initial version of this change used a feature of the Gen7+
        command parser to turn the primitive instructions into no-ops.
        Unfortunately this doesn't play well with how we use the
        hardware outside of user-submitted commands. Resolves, for
        example, are implicit operations that must not be turned into
        no-ops, because they may belong to commands submitted before
        blackhole_render was enabled. Consider this sequence:
      
             glClear();
             glEnable(GL_BLACKHOLE_RENDER_INTEL);
             glDrawArrays(...);
             glReadPixels(...);
             glDisable(GL_BLACKHOLE_RENDER_INTEL);
      
        Even though the clear was emitted outside the blackhole render
        section, it should still be resolved properly for the read
        pixels. Hence we need to be more selective and only disable
        user-submitted commands.
      
        This v3 manually turns primitives into MI_NOOP if blackhole
        render is enabled (see the sketch after these version notes).
        This lets us enable this feature on any platform.
      
      v4: Limit support to gen7.5+ (Lionel)
      
      v5: Enable Gen7.5 support again, requires a kernel update of the
          command parser (Lionel)
      
      v6: Disable Gen7.5 again... Kernel devs want these patches landed
          before they accept the kernel patches to whitelist INSTPM (Lionel)
      
    v7: Simplify the change by never holding noop (there was a
        shortcoming in the test: it did not consider fast clears).
        Only program the register using MI_LRI (Lionel)
      
      v8: Switch to software managed blackhole (BDW hangs on compute batches...)
      
      v9: Simplify the noop state tracking (Lionel)
      
      v10: Don't modify flush function (Ken)
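
    A rough sketch of the software-managed approach (function and field
    names here are illustrative, not the actual Mesa ones):

           #include <stdbool.h>
           #include <stdint.h>

           struct brw_context { bool frontend_noop; };  /* illustrative */

           /* Rewrite a just-emitted, user-submitted primitive to MI_NOOP
            * when blackhole render is enabled.  Implicit operations such
            * as resolves never pass through this path, so they still
            * execute. */
           static void
           noop_primitive_if_blackhole(struct brw_context *brw,
                                       uint32_t *prim, int dword_count)
           {
              if (!brw->frontend_noop)
                 return;
              for (int i = 0; i < dword_count; i++)
                 prim[i] = 0;  /* MI_NOOP is an all-zero DWord */
           }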
      Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
      Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> (v8)
      Part-of: <mesa/mesa!2964>
  2. 20 Nov, 2018 1 commit
  3. 01 Nov, 2018 1 commit
  4. 30 Oct, 2018 1 commit
  5. 10 Sep, 2018 1 commit
  6. 05 Jun, 2018 1 commit
    • i965: Prepare batchbuffer module for softpin support. · 1c9053d0
      Kenneth Graunke authored
      
      
      If EXEC_OBJECT_PINNED is set, we don't want to emit any relocations.
      We simply want to add the BO to the validation list, and possibly mark
      it as writeable.  The new brw_use_pinned_bo() interface does just that.
      
      To avoid having to make every caller consider both the relocation and
      softpin cases, we make emit_reloc() call brw_use_pinned_bo() when given
      a softpinned buffer.
      
      We also can't grow buffers that are softpinned - the mechanism places a
      larger BO at the same offset as the original, which requires moving BOs
      around in the VMA.  With softpin, we only allocate enough VMA for the
      original size of the BO.
      
      v2: Assert that BOs aren't pinned if the kernel says we should move them
          (feedback from Chris Wilson)
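
      Condensed, the fast path described above looks roughly like this
      (a sketch with toy types and simplified signatures; only
      EXEC_OBJECT_PINNED is the real kernel flag):

         #include <stdbool.h>
         #include <stdint.h>
         #include <drm/i915_drm.h>   /* EXEC_OBJECT_PINNED */

         struct brw_bo { uint64_t gtt_offset; unsigned kflags; };

         extern void use_pinned_bo(struct brw_bo *bo, bool writable);
         extern uint64_t emit_relocation(struct brw_bo *target,
                                         uint64_t target_offset,
                                         bool writable);

         uint64_t
         emit_reloc(struct brw_bo *target, uint64_t target_offset,
                    bool writable)
         {
            if (target->kflags & EXEC_OBJECT_PINNED) {
               /* Softpinned: no relocation entry is emitted; the BO
                * just joins the validation list (marked writable if
                * needed), and its final address is already known. */
               use_pinned_bo(target, writable);
               return target->gtt_offset + target_offset;
            }
            return emit_relocation(target, target_offset, writable);
         }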
      Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
  7. 22 May, 2018 1 commit
  8. 01 Mar, 2018 1 commit
    • i965: Allow 48-bit addressing on Gen8+. · cee9f389
      Kenneth Graunke authored
      
      
      This allows most GPU objects to use the full 48-bit address space
      offered by Gen8+ platforms, rather than being stuck with 32-bit.
      This expands the available GPU memory from 4G to 256TB or so.
      
      A few objects - instruction, scratch, and vertex buffers - need to
      remain pinned in the low 4GB of the address space for various reasons.
      We default everything to 48-bit but disable it in those cases.
      
      Thanks to Jason Ekstrand for blazing this trail in anv first and
      finding the nasty undocumented hardware issues.  This patch simply
      rips off all of his findings.
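
      A sketch of the mechanism (EXEC_OBJECT_SUPPORTS_48B_ADDRESS is the
      real kernel flag; the helper name is illustrative):

         #include <stdbool.h>
         #include <drm/i915_drm.h>

         /* Default to the full 48-bit VMA, then clear the flag for BOs
          * that must stay below 4GB (instruction, scratch, and vertex
          * buffers). */
         static void
         set_48b_address_support(struct drm_i915_gem_exec_object2 *obj,
                                 bool must_stay_in_low_4gb)
         {
            if (must_stay_in_low_4gb)
               obj->flags &= ~EXEC_OBJECT_SUPPORTS_48B_ADDRESS;
            else
               obj->flags |= EXEC_OBJECT_SUPPORTS_48B_ADDRESS;
         }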
      Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
      Acked-by: Jason Ekstrand <jason@jlekstrand.net>
  9. 07 Jan, 2018 1 commit
  10. 30 Nov, 2017 1 commit
  11. 29 Nov, 2017 1 commit
  12. 18 Sep, 2017 1 commit
  13. 14 Sep, 2017 3 commits
    • i965: Delete BATCH_RESERVED handling. · 2c46a67b
      Kenneth Graunke authored
      
      
      Now that we can grow the batchbuffer if we absolutely need the extra
      space, we don't need to reserve space for the final do-or-die ending
      commands.
      Reviewed-by: Matt Turner <mattst88@gmail.com>
      Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    • i965: Use a separate state buffer, but avoid changing flushing behavior. · 78c404f1
      Kenneth Graunke authored
      
      
      Previously, we emitted GPU commands and indirect state into the same
      buffer, using a stack/heap like system where we filled in commands from
      the start of the buffer, and state from the end of the buffer.  We then
      flushed before the two met in the middle.
      
      Meeting in the middle is fatal, so you have to be certain that you
      reserve the correct amount of space before emitting commands or state
      for a draw.  Currently, we will assert !no_batch_wrap and die if the
      estimate is ever too small.  This has been mercifully obscure, but has
      happened on a number of occasions, and could in theory happen to any
      application that issues a large draw at just the wrong time.
      
      Estimating the amount of batch space required is painful - it's hard to
      get right, and getting it right involves a lot of code that would burn
      CPU time, and also be painful to maintain.  Rolling back to a saved
      state and retrying is also painful - failing to save/restore all the
      required state will break things, and redoing state emission burns a
      lot of CPU.  memcpy'ing to a new batch and continuing is painful,
      because commands we issue for a draw depend on earlier commands as well
      (such as STATE_BASE_ADDRESS, or the GPU being in a particular state).
      
      The best plan is to never run out of space, which is totally doable but
      pretty wasteful - a pessimal draw requires a huge amount of space, and
      rarely occurs.  Instead, we'd like to grow the batch buffer if we need
      more space and can't safely flush.
      
      We can't grow with a meet in the middle approach - we'd have to move the
      state to the end, which would mean updating every offset from dynamic
      state base address.  Using separate batch and state buffers, where both
      fill starting at the beginning, makes it easy to grow either as needed.
      
      This patch separates the two concepts.  We create a separate state
      buffer, with a second relocation list, and use that for brw_state_batch.
      
      However, this patch tries to retain the original flushing behavior - it
      adds the amount of batch and state space together, as if they were still
      co-existing in a single buffer.  The hope is to flush at the same time
      as before.  This is necessary to avoid provoking bugs caused by broken
      batch wrap handling (which we'll fix shortly).  It also avoids suddenly
      increasing the size of the batch (due to state not taking up space),
      which could have a significant performance impact.  We'll tune it later.
      
      v2:
      - Mark the statebuffer with EXEC_OBJECT_CAPTURE when supported (caught
        by Chris).  Unfortunately, we lose the ability to capture state data
        on older kernels.
      - Continue to support the malloc'd shadow buffers.
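
      The preserved flush heuristic amounts to something like this (a
      sketch; field names and the BATCH_SZ value are illustrative):

         #include <stdbool.h>

         #define BATCH_SZ (8192 * 4)   /* the former single-buffer size */

         struct batch_sizes { unsigned batch_used, state_used; };

         /* Flush as if commands and state still shared one buffer:
          * compare the sum of both usages against the old size. */
         static bool
         needs_flush(const struct batch_sizes *b, unsigned bytes_needed)
         {
            return b->batch_used + b->state_used + bytes_needed > BATCH_SZ;
         }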
      Reviewed-by: Matt Turner <mattst88@gmail.com>
      Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    • i965: Split brw_emit_reloc into brw_batch_reloc and brw_state_reloc. · e7232559
      Kenneth Graunke authored
      
      
      brw_batch_reloc emits a relocation from the batchbuffer to elsewhere.
      brw_state_reloc emits a relocation from the statebuffer to elsewhere.
      
      For now, they do the same thing, but when we actually split the two
      buffers, we'll change brw_state_reloc to use the state buffer.
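
      In sketch form (parameters simplified, and assuming a shared
      internal emit_reloc() helper), the two wrappers are identical for
      now:

         uint64_t
         brw_batch_reloc(struct intel_batchbuffer *batch, uint32_t offset,
                         struct brw_bo *target, uint64_t target_offset,
                         unsigned flags)
         {
            return emit_reloc(batch, offset, target, target_offset, flags);
         }

         uint64_t
         brw_state_reloc(struct intel_batchbuffer *batch, uint32_t offset,
                         struct brw_bo *target, uint64_t target_offset,
                         unsigned flags)
         {
            /* Identical today; once the buffers are split, this will
             * emit into the state buffer's relocation list instead. */
            return emit_reloc(batch, offset, target, target_offset, flags);
         }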
      Reviewed-by: Matt Turner <mattst88@gmail.com>
      Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
  14. 12 Aug, 2017 1 commit
  15. 04 Aug, 2017 1 commit
    • i965: Reduce passing 2x32b of reloc_domains to 2 bits · 6c530ad1
      Chris Wilson authored
      
      
      The kernel only cares about whether the object is to be written to
      or not; it reduces (reloc.read_domains, reloc.write_domain) down to
      just !!reloc.write_domain. When we use NO_RELOC, the kernel doesn't
      even read those relocs; instead userspace has to pass that
      information in execobject.flags. We can simplify our reloc API by
      removing the unused read/write domains and passing only the
      resultant flags.
      
      The caveat to the above is when we need to make the kernel aware
      that certain objects require different workarounds. Previously,
      this was done using the magic (INSTRUCTION, INSTRUCTION) reloc
      domains. NO_RELOC requires this to be passed in the execobject
      flags as well, and now we push that up the call stack.
      
      The API is more compact, more expressive of what happens underneath, but
      unfortunately requires more knowledge of the system at the point of use.
      Conversely it also means that knowledge is specific and not generally
      applied and so not overused.
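
      The reduced API looks roughly like this (a sketch; the RELOC_*
      aliases mirror the description above, so treat the exact names as
      illustrative):

         #include <drm/i915_drm.h>

         /* Two bits instead of 2x32b of domains. */
         #define RELOC_WRITE       EXEC_OBJECT_WRITE
         #define RELOC_NEEDS_GGTT  EXEC_OBJECT_NEEDS_GTT /* replaces the
                               magic (INSTRUCTION, INSTRUCTION) domains */

         uint64_t brw_emit_reloc(struct intel_batchbuffer *batch,
                                 uint32_t batch_offset,
                                 struct brw_bo *target,
                                 uint64_t target_offset,
                                 unsigned flags /* RELOC_* bits */);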
      
         text	   data	    bss	    dec	    hex	filename
      8502991	 356912	 424944	9284847	 8dacef	lib/i965_dri.so (before)
      8500455	 356912	 424944	9282311	 8da307	lib/i965_dri.so (after)
      
      v2: (by Ken) Rebase.
      Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
  16. 18 Jul, 2017 1 commit
  17. 10 Apr, 2017 7 commits
    • i965/drm: Rename drm_bacon_bo to brw_bo. · d30a9273
      Kenneth Graunke authored
      
      
      The bacon is all gone.
      
      This renames both the class and the related functions.  We're about to
      run indent on the bufmgr code, so no need to worry about fixing bad
      indentation.
      Acked-by: Jason Ekstrand <jason@jlekstrand.net>
    • i965/drm: Rename drm_bacon_bufmgr to struct brw_bufmgr. · 662a733d
      Kenneth Graunke authored
      
      
      Also stop using typedefs, per Mesa coding style.
      Acked-by: Jason Ekstrand <jason@jlekstrand.net>
    • i965/drm: Rewrite relocation handling. · eb41aa82
      Kenneth Graunke authored
      
      
      The execbuf2 kernel API requires us to construct two kinds of lists.
      First is a "validation list" (struct drm_i915_gem_exec_object2[])
      containing each BO referenced by the batch.  (The batch buffer itself
      must be the last entry in this list.)  Each validation list entry
      contains a pointer to the second kind of list: a relocation list.
      The relocation list contains information about pointers to BOs that
      the kernel may need to patch up if it relocates objects within the VMA.
      
      This is a very general mechanism, allowing every BO to contain pointers
      to other BOs.  libdrm_intel models this by giving each drm_intel_bo a
      list of relocations to other BOs.  Together, these form "reloc trees".
      
      Processing relocations involves a depth-first-search of the relocation
      trees, starting from the batch buffer.  Care has to be taken not to
      double-visit buffers.  Creating the validation list has to be deferred
      until the last minute, after all relocations are emitted, so we have the
      full tree present.  Calculating the amount of aperture space required to
      pin those BOs also involves tree walking, which is expensive, so libdrm
      has hacks to try and perform less expensive estimates.
      
      For some reason, it also stored the validation list in the global
      (per-screen) bufmgr structure, rather than as a local variable in the
      execbuffer function, requiring locking for no good reason.
      
      It also assumed that the batch would probably contain a relocation
      every 2 DWords - which is absurdly high - and simply aborted if there
      were more relocations than the max.  This meant the first relocation
      from a BO would allocate 180kB of data structures!
      
      This is way too complicated for our needs.  i965 only emits relocations
      from the batchbuffer - all GPU commands and state such as SURFACE_STATE
      live in the batch BO.  No other buffer uses relocations.  This means we
      can have a single relocation list for the batchbuffer.  We can add a BO
      to the validation list (set) the first time we emit a relocation to it.
      We can easily keep a running tally of the aperture space required for
      that list by adding the BO size when we add it to the validation list.
      
      This patch overhauls the relocation system to do exactly that.  There
      are many nice benefits:
      
      - We have a flat relocation list instead of trees.
      - We can produce the validation list up front.
      - We can allocate smaller arrays and dynamically grow them.
      - Aperture space checks are now (a + b <= c) instead of a tree walk.
      - brw_batch_references() is a trivial validation list walk.
        It should be straightforward to make it O(1) in the future.
      - We don't need to bloat each drm_bacon_bo with 32B of reloc data.
      - We don't need to lock in execbuffer, as the data structures are
        context-local, and not per-screen.
      - Significantly less code and a better match for what we're doing.
      - The simpler system should make it easier to take advantage of
        I915_EXEC_NO_RELOC in a future patch.
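
      A simplified sketch of that scheme (type and helper names are
      modeled on the description above, not copied from the code):

         #include <stdint.h>
         #include <drm/i915_drm.h>

         struct brw_bo { uint32_t gem_handle; uint64_t size; int index; };

         struct batch {
            struct drm_i915_gem_exec_object2 *validation_list;
            struct brw_bo **exec_bos;
            int exec_count, exec_array_size;
            uint64_t aperture_space;
         };

         extern void grow_exec_arrays(struct batch *batch);

         /* Add a BO to the validation list the first time a relocation
          * targets it, keeping a running aperture tally. */
         static unsigned
         add_exec_bo(struct batch *batch, struct brw_bo *bo)
         {
            if (bo->index != -1 && batch->exec_bos[bo->index] == bo)
               return bo->index;                 /* already on the list */

            if (batch->exec_count == batch->exec_array_size)
               grow_exec_arrays(batch);          /* grow dynamically */

            batch->validation_list[batch->exec_count] =
               (struct drm_i915_gem_exec_object2) { .handle = bo->gem_handle };
            batch->exec_bos[batch->exec_count] = bo;
            batch->aperture_space += bo->size;   /* the "a + b <= c" input */

            bo->index = batch->exec_count;
            return batch->exec_count++;
         }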
      
      Improves performance in Synmark 7.0's OglBatch7:
      
          - Skylake GT4e: 12.1499% +/- 2.29531%  (n=130)
          - Apollolake:   3.89245% +/- 0.598945% (n=35)
      
      Improves performance in GFXBench4's gl_driver2 test:
      
          - Skylake GT4e: 3.18616% +/- 0.867791% (n=229)
          - Apollolake:   4.1776%  +/- 0.240847% (n=120)
      
      v2: Feedback from Chris Wilson:
          - Omit explicit zero initializers for garbage execbuf fields.
          - Use .rsvd1 = ctx_id rather than i915_execbuffer2_set_context_id
          - Drop unnecessary fencing assertions.
          - Only use _WR variant of execbuf ioctl when necessary.
          - Shrink the arrays to be smaller by default.
      Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
    • i965: Make/use a brw_batch_references() wrapper. · 6079f4f1
      Kenneth Graunke authored
      
      
      We'll want to change the implementation of this shortly.
      Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
      Acked-by: Jason Ekstrand <jason@jlekstrand.net>
    • i965: Use brw_emit_reloc() instead of drm_bacon_bo_emit_reloc(). · 6537a3ca
      Kenneth Graunke authored
      
      
      I'm about to make brw_emit_reloc do actual work, so everybody needs
      to start using it and not the raw drm_bacon function.
      Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
      Acked-by: Jason Ekstrand <jason@jlekstrand.net>
    • i965: Change intel_batchbuffer_reloc() into brw_emit_reloc(). · eadd5d1b
      Kenneth Graunke authored
      
      
      This renames intel_batchbuffer_reloc to brw_emit_reloc and changes the
      parameter naming and ordering to match drm_intel_bo_emit_reloc().
      
      For now, it's a trivial wrapper that accesses batch->bo.  When we
      rework relocations, it will start doing actual work.
      
      target_offset should be expanded to a uint64_t to match the kernel,
      but for now we leave it as its original 32-bit type.
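
      The resulting signature roughly mirrors drm_intel_bo_emit_reloc()
      (a sketch; the exact types at this point in history may differ):

         uint32_t brw_emit_reloc(struct intel_batchbuffer *batch,
                                 uint32_t batch_offset,
                                 drm_bacon_bo *target,
                                 uint32_t target_offset, /* still 32-bit */
                                 uint32_t read_domains,
                                 uint32_t write_domain);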
      Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
      Acked-by: Jason Ekstrand <jason@jlekstrand.net>
    • i965/drm: Use our internal libdrm (drm_bacon) rather than the real one. · eed86b97
      Kenneth Graunke authored
      
      
      Now we can actually test our changes.
      Acked-by: Jason Ekstrand <jason@jlekstrand.net>
  18. 30 Mar, 2017 2 commits
  19. 09 Mar, 2017 1 commit
  20. 27 Jan, 2017 1 commit
  21. 04 Jan, 2017 4 commits
  22. 19 Aug, 2016 1 commit
  23. 07 Jul, 2016 1 commit
  24. 30 Mar, 2016 1 commit
  25. 09 Dec, 2015 1 commit
  26. 11 Sep, 2015 1 commit
  27. 15 Jul, 2015 2 commits
    • i965: Optimize batchbuffer macros. · f11c6f09
      Matt Turner authored
      Previously OUT_BATCH was just a macro around an inline function which
      does
      
         brw->batch.map[brw->batch.used++] = dword;
      
      When making consecutive calls to intel_batchbuffer_emit_dword() the
      compiler isn't able to recognize that we're writing consecutive memory
      locations or that it doesn't need to write batch.used back to memory
      each time.
      
      We can avoid both of these problems by making a local pointer to the
      next location in the batch in BEGIN_BATCH().
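
      A simplified sketch of the change (the real macros carry extra
      bookkeeping and assertions):

         /* BEGIN_BATCH opens a scope with a local write pointer;
          * OUT_BATCH becomes a plain pointer store; ADVANCE_BATCH
          * writes .used back exactly once. */
         #define BEGIN_BATCH(n) do {                                 \
            uint32_t *__map = &brw->batch.map[brw->batch.used];

         #define OUT_BATCH(d) (*__map++ = (d))

         #define ADVANCE_BATCH()                                     \
            brw->batch.used = __map - brw->batch.map;                \
         } while (0)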
      
      Cuts 18k from the .text size.
      
         text     data      bss      dec      hex  filename
      4946956   195152    26192  5168300   4edcac  i965_dri.so before
      4928956   195152    26192  5150300   4e965c  i965_dri.so after
      
      This series (including commit c0433948) improves performance of
      Synmark OglBatch7 by 8.01389% +/- 0.63922% (n=83) on Ivybridge.
      Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    • i965: Add and use USED_BATCH macro. · 131573df
      Matt Turner authored
      
      
      The next patch will replace the .used field with an on-demand
      calculation of batchbuffer usage.
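
      After that replacement, the macro can compute usage on demand from
      the running write pointer, roughly like this (a sketch):

         #define USED_BATCH(batch) \
            ((uintptr_t)((batch).map_next - (batch).map))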
      Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>