Commits · ppir-liveness-fixes · Erico Nunes / mesa

Jan 20, 2020

lima/ppir: handle write to dead registers in ppir · d01ffcd9

Erico Nunes authored 5 years ago


nir can output writes to dead registers when expanding vec4 operations
to non-ssa registers. In that case, some components of the vec4 may be
assigned but never read. The ppir scheduler reorders instructions and
may place such an instruction writing to a dead register somewhere else
in the program.
In order to prevent regalloc from allocating a live register for this
operation, an interference must be assigned to it during liveness
analysis.

This workaround may be removed in the future if the assignments to dead
components can be removed earlier in ppir or nir.

Signed-off-by: Erico Nunes <nunes.erico@gmail.com>

d01ffcd9
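
The interference rule described in this commit can be illustrated with a generic backward liveness walk. This is only a sketch of the principle, not the ppir implementation: the `instr` struct, the register bitmasks and `compute_interference` are hypothetical names, and the code handles straight-line code only. The point it shows is that a destination which is never read afterwards must still interfere with every register that is live across the write.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_REGS 32

/* Hypothetical instruction: one optional destination, a bitmask of sources. */
struct instr {
   int dest;       /* register written, or -1 for none */
   uint32_t srcs;  /* bitmask of registers read */
};

/* interference[r] is a bitmask of registers that must not share a
 * physical register with r. Walks the instructions backwards. */
static void
compute_interference(const struct instr *instrs, int count,
                     uint32_t interference[MAX_REGS])
{
   uint32_t live = 0; /* nothing is live after the last instruction */

   memset(interference, 0, MAX_REGS * sizeof(uint32_t));

   for (int i = count - 1; i >= 0; i--) {
      const struct instr *ins = &instrs[i];

      if (ins->dest >= 0) {
         assert(ins->dest < MAX_REGS);
         uint32_t dest_bit = UINT32_C(1) << ins->dest;

         /* Even if dest is dead (not in `live`), the write still clobbers
          * a register, so it interferes with everything live across it. */
         uint32_t others = live & ~dest_bit;
         interference[ins->dest] |= others;
         for (int r = 0; r < MAX_REGS; r++)
            if (others & (UINT32_C(1) << r))
               interference[r] |= dest_bit;

         live &= ~dest_bit; /* the value does not exist before the write */
      }
      live |= ins->srcs;
   }
}
```

With a program such as `r0 = ...; r1 = ...; r2 = r0`, where `r1` is never read, the walk records that `r1` and `r0` interfere, so an allocator using this table cannot place the dead write in the register holding `r0`.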

lima/ppir: reserve register for undef operations · 621de4d8

Erico Nunes authored 5 years ago


Even though the value of undef operations doesn't really matter,
regalloc must ensure that a live register won't be used.
Otherwise, it may overwrite a live value and cause incorrect results.

Signed-off-by: Erico Nunes <nunes.erico@gmail.com>

621de4d8

lima/ppir: fix src read mask swizzling · cf730d18

Erico Nunes authored 5 years ago


The src mask can't be calculated from the dest write_mask.
Instead, it must be calculated from the swizzled operators of the src.
Otherwise, liveness calculation may report incorrect live components for
non-ssa registers.

Signed-off-by: Erico Nunes <nunes.erico@gmail.com>

cf730d18
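
A small sketch of why the source mask has to follow the swizzle: for something like `dst.x = src.z`, the write mask says `x`, but the component actually read is `z`. The helper below is hypothetical (the name `src_read_mask` and the 4-entry swizzle array are assumptions, not ppir's data structures) and simply collects the source components selected by the swizzle for each written destination component.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: swizzle[i] names the source component that feeds
 * destination component i. Returns the mask of source components read. */
static uint8_t
src_read_mask(uint8_t dest_write_mask, const uint8_t swizzle[4])
{
   uint8_t mask = 0;

   for (int i = 0; i < 4; i++) {
      if (!(dest_write_mask & (1u << i)))
         continue; /* dest component i is not written, so nothing is read */
      assert(swizzle[i] < 4);
      mask |= (uint8_t)(1u << swizzle[i]);
   }
   return mask;
}
```

Reusing the write mask directly would report `src.x` as live in the example above and miss `src.z`, which is the kind of incorrect live-component report the commit message describes.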

Jan 18, 2020

panfrost: Dynamically allocate shader variants · d8a3501f

Icecream95 authored 5 years ago


This fixes a crash in LZDoom where over 16 shader variants are needed
for a few shaders in some maps, and should also save a few kilobytes
of RAM as most of the time only one or two variants of the 8 previously
allocated are actually needed.

Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>

d8a3501f
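
The general technique here, replacing a fixed-size array with storage that grows on demand, might look like the following. This is a minimal sketch under assumed names (`variant`, `variant_list`, `variant_list_add`); it is not the panfrost code and makes no claim about how the driver actually sizes or frees its variants.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a compiled shader variant. */
struct variant {
   unsigned key;
};

struct variant_list {
   struct variant *items;
   unsigned count;
   unsigned capacity;
};

/* Appends a variant, growing the storage on demand. Returns the new
 * variant's index, or -1 if allocation fails (the list stays valid). */
static int
variant_list_add(struct variant_list *list, unsigned key)
{
   assert(list != NULL);

   if (list->count == list->capacity) {
      unsigned new_cap = list->capacity ? list->capacity * 2 : 1;
      struct variant *grown = realloc(list->items, new_cap * sizeof(*grown));
      if (!grown)
         return -1;
      list->items = grown;
      list->capacity = new_cap;
   }
   list->items[list->count].key = key;
   return (int)list->count++;
}
```

A zero-initialized `struct variant_list` starts empty, costs nothing until the first variant is added, and has no hard upper bound on the number of variants. The caller releases the storage with `free(list.items)`.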

panfrost: Expose some functionality with dEQP flag · bef716b5

Alyssa Rosenzweig authored 5 years ago


These features are stable enough that they don't need to be hidden.

Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Tested-by: Marge Bot <mesa/mesa!3464>
Part-of: <mesa/mesa!3464>

bef716b5

pan/midgard: Fix recursive csel scheduling · 4af8d5b0

Alyssa Rosenzweig authored 5 years ago


Corner case causing invalid scheduling on shaders with nested csels,
i.e. GLSL code resembling:

   (foo ? bool1 : bool2) ? x : y

By explicitly disallowing csels this is fixed.

Fixes INSTR_INVALID_ENC on a glamor shader (noticeable with slowdown and
visual corruption when scrolling "too far" on GTK apps).

Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Tested-by: Marge Bot <mesa/mesa!3463>
Part-of: <mesa/mesa!3463>

4af8d5b0

panfrost: Identify un/pack colour opcodes · 564a782f

Alyssa Rosenzweig authored 5 years ago


We still need to identify formats in the disassembler, but this will at
least get the opcode name clear.

Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Tested-by: Marge Bot <mesa/mesa!3462>
Part-of: <mesa/mesa!3462>

564a782f

pan/midgard: Bytemasks should round up, not round down · 13c32e5f

Alyssa Rosenzweig authored 5 years ago


Otherwise we'll lose components in DCE.

Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Part-of: <mesa/mesa!3462>

13c32e5f
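
To make the rounding direction concrete, here is a hedged sketch of collapsing a per-byte mask into a per-component mask. The function name and the 16-byte vector width are assumptions for illustration, not the Midgard compiler's helpers. Rounding up means a component is kept if any of its bytes is set.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: collapse a per-byte mask into a per-component mask.
 * A component counts as used if ANY of its bytes is set (rounding up);
 * requiring ALL bytes (rounding down) would drop partially used ones. */
static unsigned
bytemask_to_compmask(uint16_t bytemask, unsigned bytes_per_comp)
{
   assert(bytes_per_comp == 1 || bytes_per_comp == 2 ||
          bytes_per_comp == 4 || bytes_per_comp == 8);

   unsigned comps = 16 / bytes_per_comp;
   unsigned comp_bytes = (1u << bytes_per_comp) - 1; /* bytes of component 0 */
   unsigned compmask = 0;

   for (unsigned c = 0; c < comps; c++) {
      if (bytemask & (comp_bytes << (c * bytes_per_comp)))
         compmask |= 1u << c;
   }
   return compmask;
}
```

For example, a bytemask covering only the low two bytes of a 32-bit component still yields that component in the result. With round-down semantics it would vanish, and a dead-code pass consuming the mask could then treat a partially used component as unused.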

panfrost: Compact the bo_access readers array · 5e8386c6

Icecream95 authored 5 years ago


Previously, the array bo_access->readers was only cleared when there
were no unsignaled fences, which in some situations never happened.

That resulted in the array having thousands of NULL pointers, but only
a handful of active readers.

With this patch, all the unsignaled readers are moved to the front of
the array, effectively building a new array only containing the active
readers in-place. This results in the readers array usually only having
a couple of elements.

Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Tested-by: Marge Bot <mesa/mesa!3419>
Part-of: <mesa/mesa!3419>

5e8386c6
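
The in-place compaction described above is a standard two-index pass. The sketch below uses a hypothetical `fence` struct and `compact_readers` function rather than the real `bo_access` types, and additionally clears the tail of the array, which is a choice made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a fence that a reader waits on. */
struct fence {
   bool signaled;
};

/* Moves every still-unsignaled reader to the front of the array,
 * preserving order, and returns the new element count. NULL slots and
 * signaled fences are dropped. Runs in place in O(n). */
static size_t
compact_readers(struct fence **readers, size_t count)
{
   assert(readers != NULL || count == 0);

   size_t kept = 0;

   for (size_t i = 0; i < count; i++) {
      struct fence *f = readers[i];
      if (f == NULL || f->signaled)
         continue;
      readers[kept++] = f;
   }
   /* Clear the tail so no stale pointers remain past the new count. */
   for (size_t i = kept; i < count; i++)
      readers[i] = NULL;

   return kept;
}
```

Because the write index never overtakes the read index, no second array is needed, and the array length after the pass equals the number of active readers.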

zink: support arrays of samplers · c0ba9000

Erik Faye-Lund authored 5 years ago


Tested-by: Marge Bot <mesa/mesa!3275>
Part-of: <mesa/mesa!3275>

c0ba9000

zink: support sampling non-float textures · a9023ec5

Erik Faye-Lund authored 5 years ago


Part-of: <mesa/mesa!3275>

a9023ec5

zink: store image-type per texture · 3e1acff5

Erik Faye-Lund authored 5 years ago


Part-of: <mesa/mesa!3275>

3e1acff5

zink: avoid incorrect vector-construction · 5fc1562a

Erik Faye-Lund authored 5 years ago


Part-of: <mesa/mesa!3275>

5fc1562a

zink: support offset-variants of texturing · 8112240d

Erik Faye-Lund authored 5 years ago


Part-of: <mesa/mesa!3275>

8112240d

zink: implement nir_texop_txs · f1a5bcdc

Erik Faye-Lund authored 5 years ago


Part-of: <mesa/mesa!3275>

f1a5bcdc

docs: fixup indentation · 7ee94d1b

Erik Faye-Lund authored 5 years ago


The most canonical indentation-style here is two spaces, which is what
the standard boilerplate in all documents uses. So let's normalize to
that.

Reviewed-by: Eric Engestrom <eric@engestrom.ch>
Tested-by: Marge Bot <mesa/mesa!3443>
Part-of: <mesa/mesa!3443>

7ee94d1b

docs: remove pointless, stray newline · 2ef98947

Erik Faye-Lund authored 5 years ago


Reviewed-by: Eric Engestrom <eric@engestrom.ch>
Part-of: <mesa/mesa!3443>

2ef98947

docs: use [1] instead of asterisk for footnote · 199572b6

Erik Faye-Lund authored 5 years ago


While we're at it, make it a link as well.

Reviewed-by: Eric Engestrom <eric@engestrom.ch>
Part-of: <!3443>

199572b6

docs: remove trailing newlines · 063a2864

Erik Faye-Lund authored 5 years ago


Reviewed-by: Eric Engestrom <eric@engestrom.ch>
Part-of: <!3443>

063a2864

docs: remove leading spaces · 9954120b

Erik Faye-Lund authored 5 years ago


There's no good reason to have leading space in these pre-formatted
blocks. It looks strange, so let's get rid of it.

Reviewed-by: Eric Engestrom <eric@engestrom.ch>
Part-of: <!3443>

9954120b

docs: remove trailing header · c8718627

Erik Faye-Lund authored 5 years ago


This header has been there since the document was added, but contains
nothing. So let's get rid of it.

Reviewed-by: Eric Engestrom <eric@engestrom.ch>
Part-of: <mesa/mesa!3443>

c8718627

docs: use figure/figcaption instead of tables · 37daddd3

Erik Faye-Lund authored 5 years ago


Reviewed-by: Eric Engestrom <eric@engestrom.ch>
Part-of: <mesa/mesa!3443>

37daddd3

docs: do not use definition-list for sub-topics · f5983a6e

Erik Faye-Lund authored 5 years ago


The dl-tag isn't a neat tool for defining sub-headings, it's a semantic
tool for defining terms and their meaning. Let's use normal
sub-headings instead.

To make the last few paragraphs stand out from the above, let's add a
sub-heading for those as well.

Reviewed-by: Eric Engestrom <eric@engestrom.ch>
Part-of: <mesa/mesa!3443>

f5983a6e

Jan 17, 2020

freedreno/a6xx: add PROG_FB_RAST stateobj · 95187083

Rob Clark authored 5 years ago


For the handful of registers that depend on the union of program/
framebuffer/rasterizer state.

Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Tested-by: Marge Bot <mesa/mesa!3435>
Part-of: <mesa/mesa!3435>

95187083

freedreno/a6xx: move dynamic program state to streaming stateobj · 6dc9b292

Rob Clark authored 5 years ago


Move the program state which we can't pre-bake to a streaming state
object, rather than emitting directly in the draw cmdstream.

Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Part-of: <mesa/mesa!3435>

6dc9b292

freedreno/a6xx: drop a few more per-draw registers · d2fd6469

Rob Clark authored 5 years ago


Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Part-of: <mesa/mesa!3435>

d2fd6469

freedreno/a6xx: separate rast stateobj for prim restart · 4d8f42c8

Rob Clark authored 5 years ago


This lets us move PC_PRIMITIVE_CNTL into the rasterizer stateobj, rather
than unconditionally emitting it directly in the cmdstream on every
draw.

This also starts adding some tracking about previous draw state, so that
following patches can limit some of the register writes we currently
emit on every draw.

Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Part-of: <mesa/mesa!3435>

4d8f42c8
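
As a rough illustration of what tracking previous draw state buys, the sketch below remembers the last value written for one register and skips the write when the next draw uses the same value. Everything here is hypothetical (`last_draw_state`, `emit_if_changed`, and the counter standing in for command-stream emission); it is not the freedreno code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-context cache of the last value emitted for a register. */
struct last_draw_state {
   bool valid;      /* false until the first draw has emitted the value */
   uint32_t value;
};

/* Returns true if the register has to be (re)emitted for this draw, and
 * records the new value. `emitted` counts writes, standing in for real
 * command-stream emission. */
static bool
emit_if_changed(struct last_draw_state *last, uint32_t value,
                unsigned *emitted)
{
   assert(last && emitted);

   if (last->valid && last->value == value)
      return false; /* same as previous draw: skip the write */

   last->valid = true;
   last->value = value;
   (*emitted)++;
   return true;
}
```

For a sequence of draws using the values 5, 5, 7, 7, 5, only three writes are emitted instead of five.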

freedreno/a6xx: cleanup rasterizer state · 0e063b30

Rob Clark authored 5 years ago


All but one of the reg values is only used in the stateobj, so we can
inline the register value setup and stateobj construction.  While we
are at it, switch over to the new register builders.

Prep work for next patch.

Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Part-of: <mesa/mesa!3435>

0e063b30

freedreno/a6xx: limit scratch/debug markers to debug builds · fba7e6f8

Rob Clark authored 5 years ago


The overhead does seem to matter when you have a high enough # of draw
calls that affect few bins/pixels, because these writes would happen
unconditionally (i.e. not part of a state-group).

Possibly we could keep these if we moved them into a state-group so the
register writes would be no-ops on bins with no geometry.  OTOH I
usually end up adding in a WFI when using the scratch reg values to
track down a crash.  (So add a WFI to mitigate the annoyance of needing
to use a debug build to get scratch regs to locate the position of a
crash/hang in the cmdstream.)

Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Part-of: <mesa/mesa!3435>

fba7e6f8

iris: Fix some indentation in iris_init_render_context · 5d7381c6

Jordan Justen authored 5 years ago


Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>

5d7381c6

util/vector: Fix u_vector_foreach when head rolls over · c1104e4c

Craig Stout authored 5 years ago


Also add unit tests for u_vector.

Tested-by: Marge Bot <!3453>
Part-of: <!3453>

c1104e4c
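
The class of bug named in the title can be shown with a generic ring buffer whose `head` and `tail` counters only ever increase. This is a sketch under assumed names and layout, not the actual `u_vector` code: once `head` overflows past `UINT32_MAX` it becomes numerically smaller than `tail`, so a loop written as `pos < head` stops too early, while counting `head - tail` elements stays correct under unsigned wraparound.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8u /* must be a power of two */

/* Hypothetical ring buffer: head and tail only ever increase and are
 * reduced modulo RING_SIZE when indexing, so they may wrap past UINT32_MAX. */
struct ring {
   uint32_t head; /* next slot to write */
   uint32_t tail; /* oldest live element */
   int data[RING_SIZE];
};

static void
ring_push(struct ring *r, int value)
{
   assert(r->head - r->tail < RING_SIZE); /* not full */
   r->data[r->head & (RING_SIZE - 1)] = value;
   r->head++;
}

/* Sums all live elements. Comparing `pos < head` directly would end the
 * loop immediately once head has wrapped below tail; counting
 * `head - tail` elements is correct under unsigned overflow. */
static int
ring_sum(const struct ring *r)
{
   uint32_t length = r->head - r->tail;
   int sum = 0;

   for (uint32_t i = 0; i < length; i++)
      sum += r->data[(r->tail + i) & (RING_SIZE - 1)];
   return sum;
}
```

Starting both counters just below `UINT32_MAX` and pushing three values leaves `head` smaller than `tail`, yet `ring_sum` still visits all three elements.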

intel/fs: Switch to standard vector layout for barycentrics at optimization time. · b54b67e0

Francisco Jerez authored 5 years ago


This involves permuting the registers of barycentric vectors to have
the standard X[0-n] Y[0-n] layout at NIR translation time.
Barycentrics are converted to the format expected by the PLN
instruction in the lower_barycentrics() pass run after the
optimization loop.

Main reason is correctness of SIMD32 fragment shaders.  The
shuffle_from_pln_layout() and shuffle_to_pln_layout() helpers used
during NIR translation are busted for SIMD32.  This leads to serious
corruption at present with INTEL_DEBUG=do32, especially on Gen11+
where these helpers are hit more frequently due to the lack of a
hardware PLN instruction.

Of course one could have chosen to fix those helpers instead, but
there is another far more subtle issue that was reported during review
of the SIMD32 fragment shader codegen changes: The SIMD splitting pass
currently handles SIMD32 barycentric vectors as if they had the
standard X[0-n] Y[0-n] layout, even though they are interleaved for
the PLN instruction, which causes incorrect execution masks to be
applied to the MOVs unzipping barycentric vectors in cases where a
LINTERP instruction occurs under non-uniform control flow.

I'm not aware of any conformance regressions due to the latter issue
at present, but for our peace of mind let's move the conversion to the
PLN layout into the lower_barycentrics() pass run after
lower_simd_width().

This leads to the following shader-db improvements (including SIMD32
shaders) in combination with the previous back-end preparation changes
-- Without them (especially the copy propagation changes) this would
lead to a massive number of regressions.  On ICL:

   total instructions in shared programs: 20662316 -> 20466903 (-0.95%)
   instructions in affected programs: 10538474 -> 10343061 (-1.85%)
   helped: 68775
   HURT: 6

   total spills in shared programs: 8938 -> 8748 (-2.13%)
   spills in affected programs: 376 -> 186 (-50.53%)
   helped: 9
   HURT: 5

   total fills in shared programs: 8965 -> 8663 (-3.37%)
   fills in affected programs: 965 -> 663 (-31.30%)
   helped: 9
   HURT: 6

   LOST:   146
   GAINED: 43

On SKL:

   total instructions in shared programs: 18725867 -> 18614912 (-0.59%)
   instructions in affected programs: 3876590 -> 3765635 (-2.86%)
   helped: 27492
   HURT: 2

   LOST:   191
   GAINED: 417

On SNB:

   total instructions in shared programs: 14573613 -> 13980646 (-4.07%)
   instructions in affected programs: 5199074 -> 4606107 (-11.41%)
   helped: 29998
   HURT: 0

   LOST:   21
   GAINED: 30

Results are somewhat less impressive but still significant without
SIMD32 fragment shaders enabled.  On ICL:

   total instructions in shared programs: 16148728 -> 16061659 (-0.54%)
   instructions in affected programs: 6114788 -> 6027719 (-1.42%)
   helped: 42046
   HURT: 6

   total spills in shared programs: 8218 -> 8028 (-2.31%)
   spills in affected programs: 376 -> 186 (-50.53%)
   helped: 9
   HURT: 5

   total fills in shared programs: 8953 -> 8651 (-3.37%)
   fills in affected programs: 965 -> 663 (-31.30%)
   helped: 9
   HURT: 6

   LOST:   0
   GAINED: 3

On SKL:

   total instructions in shared programs: 14927994 -> 14926738 (-0.01%)
   instructions in affected programs: 168850 -> 167594 (-0.74%)
   helped: 711
   HURT: 2

On SNB:

   total instructions in shared programs: 10770538 -> 10734403 (-0.34%)
   instructions in affected programs: 2702172 -> 2666037 (-1.34%)
   helped: 17818
   HURT: 0

All of the hurt shaders are either spilling slightly more or emitting
additional NOP instructions due to the SIMD16 POW workaround for
Gen8-9 combined with differences in scheduling.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

b54b67e0

intel/fs: Introduce barycentric layout lowering pass. · 79bd252d

Francisco Jerez authored 5 years ago


The goal is to represent barycentrics with the standard vector layout
during optimization and particularly SIMD lowering.  Instead of
emitting the barycentric layout conversions at NIR translation time,
do it later as a lowering pass.  For the moment this is only applied
to PI messages, but we'll give the same treatment to LINTERP
instructions too.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

79bd252d
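
The difference between the two layouts can be written down as an index mapping. The sketch below assumes each register holds one group of channels for one component, that the standard layout stores all X registers before all Y registers, and that the PLN layout interleaves X and Y per group. The interleaving rule is an assumption made for this illustration, and the function names are hypothetical rather than taken from the i965 backend.

```c
#include <assert.h>

/* Register index of (component, group) in the standard layout, where all
 * X registers come first: X[0] .. X[n-1], Y[0] .. Y[n-1]. */
static unsigned
standard_index(unsigned component, unsigned group, unsigned groups)
{
   assert(component < 2 && group < groups);
   return component * groups + group;
}

/* Register index in the interleaved layout assumed here for PLN:
 * X[0], Y[0], X[1], Y[1], ... */
static unsigned
pln_index(unsigned component, unsigned group, unsigned groups)
{
   assert(component < 2 && group < groups);
   return 2 * group + component;
}

/* Fills perm so that pln_regs[perm[i]] = standard_regs[i]. */
static void
standard_to_pln(unsigned groups, unsigned *perm)
{
   for (unsigned c = 0; c < 2; c++)
      for (unsigned g = 0; g < groups; g++)
         perm[standard_index(c, g, groups)] = pln_index(c, g, groups);
}
```

With two groups, the standard order `X0 X1 Y0 Y1` maps to positions `0 2 1 3`, giving `X0 Y0 X1 Y1`. With a single group the two layouts coincide. Keeping the standard order until after SIMD lowering means passes that split a vector by channel group can treat X and Y uniformly, and only the final lowering pass needs to know about the interleaving.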

intel/fs: Split fetch_payload_reg() into separate helper for barycentrics. · 44d7d66a

Francisco Jerez authored 5 years ago


We're about to change the layout of barycentric vectors, which will
involve permuting the GRFs of barycentrics fetched from the thread
payload.  Make room for this in a function separate from the generic
fetch_payload_reg(), since the permutation will only be applicable to
barycentric vectors.  This allows simplifying fetch_payload_reg(),
since there was no need for handling multiple-component payload
registers except for barycentrics.

This causes some minor shader-db noise due to the new helper emitting
a LOAD_PAYLOAD instruction unconditionally, but it will be cleaned up
shortly.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

44d7d66a

intel/fs/gen6: Use SEL instead of bashing thread payload for unlit centroid workaround. · 9c9e8010

Francisco Jerez authored 5 years ago


This prevents regressions on SNB due to the redundant MOVs lying
around in cases where fetch_payload_reg() returns a VGRF (currently
only in SIMD32 but soon in pretty much all cases).  The MOVs can't be
register-coalesced due to their source being a FIXED_GRF, and they
can't be copy-propagated either due to the unlit centroid workaround
partial writes.  They can be copy-propagated just fine into a SEL
instruction though.

On SNB this prevents the following shader-db regressions (including
SIMD32 programs) in combination with the interpolation rework part of
this series:

   total instructions in shared programs: 13996898 -> 14001982 (0.04%)
   instructions in affected programs: 197461 -> 202545 (2.57%)
   helped: 0
   HURT: 1251

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

9c9e8010

intel/fs/gen6: Generalize aligned_pairs_class to SIMD16 aligned barycentrics. · 0dd18d70

Francisco Jerez authored 5 years ago


This is mainly meant to avoid shader-db regressions on SNB as we start
using VGRFs for barycentrics more frequently.  Currently the
aligned_pairs_class is only useful in SIMD8 mode, because in SIMD16
mode barycentric vectors are typically 4 GRFs.  This is not a problem
on Gen4-5, because on those platforms all VGRF allocations are
pair-aligned in SIMD16 mode.  However on Gen6 we end up using either
the fast or the slow path of LINTERP rather non-deterministically
based on the behavior of the register allocator.

Fix it by repurposing aligned_pairs_class to hold PLN-aligned
registers of whatever the natural size of a barycentric vector is in
the current dispatch width.

On SNB this prevents the following shader-db regressions (including
SIMD32 programs) in combination with the interpolation rework part of
this series:

   total instructions in shared programs: 13983257 -> 14527274 (3.89%)
   instructions in affected programs: 1766255 -> 2310272 (30.80%)
   helped: 0
   HURT: 11608

   LOST:   26
   GAINED: 13

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

0dd18d70

intel/fs/gen6: Constrain barycentric source of LINTERP during bank conflict mitigation. · 0db4455c

Francisco Jerez authored 5 years ago

This avoids regressions on SNB due to the bank conflict mitigation
pass moving a VGRF-allocated barycentric vector to a misaligned
location, which would prevent the PLN instruction from being used.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

0db4455c

intel/fs/gen4-6: Allocate registers from aligned_pairs_class based on LINTERP use. · 369aef85

Francisco Jerez authored 5 years ago


Previously we would hardcode fs_visitor::delta_xy barycentrics to be
allocated from aligned_pairs_class on hardware with PLN source
alignment restrictions (pre-Gen7).  Instead allocate any registers
consumed by LINTERP from aligned_pairs_class, even if some barycentric
vector had ended up in a temporary.

On SNB this prevents the following shader-db regressions (including
SIMD32 programs) in combination with the interpolation rework part of
this series:

   total instructions in shared programs: 13983257 -> 14527274 (3.89%)
   instructions in affected programs: 1766255 -> 2310272 (30.80%)
   helped: 0
   HURT: 11608

   LOST:   26
   GAINED: 13

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

369aef85

intel/fs: Allow limited copy propagation of a LOAD_PAYLOAD into another. · 54b1b71e

Francisco Jerez authored 5 years ago


This is particularly useful in cases where register coalesce is
unlikely to succeed because the LOAD_PAYLOAD isn't a plain copy --
E.g. when a LOAD_PAYLOAD is shuffling the contents of a barycentric
vector in order to transform it into the PLN layout.

This prevents the following shader-db regressions (including SIMD32
programs) in combination with the interpolation rework part of this
series.  On SKL:

   total instructions in shared programs: 18596672 -> 18976097 (2.04%)
   instructions in affected programs: 7937041 -> 8316466 (4.78%)
   helped: 39
   HURT: 67427

   LOST:   466
   GAINED: 220

On SNB:

   total instructions in shared programs: 13993866 -> 14202963 (1.49%)
   instructions in affected programs: 7611309 -> 7820406 (2.75%)
   helped: 624
   HURT: 52943

   LOST:   6
   GAINED: 18

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

54b1b71e

intel/fs: Add support for copy-propagating a block of multiple FIXED_GRFs. · 8eb4f209

Francisco Jerez authored 5 years ago


In cases where a LOAD_PAYLOAD instruction copies a single block of
sequential GRF registers into the destination (see
is_identity_payload()), splitting the block copy into a number of ACP
entries (one for each LOAD_PAYLOAD source) is undesirable, because
that prevents copy propagation into any instructions which read
multiple components at once with the same source (the barycentric
source of the LINTERP instruction is going to be the overwhelmingly
most common example).

Technically it would also be possible to do this for VGRF sources, but
there is little benefit from that since register coalesce already
covers many of those cases -- There is no way for a block of
FIXED_GRFs to be coalesced into a VGRF though.

This prevents the following shader-db regressions (including SIMD32
programs) in combination with the interpolation rework part of this
series.  On SKL:

   total instructions in shared programs: 18595160 -> 18828562 (1.26%)
   instructions in affected programs: 13374946 -> 13608348 (1.75%)
   helped: 7
   HURT: 108977

   total spills in shared programs: 9116 -> 9106 (-0.11%)
   spills in affected programs: 404 -> 394 (-2.48%)
   helped: 7
   HURT: 9

   total fills in shared programs: 8994 -> 9176 (2.02%)
   fills in affected programs: 898 -> 1080 (20.27%)
   helped: 7
   HURT: 9

   LOST:   469
   GAINED: 220

On SNB:

   total instructions in shared programs: 13996898 -> 14096222 (0.71%)
   instructions in affected programs: 8088546 -> 8187870 (1.23%)
   helped: 2
   HURT: 66520

   total spills in shared programs: 2985 -> 2961 (-0.80%)
   spills in affected programs: 632 -> 608 (-3.80%)
   helped: 2
   HURT: 0

   total fills in shared programs: 3144 -> 3128 (-0.51%)
   fills in affected programs: 1515 -> 1499 (-1.06%)
   helped: 2
   HURT: 0

   LOST:   0
   GAINED: 4

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

8eb4f209
