Commits · main · sewn / mesa

Dec 12, 2024

meson: Set dri configuration features as empty appropriately · d5adbef6
sewn authored 2 months ago

aco/lower_branches: remove edges between blocks if there is no direct branch · 26a3038b

Daniel Schürmann authored 2 months ago and

Marge Bot committed 2 months ago

This way, linear predecessors and successors better reflect the
actual control flow which improves wait state insertion and hazard
mitigation.
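
A minimal sketch of the pruning rule follows. It assumes a hypothetical toy block type (`struct toy_block`, not ACO's real `Block`) that records both its linear successors and the blocks its final branch can actually reach. An edge survives only if a branch target or a fall-through justifies it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_EDGES 4

/* Hypothetical toy block; ACO's real Block type looks different. */
struct toy_block {
   unsigned succs[TOY_MAX_EDGES];   /* linear successors recorded in the CFG */
   size_t num_succs;
   unsigned targets[TOY_MAX_EDGES]; /* blocks the final branch can jump to */
   size_t num_targets;
   bool falls_through;              /* execution may continue into index + 1 */
};

/* True if control can actually flow from block `self` to block `succ`. */
static bool
toy_has_direct_edge(const struct toy_block *b, unsigned self, unsigned succ)
{
   if (b->falls_through && succ == self + 1)
      return true;
   for (size_t i = 0; i < b->num_targets; i++) {
      if (b->targets[i] == succ)
         return true;
   }
   return false;
}

/* Compact succs[] in place, keeping only edges some branch can take.
 * Returns the number of edges removed. */
static size_t
toy_prune_linear_succs(struct toy_block *b, unsigned self)
{
   assert(b->num_succs <= TOY_MAX_EDGES);
   size_t kept = 0;
   for (size_t i = 0; i < b->num_succs; i++) {
      if (toy_has_direct_edge(b, self, b->succs[i]))
         b->succs[kept++] = b->succs[i];
   }
   size_t removed = b->num_succs - kept;
   b->num_succs = kept;
   return removed;
}
```

For example, block 2 ending in an unconditional jump to block 7 but still listing block 3 as a linear successor loses the 2->3 edge. A real pass would also drop the matching entry from block 3's predecessor list.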

Totals from 10252 (12.91% of 79395) affected shaders: (Navi31)

Instrs: 18824540 -> 18803823 (-0.11%); split: -0.11%, +0.00%
CodeSize: 99025464 -> 98942028 (-0.08%); split: -0.08%, +0.00%
Latency: 169291854 -> 165781877 (-2.07%); split: -2.07%, +0.00%
InvThroughput: 29701086 -> 29228602 (-1.59%); split: -1.59%, +0.00%
SClause: 510587 -> 510586 (-0.00%)
Part-of: <mesa/mesa!32389>

aco: move branch lowering optimization into separate file 'aco_lower_branches.cpp' · 22ffe720
Daniel Schürmann authored 3 months ago and Marge Bot committed 2 months ago
```
No fossil changes.

Part-of: <mesa/mesa!32389>
```
aco/lower_to_hw_instr: Check the right instruction's opcode · 845660f2

Natalie Vock authored 3 months ago and

Marge Bot committed 2 months ago

instr is the branch instruction, so its opcode will never be writelane.
We should check inst instead.

Found by inspection.

Cc: mesa-stable
Part-of: <mesa/mesa!32389>

aco/jump_threading: remove branch sequence optimization · 28ab7f01
Daniel Schürmann authored 3 months ago and Marge Bot committed 2 months ago
```
This optimization is now applied during postRA optimization.

No fossil changes.

Part-of: <mesa/mesa!32330>
```
aco: move try_optimize_branching_sequence() to postRA optimizations · fcd94a8c

Daniel Schürmann authored 3 months ago and

Marge Bot committed 2 months ago

Totals from 196 (0.25% of 79206) affected shaders: (Navi31)

Instrs: 534343 -> 534438 (+0.02%); split: -0.00%, +0.02%
CodeSize: 2774852 -> 2775420 (+0.02%); split: -0.00%, +0.02%
Latency: 7103512 -> 7103021 (-0.01%); split: -0.01%, +0.00%
InvThroughput: 959477 -> 959447 (-0.00%)
Copies: 42646 -> 42648 (+0.00%)
Part-of: <mesa/mesa!32330>

aco/optimizer_postRA: set branch()->never_taken if exec is constant non-zero · 95d44c7c
Daniel Schürmann authored 2 months ago and Marge Bot committed 2 months ago
```
Part-of: <mesa/mesa!32330>
```
aco/print_ir: don't print disconnected empty blocks · d67932f6
Daniel Schürmann authored 2 months ago and Marge Bot committed 2 months ago
```
Part-of: <mesa/mesa!32330>
```
anv: document UBO descriptor range alignments · 2bb98a8f

Lionel Landwerlin authored 2 months ago and

Marge Bot committed 2 months ago


Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <mesa/mesa!32347>

intel/decoder: fix COMPUTE_WALKER handling · 99bb2a08

Lionel Landwerlin authored 2 months ago and

Marge Bot committed 2 months ago


Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 17096f87 ("intel: Switch to COMPUTE_WALKER_BODY")
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <!32347>

brw: Combine convergent texture buffer fetches into fewer loads · 6341b3cd

Kenneth Graunke authored 2 months ago and

Marge Bot committed 2 months ago


Borderlands 3 (both DX11 and DX12 renderers) has a common pattern
across many shaders:

  con 32x4 %510 = (uint32)txf %2 (handle), %1191 (0x10) (coord), %1 (0x0) (lod), 0 (texture)
  con 32x4 %512 = (uint32)txf %2 (handle), %1511 (0x11) (coord), %1 (0x0) (lod), 0 (texture)
  ...
  con 32x4 %550 = (uint32)txf %2 (handle), %1549 (0x25) (coord), %1 (0x0) (lod), 0 (texture)
  con 32x4 %552 = (uint32)txf %2 (handle), %1551 (0x26) (coord), %1 (0x0) (lod), 0 (texture)

A single basic block contains piles of texelFetches from a 1D buffer
texture, with constant coordinates.  In most cases, only the .x channel
of the result is read.  So we have something on the order of 28 sampler
messages, each asking for...a single uint32_t scalar value.  Because our
sampler doesn't have any support for convergent block loads (like the
untyped LSC transpose messages for SSBOs)...this means we were emitting
SIMD8/16 (or SIMD16/32 on Xe2) sampler messages for every single scalar,
replicating what's effectively a SIMD1 value to the entire register.
This is hugely wasteful, both in terms of register pressure, and also in
back-and-forth sending and receiving memory messages.

The good news is we can take advantage of our explicit SIMD model to
handle this more efficiently.  This patch adds a new optimization pass
that detects a series of SHADER_OPCODE_TXF_LOGICAL, in the same basic
block, with constant offsets, from the same texture.  It constructs a
new divergent coordinate where each channel is one of the constants
(i.e. <10, 11, 12, ..., 26> in the above example).  It issues a new
NoMask divergent texel fetch which loads N useful channels in one go,
and replaces the rest with expansion MOVs that splat the SIMD1 result
back to the full SIMD width.  (These get copy propagated away.)

We can pick the SIMD size of the load independently of the native shader
width as well.  On Xe2, those 28 convergent loads become a single SIMD32
ld message.  On earlier hardware, we use 2 SIMD16 messages.  Or we can
use a smaller size when there aren't many to combine.
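
The lane assignment behind this pass can be sketched as follows. This is a hypothetical planner, not the brw code: it only decides which combined load and which channel each original scalar fetch maps to, and builds the divergent coordinate vector. Filling spare channels with the last coordinate is an assumption made here so every lane holds a valid address.

```c
#include <assert.h>
#include <stddef.h>

/* Where one original scalar fetch ends up after combining. */
struct toy_fetch_slot {
   unsigned load; /* index of the combined load message */
   unsigned lane; /* channel of that load holding the value */
};

/* Hypothetical planner for n convergent txf fetches with constant
 * coordinates from the same texture. coord_vec receives the divergent
 * coordinate (one constant per channel) for all combined loads, so it
 * needs room for num_loads * simd_width entries. Returns num_loads. */
static unsigned
toy_plan_combined_fetches(const unsigned *coords, size_t n,
                          unsigned simd_width, unsigned *coord_vec,
                          struct toy_fetch_slot *slots)
{
   assert(simd_width > 0);
   if (n == 0)
      return 0;

   unsigned num_loads = (unsigned)((n + simd_width - 1) / simd_width);
   for (size_t i = 0; i < (size_t)num_loads * simd_width; i++) {
      /* Assumption: spare channels repeat the last coordinate. */
      coord_vec[i] = coords[i < n ? i : n - 1];
   }
   for (size_t i = 0; i < n; i++) {
      slots[i].load = (unsigned)(i / simd_width);
      slots[i].lane = (unsigned)(i % simd_width);
   }
   return num_loads;
}
```

With 28 fetches, simd_width = 32 yields one load and simd_width = 16 yields two, matching the Xe2 and pre-Xe2 counts above. Each original result then becomes a broadcast of channel `slots[i].lane` from load `slots[i].load`, which is the job of the expansion MOVs.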

In fossil-db, this cuts 27% of send messages in affected shaders, 3-6%
of cycles, 2-3% of instructions, and 8-12% of live registers.  On A770,
this improves performance of Borderlands 3 by roughly 2.5-3.5%.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <!32573>

Dec 11, 2024

aco/assembler: Don't emit target basic block index when chaining branches · 22881712

Daniel Schürmann authored 2 months ago and

Marge Bot committed 2 months ago

This could erroneously cause an assertion to fail if the
target block index was larger than UINT16_MAX.

Fixes: cab5639a ('aco/assembler: chain branches instead of emitting long jumps')
Part-of: <mesa/mesa!32599>

panvk/ci: update g52-vk-full job · 445ff2e5

Erik Faye-Lund authored 2 months ago and

Marge Bot committed 2 months ago


On a single runner, this job currently times out due to taking over 5
hours. The estimate from the dEQP runner itself suggests a full run might
take over 8 hours with the current configuration. We can't really work
with runs that long, even if they are manual.

We currently have 7 vim3 runners, so we can afford to parallelize the
run to make it more manageable. If we choose 4, we take up a bit more
than half of the runners, but we leave two runners (plus a spare) for
the pre-merge CI.

With this, each job takes about 2.5 hours. We leave the timeout at 3
hours for now, to have some headroom for new tests being enabled.

Acked-by: Daniel Stone <daniels@collabora.com>
Part-of: <!32591>

panvk/ci: update g52 results · bdbcd7c7
Erik Faye-Lund authored 2 months ago and Marge Bot committed 2 months ago
```
Acked-by: Daniel Stone <daniels@collabora.com>
Part-of: <!32591>
```
panvk/ci: remove duplicate skips · 8b969d78
Erik Faye-Lund authored 2 months ago and Marge Bot committed 2 months ago
```
Acked-by: Daniel Stone <daniels@collabora.com>
Part-of: <!32591>
```
intel/compiler: Use #pragma once instead of header guards · abe41b1d
Caio Oliveira authored 2 months ago and Marge Bot committed 2 months ago
```
Acked-by: Matt Turner <mattst88@gmail.com>
Part-of: <!32534>
```
amd: add GFX v11.5.3 support · ad75b9f1

Tim Huang authored 2 months ago and

Marge Bot committed 2 months ago


This enables support for GFX version 11.5.3.

Signed-off-by: Tim Huang <tim.huang@amd.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <mesa/mesa!32567>

util/format: nr_channels is always <= 4 · 5b42da1b

Juan A. Suárez authored 4 months ago and

Marge Bot committed 2 months ago


While nr_channels is defined with 3 bits, which allows up to 7
channels, the actual number of channels is always less than or equal to 4.

This adds an assertion that helps static analyzers avoid several
false positives related to this.
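
A small sketch of why the assertion helps. `struct toy_format_desc` is a hypothetical stand-in that keeps only a 3-bit `nr_channels` field; per-channel data is sized for the real maximum of 4, so without the assert an analyzer sees an index range of 0..7 and flags a possible out-of-bounds read.

```c
#include <assert.h>

/* Hypothetical stand-in for a format description. */
struct toy_format_desc {
   unsigned nr_channels : 3; /* 3 bits can encode 0..7 */
};

/* Sums per-channel bit sizes; sizes[] only has room for 4 channels. */
static unsigned
toy_sum_channel_sizes(const struct toy_format_desc *desc,
                      const unsigned sizes[4])
{
   /* Tells the analyzer the loop bound never exceeds the array size. */
   assert(desc->nr_channels <= 4);
   unsigned total = 0;
   for (unsigned i = 0; i < desc->nr_channels; i++)
      total += sizes[i];
   return total;
}
```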

Reviewed-by: Erik Faye-Lund <erik.faye-lund@collabora.com>
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Part-of: <!32589>

radv: remove remaining discard to demote options · 167f4a87

Samuel Pitoiset authored 2 months ago and

Marge Bot committed 2 months ago


This is the default but the option wasn't completely removed.

Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <!32590>

intel/dev: update mesa_defs.json from internal database · 97fc9874

Tapani Pälli authored 2 months ago and

Marge Bot committed 2 months ago


This updates the entry for 14017823839, which fixes issues on BMG with:
   dEQP-VK.compute.pipeline.zero_initialize_workgroup_memory.max_workgroup_memory.1

Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <!32550>

panvk: advertise version 1.1 support · a6e03ce4

Eric Smith authored 2 months ago and

Marge Bot committed 2 months ago


We know we have a broken Vulkan driver, so it's debatable whether it's
a broken Vulkan 1.0 or a broken 1.1. Advertising 1.1 lets us run more
tests, so this patch does that. We also bump the instance version id
to 1.4, which seems appropriate since the overall Vulkan infrastructure
within Mesa is at that level.

Reviewed-by: Erik Faye-Lund <erik.faye-lund@collabora.com>
Part-of: <!32464>

panvk: split device and instance version numbers · 2627d793

Eric Smith authored 2 months ago and

Marge Bot committed 2 months ago


We were using the same routine to find the device and instance
version numbers. This isn't correct; the device version may
vary based on the physical hardware we are using, but the
instance version should always be the same.
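
A sketch of the split, assuming hypothetical helpers `toy_instance_version()` and `toy_device_version()`. `TOY_MAKE_API_VERSION` is defined locally to mirror the bit layout of Vulkan's `VK_MAKE_API_VERSION` (variant, major, minor, patch) so no Vulkan headers are needed. The boolean parameter is a placeholder for whatever per-GPU check a driver performs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same layout as VK_MAKE_API_VERSION: variant:3 | major:7 | minor:10 | patch:12. */
#define TOY_MAKE_API_VERSION(variant, major, minor, patch) \
   ((((uint32_t)(variant)) << 29U) | (((uint32_t)(major)) << 22U) | \
    (((uint32_t)(minor)) << 12U) | ((uint32_t)(patch)))

/* Instance version: independent of the GPU, so it is a constant. */
static uint32_t
toy_instance_version(void)
{
   return TOY_MAKE_API_VERSION(0, 1, 4, 0);
}

/* Device version: may vary per physical device (hypothetical check). */
static uint32_t
toy_device_version(bool hw_supports_1_1)
{
   return hw_supports_1_1 ? TOY_MAKE_API_VERSION(0, 1, 1, 0)
                          : TOY_MAKE_API_VERSION(0, 1, 0, 0);
}
```

Keeping two functions means a device-specific limitation can lower the device version without also lowering what the instance reports.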

Reviewed-by: Erik Faye-Lund <erik.faye-lund@collabora.com>
Part-of: <!32464>

panvk: update feature support · 605c173f

Eric Smith authored 2 months ago and

Marge Bot committed 2 months ago


Turn on `imageCubeArray` and `fragmentStoresAndAtomics`, which we
already support (the latter only on v10 and later).

Reviewed-by: Erik Faye-Lund <erik.faye-lund@collabora.com>
Part-of: <!32464>

ir3/cp: add support for swapping srcs of sad · f80ac64e

Job Noorman authored 2 months ago and

Marge Bot committed 2 months ago


Like mad, it's sometimes useful to swap the srcs of sad, since not all
flags are allowed on all srcs. However, unlike mad, sad is 3-src
commutative, so more srcs can be swapped.
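
A sketch of the commutativity difference, ignoring modifiers. The enum and functions are hypothetical, and sources are 0-based here while the commit messages count from 1. mad computes `a * b + c`, so only the two multiplicands may trade places; sad (iadd3) computes `a + b + c`, so any pair may. A real swap must additionally check which flags each slot accepts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum toy_op { TOY_MAD, TOY_SAD };

/* Reference semantics without modifiers. */
static int32_t
toy_eval(enum toy_op op, int32_t a, int32_t b, int32_t c)
{
   return op == TOY_MAD ? a * b + c : a + b + c;
}

/* Whether swapping 0-based sources i and j preserves the result. */
static bool
toy_can_swap(enum toy_op op, unsigned i, unsigned j)
{
   assert(i < 3 && j < 3);
   if (i == j || op == TOY_SAD)
      return true; /* sad is fully 3-src commutative */
   /* mad: only the two multiplicands commute */
   return (i == 0 && j == 1) || (i == 1 && j == 0);
}
```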

Signed-off-by: Job Noorman <jnoorman@igalia.com>
Part-of: <!32501>

ir3/cp: make try_swap_mad_two_srcs more generic · ea2a75f8

Job Noorman authored 2 months ago and

Marge Bot committed 2 months ago


In preparation for supporting sad, rename to try_swap_cat3_two_srcs and
add an argument for src n.

Signed-off-by: Job Noorman <jnoorman@igalia.com>
Part-of: <!32501>

ir3/cp: extract common src swapping code · 00656526

Job Noorman authored 2 months ago and

Marge Bot committed 2 months ago


In preparation for supporting sad (which, like mad, may benefit from
swapping some of its srcs), extract the swapping from
try_swap_mad_two_srcs so that it can be reused for sad. This is
necessary since, unlike mad, sad might also benefit from swapping srcs
1->2 (instead of only 2->1) or 3->2.

Signed-off-by: Job Noorman <jnoorman@igalia.com>
Part-of: <!32501>

ir3/cp: only mark mad srcs as swapped when swap succeeded · e615f30b

Job Noorman authored 2 months ago and

Marge Bot committed 2 months ago


We would mark mad srcs as swapped as soon as we tried swapping them,
even if the swap did not succeed. However, a new opportunity for swapping
might come up later (especially after running ir3_shared_folding).
Therefore, we should only mark the srcs as swapped when the swap
actually succeeded.
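
A sketch of the corrected flag handling, using a hypothetical `struct toy_instr` rather than ir3's instruction type. We assume the flag exists to stop the same srcs from being swapped back and forth; `swap_is_legal` stands in for the real legality check. The point is that a failed attempt leaves the flag clear, so a later pass can still try.

```c
#include <stdbool.h>

/* Hypothetical instruction state. */
struct toy_instr {
   bool srcs_swapped; /* set only once a swap has been committed */
};

static bool
toy_try_swap_srcs(struct toy_instr *instr, bool swap_is_legal)
{
   if (instr->srcs_swapped)
      return false; /* already swapped once: don't swap back */
   if (!swap_is_legal)
      return false; /* keep the flag clear for a later opportunity */
   instr->srcs_swapped = true;
   return true;
}
```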

Signed-off-by: Job Noorman <jnoorman@igalia.com>
Part-of: <!32501>

ir3: add codegen for sad · 2573c1d7

Job Noorman authored 2 months ago and

Marge Bot committed 2 months ago


It turns out that sad is just iadd3. I assume it's an acronym for "Sum of
Absolute Differences", which may make sense since its 2nd src supports
(neg), which would allow SAD to be implemented using this instruction.
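
To illustrate the guess about the name, here is a sketch of sad's semantics as described in these commits (iadd3, with (neg) allowed only on the 2nd src) and of how one sum-of-absolute-differences step could use it. Both functions are hypothetical; ordering the operands so the larger one comes first is our illustration, not something the commit says the compiler does.

```c
#include <stdbool.h>
#include <stdint.h>

/* sad as iadd3: src0 + src1 + src2, with optional (neg) on src1 only. */
static int32_t
toy_sad(int32_t src0, int32_t src1, bool neg_src1, int32_t src2)
{
   return src0 + (neg_src1 ? -src1 : src1) + src2;
}

/* acc + |a - b| using a single sad: put the larger operand first and
 * negate the smaller one, so the difference is never negative. */
static int32_t
toy_sad_step(int32_t a, int32_t b, int32_t acc)
{
   int32_t hi = a > b ? a : b;
   int32_t lo = a > b ? b : a;
   return toy_sad(hi, lo, true, acc);
}
```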

NIR already supports algebraic patterns for selecting iadd3 so adding
codegen support in ir3 is trivial. However, sad seems to have the same
hardware limitation as mad and doesn't support the scalar ALU so we have
to make sure to disable it when emitting iadd3.

Signed-off-by: Job Noorman <jnoorman@igalia.com>
Part-of: <!32501>

ir3: teach backend about sad · ed58a868

Job Noorman authored 2 months ago and

Marge Bot committed 2 months ago


It only supports (neg) in its 2nd src but other than that has the same
properties as mad.

Signed-off-by: Job Noorman <jnoorman@igalia.com>
Part-of: <!32501>

ir3/isa: fix isaspec for sad.s32 · 49c7a22a

Job Noorman authored 2 months ago and

Marge Bot committed 2 months ago


FULL should be true here. This was also tested in computerator, so the
comment about uncertainty can be removed.

Signed-off-by: Job Noorman <jnoorman@igalia.com>
Part-of: <!32501>

ir3/isa: fix cat3-alt immed src · 943f666b

Job Noorman authored 2 months ago and

Marge Bot committed 2 months ago


The override used for the immed encoding in #cat3-src-const-or-immed
used a pattern which isn't supported in overrides by isaspec. The
pattern in the base bitset (10) was too strict for immediates since it
didn't allow the most significant bit to be 1.

Fix this by making the base pattern 1 and adding an assert for the next
bit to be 0 in the non-immed case.

Signed-off-by: Job Noorman <jnoorman@igalia.com>
Fixes: 1c6c200c ("ir3: add newly found shlg.b16 instruction")
Part-of: <!32549>

format: Add R8_G8B8_422_UNORM format · 6f958705

Eric Smith authored 2 months ago and

Marge Bot committed 2 months ago


This is the format that drivers will want to use for NV16
without YUV conversion (if they support this natively).
Previously we had NV16 working but it was always emulated
with R8 + GR88.
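
For reference, this sketch computes the plane layout of the R8 + GR88 emulation. NV16 is 4:2:2 semi-planar: a full-resolution Y plane (one R8 texel per pixel) plus an interleaved Cb/Cr plane at half horizontal and full vertical resolution (one two-byte GR88 texel per pixel pair). The helper is hypothetical and assumes an even width and tightly packed rows. A driver supporting R8_G8B8_422_UNORM natively can describe both planes with that single format instead.

```c
#include <assert.h>
#include <stddef.h>

struct toy_nv16_layout {
   size_t y_bytes;   /* plane 0: width * height R8 texels */
   size_t uv_width;  /* plane 1 width in GR88 texels */
   size_t uv_height; /* 4:2:2 keeps full vertical chroma resolution */
   size_t uv_bytes;
};

static struct toy_nv16_layout
toy_nv16_plane_sizes(size_t width, size_t height)
{
   assert(width % 2 == 0); /* one Cb/Cr pair per two luma samples */
   struct toy_nv16_layout l;
   l.y_bytes = width * height;
   l.uv_width = width / 2;
   l.uv_height = height;
   l.uv_bytes = l.uv_width * l.uv_height * 2; /* 2 bytes per GR88 texel */
   return l;
}
```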

Fixes: 440b6921 ("dri, mesa: fix NV16 texture format")
Reviewed-by: Erik Faye-Lund <erik.faye-lund@collabora.com>
Part-of: <!32524>

nir: make ballot ALU and mbcnt_amd operations reorderable · 26790e90

Rhys Perry authored 2 months ago and

Marge Bot committed 2 months ago


These can be lowered to ALU and load_subgroup_invocation, all of which are
reorderable.

Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <!32512>

nir/move_discards_to_top: don't move across more intrinsics · 650468fb

Rhys Perry authored 2 months ago and

Marge Bot committed 2 months ago


This missed dpp16_shift_amd, lane_permute_16_amd, last_invocation and
ballot_relaxed.

Instead, list the non-reorderable intrinsics which are allowed to be moved
after discards.
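
A sketch of the allow-list design, with hypothetical placeholder intrinsics rather than the real NIR list. Reorderable intrinsics may always end up after a discard; among the non-reorderable ones, only explicitly listed ones may, and everything else blocks the move. A newly added intrinsic is therefore handled conservatively by default, which is what the old deny-list approach got wrong.

```c
#include <stdbool.h>

/* Hypothetical placeholders; nir_intrinsic_op is far larger. */
enum toy_intrinsic {
   TOY_PURE_LOAD,         /* reorderable */
   TOY_CROSS_LANE_OP,     /* not reorderable, not allowed */
   TOY_ALLOWED_EXCEPTION, /* not reorderable, but explicitly allowed */
};

static bool
toy_is_reorderable(enum toy_intrinsic op)
{
   return op == TOY_PURE_LOAD;
}

static bool
toy_may_move_after_discard(enum toy_intrinsic op)
{
   if (toy_is_reorderable(op))
      return true;
   switch (op) {
   case TOY_ALLOWED_EXCEPTION:
      return true;
   default:
      return false; /* unknown or unlisted: keep it before the discard */
   }
}
```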

Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <!32512>

nir: make load_helper_invocation non-reorderable · 5368569d

Rhys Perry authored 2 months ago and

Marge Bot committed 2 months ago


This can't be moved to after demote, so it's not reorderable.

Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Part-of: <!32512>

panvk: expose scalarBlockLayout · d1357b1e

Erik Faye-Lund authored 2 months ago and

Marge Bot committed 2 months ago


This just works on Mali, nothing fancy needed.

Unfortunately, this triggers a lot of timeouts, presumably due to
uncached CPU access to memory. So lots of extra skips here.

Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Part-of: <!32562>

aco/ra: don't write to scc/ttmp with s_fmac · 65506e63

Georg Lehmann authored 2 months ago and

Marge Bot committed 2 months ago


Fixes: 4bd229ac ("aco/gfx11.5: select SOP2 float instructions")

Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <!32545>

aco/ra: disallow s_cmpk with scc operand · 0b9e2a54

Georg Lehmann authored 2 months ago and

Marge Bot committed 2 months ago


Fixes: 2d6b0a41 ("aco/optimizer: Optimize SOPC with literal to SOPK.")

Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <!32545>

aco/ra: don't write to exec/ttmp with mulk/addk/cmovk · fe0c72ca

Georg Lehmann authored 2 months ago and

Marge Bot committed 2 months ago


ttmp SGPRs are read-only outside of trap handlers, so the instructions were
probably skipped. RA should also never create additional exec writes.
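
A sketch of the destination check. SOPK forms such as s_mulk, s_addk and s_cmovk read and write the same SGPR, so RA may only pick them when that register is writable. The register numbers below are assumed GFX9+ scalar operand encodings (ttmp0-15 at 108-123, exec_lo/exec_hi at 126/127) and the helper name is hypothetical; check the ISA reference for the target generation.

```c
#include <stdbool.h>

/* Assumed GFX9+ scalar operand encodings. */
#define TOY_TTMP_FIRST 108u /* ttmp0 */
#define TOY_TTMP_LAST  123u /* ttmp15 */
#define TOY_EXEC_LO    126u
#define TOY_EXEC_HI    127u

/* May RA use a SOPK form whose destination (also its first source) is reg? */
static bool
toy_sopk_dst_ok(unsigned reg)
{
   if (reg >= TOY_TTMP_FIRST && reg <= TOY_TTMP_LAST)
      return false; /* ttmps are read-only outside trap handlers */
   if (reg == TOY_EXEC_LO || reg == TOY_EXEC_HI)
      return false; /* RA must not introduce new exec writes */
   return true;
}
```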

Fixes: e0677328 ("aco/ra: Optimize some SOP2 instructions with literal to SOPK.")

Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <!32545>

aco/gfx12: don't assume memory operations complete in order · 576a2e79
Georg Lehmann authored 2 months ago and Marge Bot committed 2 months ago
```
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <!32569>
```