- Jan 18, 2019
-
-
Faith Ekstrand authored
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
-
Faith Ekstrand authored
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
-
Faith Ekstrand authored
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
-
Faith Ekstrand authored
Annoyingly, this requires that we implement integer division on the command streamer. Fortunately, we're only ever dividing by constants so we can use the mulh+add+shift trick and it's not as bad as it sounds. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
-
Faith Ekstrand authored
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
-
Faith Ekstrand authored
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
-
Faith Ekstrand authored
-
If we have a transform feedback output like:

    float[2] x2_out (VARYING_SLOT_VAR1.x, 0, 0)

which is lowered by nir_lower_io_arrays_to_elements to:

    float x2_out (VARYING_SLOT_VAR1.x, 0, 0)
    float x2_out@5 (VARYING_SLOT_VAR2.x, 0, 0)

we have to update the destination offset to avoid overwriting the same value.

Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
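A hedged sketch of the idea (the split loop and elem_size are illustrative, not the pass's exact code): each element variable must advance both the varying location and the XFB destination offset.

    /* hypothetical: splitting float[N] into N scalar variables; elem_size
     * is the byte size of one element (4 for a float) */
    for (unsigned i = 0; i < array_len; i++) {
       nir_variable *elem = elements[i];
       elem->data.location = var->data.location + i;
       elem->data.offset = var->data.offset + i * elem_size; /* the fix */
    }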
-
When an xfb buffer is explicitly declared on a varying variable, we shouldn't remove it at link time. Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
-
Faith Ekstrand authored
-
Faith Ekstrand authored
-
In order to allow nir_gather_xfb_info to be used with OpenGL, specifically ARB_gl_spirv.

From the OpenGL 4.6 spec, section 11.1.2.1, "Output Variables":

    "outputs specifying both an *XfbBuffer* and an *Offset* are captured,
     while outputs not specifying both of these are not captured. Values
     are captured each time the shader writes to such a decorated object."

This implies that outputs are captured if both decorations are present, and not captured if either is missing. Technically, it doesn't explicitly say that having just one or the other is a mistake. In some cases, glslang adds an extra XfbBuffer without a corresponding XfbOffset, and maintains that this is technically not a bug (see issue#1526). For the case of Vulkan, as the same glslang issue mentions, it is not clear whether that should be an error. But even if it is, it doesn't really need to be checked in the driver; we can let the validation layers check it.

v2: simplify explicit_xfb_buffer and explicit_offset checks (Jason).

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
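In NIR terms, the v2 check reduces to one condition per output variable (a sketch; the surrounding gather loop is assumed, the field names come from the commit message):

    nir_foreach_variable(var, &shader->outputs) {
       /* capture only when both XfbBuffer and Offset were explicit */
       if (!var->data.explicit_xfb_buffer || !var->data.explicit_offset)
          continue;
       /* ... record this output in the xfb info ... */
    }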
-
Faith Ekstrand authored
Before, we were double-counting the component slots when we had a dvec3 or dvec4. Instead, just add them in once and manually offset the recorded output offset.
-
Faith Ekstrand authored
Instead of setting interface_type to whatever the per-vertex type is, we only set it on blocks. This allows later passes to tell the difference between variables that are in blocks and those that aren't.
-
Faith Ekstrand authored
Instead of splitting every per-vertex struct, just split the ones that are actually blocks. The reason for the split is so that we have separate variables for separate locations, qualifiers, and builtin decorations. The vulkan spec only allows these on members of blocks.
-
Faith Ekstrand authored
We need the offset decorations for XFB.
-
Faith Ekstrand authored
This is the "no offset specified" value.
-
Faith Ekstrand authored
This seems to make the simulator happier. The early return wasn't really protecting anything and the code that follows will happily initialize the dummy element to STORE_0 and emit it.
-
Conditional rendering affects the following functions:
- vkCmdDraw, vkCmdDrawIndexed, vkCmdDrawIndirect, vkCmdDrawIndexedIndirect
- vkCmdDrawIndirectCountKHR, vkCmdDrawIndexedIndirectCountKHR
- vkCmdDispatch, vkCmdDispatchIndirect, vkCmdDispatchBase
- vkCmdClearAttachments

The value from the conditional buffer is cached in a designated register; MI_PREDICATE is emitted every time conditional rendering is enabled and a command requires it.

v2: by Jason Ekstrand
- Use vk_find_struct_const instead of manually looping
- Move draw count loading to prepare function
- Zero the top 32 bits of MI_ALU_REG15

v3: Apply pipeline flush before accessing conditional buffer (the issue was found by Samuel Iglesias)

v4:
- Remove support for Haswell due to a possible hardware bug
- Made TMP_REG_PREDICATE and TMP_REG_DRAW_COUNT defines to define registers in one place.

v5: thanks to Jason Ekstrand and Lionel Landwerlin
- Work around the fact that MI_PREDICATE_RESULT is not accessible on Haswell by manually calculating MI_PREDICATE_RESULT and re-emitting MI_PREDICATE when necessary.

v6: suggested by Lionel Landwerlin
- Instead of calculating the result of the predicate once, re-emit MI_PREDICATE to make it easier to investigate error states.

v7: suggested by Jason
- Make anv_pipe_invalidate_bits_for_access_flag add CS_STALL if VK_ACCESS_CONDITIONAL_RENDERING_READ_BIT is set.

v8: suggested by Lionel
- Precompute conditional predicate's result to support secondary command buffers.
- Make prepare_for_draw_count_predicate more readable.

Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
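For reference, predicate emission on gen follows a well-worn pattern (a sketch: the load helpers and the address of the cached value are assumptions; the MI_PREDICATE fields are genxml names):

    /* SRC0 = cached condition, SRC1 = 0; LOADINV inverts the comparison so
     * rendering proceeds exactly when the condition is non-zero */
    emit_lrm(&cmd_buffer->batch, MI_PREDICATE_SRC0, value_addr); /* assumed helper */
    emit_lri(&cmd_buffer->batch, MI_PREDICATE_SRC0 + 4, 0);
    emit_lri(&cmd_buffer->batch, MI_PREDICATE_SRC1,     0);
    emit_lri(&cmd_buffer->batch, MI_PREDICATE_SRC1 + 4, 0);

    anv_batch_emit(&cmd_buffer->batch, GENX(MI_PREDICATE), mip) {
       mip.LoadOperation    = LOAD_LOADINV;
       mip.CombineOperation = COMBINE_SET;
       mip.CompareOperation = COMPARE_SRCS_EQUAL;
    }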
-
v2: by Jason Ekstrand
- Move population of the registers that don't change out of the draw loop.
- Remove dependency on ALU registers.
- Clarify usage of PIPE_CONTROL.
- Without usage of ALU registers the patch works for gen7+.

v3: set pending_pipe_bits |= ANV_PIPE_RENDER_TARGET_WRITES

Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
-
Dylan Baker authored
Native file support in command line serialization isn't present in meson 0.49, but will be in 0.49.1 and 0.50. Reviewed-by: Eric Engestrom <eric.engestrom@intel.com>
-
Faith Ekstrand authored
I like to keep things in good order so that you can find them. Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
-
Faith Ekstrand authored
In some shaders, you can end up with a stride in the source of a SHADER_OPCODE_MULH. One way this can happen is if the MULH is acting on the top bits of a 64-bit value due to 64-bit integer lowering. In this case, the compiler will produce something like this:

    mul(8)   acc0<1>UD  g5<8,4,2>UD  0x0004UW      { align1 1Q };
    mach(8)  g6<1>UD    g5<8,4,2>UD  0x00000004UD  { align1 1Q AccWrEnable };

The new region fixup pass looks at the MUL and sees a strided source and unstrided destination and determines that the sequence is illegal. It then attempts to fix the illegal stride by replacing the destination of the MUL with a temporary and emitting a MOV into the accumulator:

    mul(8)   g9<2>UD    g5<8,4,2>UD  0x0004UW      { align1 1Q };
    mov(8)   acc0<1>UD  g9<8,4,2>UD                { align1 1Q };
    mach(8)  g6<1>UD    g5<8,4,2>UD  0x00000004UD  { align1 1Q AccWrEnable };

Unfortunately, this new sequence isn't correct because MOV accesses the accumulator with a different precision to MUL and, instead of filling the bottom 32 bits with the source and zeroing the top 32 bits, it leaves the top 32 (or maybe 31) bits alone and full of garbage. When the MACH comes along and tries to complete the multiplication, the result is correct in the bottom 32 bits (which we throw away) and garbage in the top 32 bits which are actually returned by MACH.

This commit does two things: First, it adds an assert to ensure that we don't try to rewrite accumulator destinations of MUL instructions so we can avoid this precision issue. Second, it modifies required_dst_byte_stride to require a tightly packed stride so that we fix up the sources instead and the actual code which gets emitted is this:

    mov(8)   g9<1>UD    g5<8,4,2>UD                { align1 1Q };
    mul(8)   acc0<1>UD  g9<8,8,1>UD  0x0004UW      { align1 1Q };
    mach(8)  g6<1>UD    g5<8,4,2>UD  0x00000004UD  { align1 1Q AccWrEnable };

Fixes: efa4e4bc "intel/fs: Introduce regioning lowering pass"
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
-
Faith Ekstrand authored
For a long time, we based exec sizes on destination register widths. We've not been doing that since 1ca3a944 but a few remnants accidentally remained. Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>
-
Samuel Pitoiset authored
Totally useless to write the descriptors inside the loop. Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com> Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
-
Samuel Pitoiset authored
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com> Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
-
Samuel Pitoiset authored
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com> Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
-
Samuel Pitoiset authored
The driver only supports up to 8 sample locations. Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com> Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
-
Karol Herbst authored
fixes deqp tests:
dEQP-GLES3.functional.shaders.texture_functions.texturegrad.samplercube_fixed_vertex
dEQP-GLES3.functional.shaders.texture_functions.texturegrad.samplercube_float_vertex
dEQP-GLES3.functional.shaders.texture_functions.texturegrad.isamplercube_vertex
dEQP-GLES3.functional.shaders.texture_functions.texturegrad.usamplercube_vertex
dEQP-GLES3.functional.shaders.texture_functions.texturegrad.sampler3d_fixed_vertex
dEQP-GLES3.functional.shaders.texture_functions.texturegrad.sampler3d_float_vertex
dEQP-GLES3.functional.shaders.texture_functions.texturegrad.isampler3d_vertex
dEQP-GLES3.functional.shaders.texture_functions.texturegrad.usampler3d_vertex
dEQP-GLES3.functional.shaders.texture_functions.texturegrad.sampler2dshadow_vertex
dEQP-GLES3.functional.shaders.texture_functions.textureprojgrad.sampler3d_fixed_vertex
dEQP-GLES3.functional.shaders.texture_functions.textureprojgrad.sampler3d_float_vertex
dEQP-GLES3.functional.shaders.texture_functions.textureprojgrad.isampler3d_vertex
dEQP-GLES3.functional.shaders.texture_functions.textureprojgrad.usampler3d_vertex
dEQP-GLES3.functional.shaders.texture_functions.textureprojgrad.sampler2dshadow_vertex

Fixes: f821e802 "gm107/ir: use scalar tex instructions where possible"
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>
-
Karol Herbst authored
fixes dEQP-GLES2.functional.shaders.invariance.mediump.loop_3 CC: <mesa-stable@lists.freedesktop.org> Signed-off-by: Karol Herbst <kherbst@redhat.com> Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>
-
- Jan 17, 2019
-
-
Bas Nieuwenhuizen authored
Otherwise writes get propagated across atomics if no barrier is used. Even without a barrier, writes should still be visible within the same invocation, so an atomic has to be considered a write. CC: <mesa-stable@lists.freedesktop.org> Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com> Fixes: b3c61469 "nir: Copy propagation between blocks" Fixes: 62332d13 "nir: Add a local variable-based copy propagation pass"
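The hazard in miniature, transcribed into C11 atomics (the analogous reasoning the pass must apply to NIR atomic intrinsics):

    #include <stdatomic.h>

    _Atomic unsigned x;

    unsigned f(void)
    {
       x = 1;                   /* the store copy-prop wants to forward */
       atomic_fetch_add(&x, 1); /* the atomic is ALSO a write to x      */
       return x;                /* must reload: yields 2, not 1         */
    }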
-
Rafael Antognolli authored
Add a test that checks that we can use the extra space allocated for padding while allocating larger anv_states. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
-
Rafael Antognolli authored
If softpin is supported, create new BOs for the required size and add the respective BO maps. The other main change of this commit is that anv_block_pool_map() now returns the map for the BO that the given offset is part of, so there's no block_pool->map access anymore (when softpin is used).

v3:
- set fd to -1 on softpin case (Jason)

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
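A minimal sketch of the per-BO lookup (the types and table layout are stand-ins; only the function's name and role come from the commit message):

    #include <stdint.h>
    #include <stddef.h>

    /* stand-in types for the sketch; the real pool layout differs */
    struct bo_map { int32_t start; int32_t size; char *cpu_map; };
    struct block_pool { uint32_t nbos; struct bo_map maps[16]; };

    static void *
    block_pool_map(struct block_pool *pool, int32_t offset)
    {
       for (uint32_t i = 0; i < pool->nbos; i++) {
          struct bo_map *m = &pool->maps[i];
          if (offset >= m->start && offset < m->start + m->size)
             return m->cpu_map + (offset - m->start);
       }
       return NULL; /* offset not inside any BO of the pool */
    }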
-
Rafael Antognolli authored
We have all the state buffers snooped, so we don't need to clflush everything anymore. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
-
Rafael Antognolli authored
We are not going to use userptr for anv block pool BOs anymore. However, so far we have been relying on the fact that userptr BOs are snooped on non-llc platforms. Let's make sure that the block pool BOs are still snooped, and we can also remove the clflush'ing that we do on all state buffers.

And since we plan to remove the flushes, set the anv_bo_pool BOs to cached (snooped on non-LLC platforms) too. For LLC platforms, they are all cached by default, so this becomes a no-op.

v5:
- Add snooping to anv_bo_pool BOs too (Jason).
- Remove anv_gem_set_domain.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
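Snooping is requested through the kernel's I915_GEM_SET_CACHING ioctl; a self-contained sketch of that call (the helper name is made up; the ioctl, struct, and enum are the real i915 uAPI):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    /* make a GEM BO snooped, i.e. cache-coherent with the CPU even on
     * non-LLC platforms, so no clflush is needed after CPU writes */
    static int gem_set_caching_cached(int fd, uint32_t gem_handle)
    {
       struct drm_i915_gem_caching arg = {
          .handle  = gem_handle,
          .caching = I915_CACHING_CACHED,
       };
       return ioctl(fd, DRM_IOCTL_I915_GEM_SET_CACHING, &arg);
    }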
-
Rafael Antognolli authored
It's possible that we still have some space left in the block pool, but we try to allocate a state larger than that remaining space. This means such a state would start somewhere within the range of the old block_pool, and end after that range, within the range of the new size. That's fine when we use userptr, since the memory in the block pool is CPU mapped continuously. However, by the end of this series, we will have the block_pool split into different BOs, with different CPU mapping ranges that are not necessarily continuous. So we must avoid the case of a given state being part of two different BOs in the block pool.

This commit solves the issue by detecting that we are growing the block_pool even though we are not at the end of the range. If that happens, we don't use the space left at the end of the old size, and consider it as "padding" that can't be used in the allocation. We update the size requested from the block pool to take the padding into account, and return the offset after the padding, which happens to be at the start of the new address range.

Additionally, we return the amount of padding we used, so the caller knows that this happened and can return that padding back into a list of free states, to be reused later. This way we hopefully don't waste any space, but also avoid having a state split between two different BOs.

v3:
- Calculate offset + padding at anv_block_pool_alloc_new (Jason).

v4:
- Remove extra "leftover".

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
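A worked example with made-up numbers: say the pool currently ends at 8192, the bump pointer sits at 7680, and a 1024-byte state is requested.

    /* all numbers illustrative, not from the driver */
    uint32_t pool_size = 8192, next = 7680, alloc_size = 1024;

    uint32_t leftover = pool_size - next;     /* 512: would straddle BOs */
    uint32_t padding  = leftover;             /* give it up, don't split */
    uint32_t request  = alloc_size + padding; /* grow by 1536, not 1024  */
    uint32_t offset   = next + padding;       /* 8192: start of new BO   */
    /* 'padding' goes back to the caller, which pushes those 512 bytes
     * onto a free list instead of leaking them */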
-
Rafael Antognolli authored
This commit tries to rework the code that splits and returns chunks back to the state pool, while still keeping the same logic.

The original code would get a chunk larger than we need and split it into pool->block_size. Then it would return all but the first one, and would split that first one into alloc_size chunks. Then it would keep the first one (for the allocation), and return the others back to the pool.

The new anv_state_pool_return_chunk() function will take a chunk (with the alloc_size part removed), and a small_size hint. It then splits that chunk into pool->block_size'd chunks, and if there's some space still left, splits that into small_size chunks. small_size in this case is the same size as alloc_size.

The idea is to keep the same logic, but make it in a way we can reuse it to return other chunks to the pool when we are growing the buffer.

v2:
- Include Jason's suggestions to the algorithm that returns chunks.
- Update comments.

v3:
- Disallow returning 0 blocks (Jason).
- fix min_size in the loop (Jason).
- remove temporary variables (Jason)

v4:
- return_chunk() should never return blocks larger than pool->block_size.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
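A hedged sketch of that splitting order (the pool struct is a stand-in and push_free() stands in for the pool's real free-list insert):

    #include <stdint.h>

    struct state_pool { uint32_t block_size; };
    /* hypothetical: push a free chunk of 'sz' bytes at 'off' onto the
     * free list that serves that size class */
    static void push_free(struct state_pool *p, uint32_t off, uint32_t sz);

    static void
    return_chunk(struct state_pool *pool, uint32_t offset, uint32_t size,
                 uint32_t small_size)
    {
       /* first carve off as many full block_size'd chunks as fit */
       uint32_t nblocks = size / pool->block_size;
       for (uint32_t i = 0; i < nblocks; i++)
          push_free(pool, offset + i * pool->block_size, pool->block_size);

       /* then split the remainder into small_size (== alloc_size) chunks */
       uint32_t rest_off = offset + nblocks * pool->block_size;
       uint32_t rest     = size - nblocks * pool->block_size;
       for (uint32_t o = 0; small_size && o + small_size <= rest; o += small_size)
          push_free(pool, rest_off + o, small_size);
    }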
-
Rafael Antognolli authored
They won't be true anymore once we add support for multiple BOs with non-userptr. Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
-
Rafael Antognolli authored
We now have multiple BOs in the block pool, but sometimes we still reference only the first one in some instructions, and use relative offsets in others. So we must be sure to add all the BOs from the block pool to the validation list when submitting commands.

v2:
- Don't add block pool BOs to the dependency list right before execbuf (Jason)
- Call anv_execbuf_add_bo() for each BO in the block pools (Jason)
- Use anv_execbuf_add_bo_set() to add surface state dependencies to execbuf.

v3:
- Add comment to the non-softpin case (Jason).

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
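The shape of that loop, as a sketch (the pool iteration fields and the exact anv_execbuf_add_bo() signature are assumptions from context):

    /* every BO backing the block pool must be on the validation list,
     * since instructions may point into any of them */
    for (uint32_t i = 0; i < pool->nbos; i++) {
       result = anv_execbuf_add_bo(execbuf, pool->bos[i],
                                   NULL /* relocs */, 0 /* flags */, alloc);
       if (result != VK_SUCCESS)
          return result;
    }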
-
Rafael Antognolli authored
This part of the anv_execbuf_add_bo() code is totally independent of the BO being added. Let's split it out, so we can reuse it later. v3: rename to anv_execbuf_add_bo_set (Jason). Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
-