Commits · vectorize_io_v2 · Timothy Arceri / mesa

Feb 16, 2019

i965/anv: use nir_opt_vectorize_io() · 02f1fe20

Timothy Arceri authored Oct 24, 2018

Commit 8d822246 caused substantially more URB messages in
geometry and tessellation shaders (due to enabling
nir_lower_io_to_scalar_early). This combines io again to avoid
this regression while still allowing link time optimisation of
components.

Shader-db results (SKL):

total instructions in shared programs: 13109035 -> 13107191 (-0.01%)
instructions in affected programs: 66278 -> 64434 (-2.78%)
helped: 242
HURT: 13

total cycles in shared programs: 332090418 -> 332094364 (<.01%)
cycles in affected programs: 285477 -> 289423 (1.38%)
helped: 39
HURT: 215

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107510

02f1fe20

nir: add nir_opt_vectorize_io() · ef96949d

Timothy Arceri authored Oct 23, 2018

Once linking opts are done this pass recombines varying components.
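
To illustrate what "recombining varying components" means, here is a minimal sketch. It does not use NIR; the types and the function `toy_recombine_stores` are hypothetical and exist only for this example. It assumes each scalar store names a location and a component, and groups stores to the same location into one vector store with a write mask.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy types; these are not NIR data structures. */
struct toy_scalar_store {
    unsigned location;   /* varying slot */
    unsigned component;  /* 0..3 (x, y, z, w) */
    float value;
};

struct toy_vector_store {
    unsigned location;
    unsigned write_mask; /* bit i set => component i is written */
    float values[4];
};

/* Recombine scalar stores that target the same location into one vector
 * store per location. 'out' must have room for 'count' entries.
 * Returns the number of vector stores produced. */
static size_t toy_recombine_stores(const struct toy_scalar_store *in,
                                   size_t count,
                                   struct toy_vector_store *out)
{
    size_t n_out = 0;

    for (size_t i = 0; i < count; i++) {
        size_t j;

        assert(in[i].component < 4);

        /* Look for a vector store already created for this location. */
        for (j = 0; j < n_out; j++) {
            if (out[j].location == in[i].location)
                break;
        }

        if (j == n_out) {
            out[j].location = in[i].location;
            out[j].write_mask = 0;
            for (unsigned c = 0; c < 4; c++)
                out[j].values[c] = 0.0f;
            n_out++;
        }

        out[j].write_mask |= 1u << in[i].component;
        out[j].values[in[i].component] = in[i].value;
    }

    return n_out;
}
```

In this toy model, three scalar stores to components x, y and z of one location collapse into a single store with write mask 0x7.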

This patch is loosely based on Connor's vectorize alu pass.

V2: skip fragment shaders

V3:
- don't accidentally vectorise local vars
- pass correct component to create_new_store()

ef96949d

nir: add glsl_replace_vector_type() · 2195a673

Timothy Arceri authored Oct 23, 2018

This creates a new glsl_type with the specified number of components.

We will use this in the following patch when vectorising io.

2195a673

Feb 15, 2019

swr/rast: Add translation support to streamout · f695e433

Alok Hota authored Sep 14, 2018

Reviewed-by: Bruce Cherniak <bruce.cherniak@intel.com>

f695e433

swr/rast: simdlib cleanup, clipper stack space fixes · a7fa0cc0

Alok Hota authored Sep 13, 2018



Reduce stack space used by clipper, which had led to crashes in some
versions of MSVC

Reviewed-by: Bruce Cherniak <bruce.cherniak@intel.com>

a7fa0cc0

swr/rast: convert DWORD->uint32_t, QWORD->uint64_t · f9c29a30

Alok Hota authored Sep 12, 2018

Reviewed-by: Bruce Cherniak <bruce.cherniak@intel.com>

f9c29a30

swr/rast: Refactor scratch space variable names · c503b588

Alok Hota authored Sep 11, 2018

Reviewed-by: Bruce Cherniak <bruce.cherniak@intel.com>

c503b588

swr/rast: FP consistency between POSH/RENDER pipes · 0b4db437

Alok Hota authored Aug 28, 2018



- Ensure all threads have optimal floating-point control state
- Disable auto-generation of fused FP ops for VERTEX shader stage
- Disable "fast" FP ops for VERTEX shader stage

Reviewed-by: Bruce Cherniak <bruce.cherniak@intel.com>

0b4db437

swr/rast: Move knob defaults to generated cpp file · dc7b3c95

Alok Hota authored Aug 23, 2018



Reduces amount of compile churn when testing different default values

Reviewed-by: Bruce Cherniak <bruce.cherniak@intel.com>

dc7b3c95

swr/rast: Flip BitScanReverse index calculation · 05e4ff33

Alok Hota authored Aug 14, 2018

The intrinsic returns the number of leading zeros, not the bit number of
the first nonzero, so just flip it based on the mask size
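
The flip can be sketched as follows. This is a minimal illustration, assuming a 32-bit mask and using the GCC/Clang builtin `__builtin_clz` as a stand-in for the leading-zero intrinsic; the helper name `highest_set_bit_index` is hypothetical and not taken from the SWR sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bit index of the highest set bit of a 32-bit mask,
 * derived from a leading-zero count. */
static inline uint32_t highest_set_bit_index(uint32_t mask)
{
    assert(mask != 0); /* __builtin_clz(0) is undefined */

    uint32_t leading_zeros = (uint32_t)__builtin_clz(mask);

    /* Flip based on the mask size: for 32 bits, index = 31 - lzcnt. */
    return 31u - leading_zeros;
}
```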

Reviewed-by: Bruce Cherniak <bruce.cherniak@intel.com>

05e4ff33

swr/rast: Correctly align 64-byte spills/fills · ae400a9b

Alok Hota authored Aug 13, 2018



Fixes crashes on some compute shaders when running on AVX512

Reviewed-by: Bruce Cherniak <bruce.cherniak@intel.com>

ae400a9b

swr/rast: Disable use of __forceinline by default · 78bab664

Alok Hota authored Aug 02, 2018



- Was not useful to inline in release builds
- FORCEINLINE can be used if absolutely necessary

Reviewed-by: Bruce Cherniak <bruce.cherniak@intel.com>

78bab664

swr/rast: Convert system memory pointers to gfxptr_t · 20d5c887

Alok Hota authored Jul 19, 2018

Fulfills an unused internal interface

Reviewed-by: Bruce Cherniak <bruce.cherniak@intel.com>

20d5c887

radv: Use correct num formats to detect whether we should be use 1.0 or 1. · 4b03a19a

Bas Nieuwenhuizen authored Feb 15, 2019

normalized and scaled formats also return floats.

Fixes: 4b3549c0 ("radv: reduce the number of loaded channels for vertex input fetches")
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>

4b03a19a

nir/algebraic: Simplify comparison with sequential integers starting with 0 · 979b43b3

Ian Romanick authored Feb 11, 2019



All of the affected shaders are Unreal4 demos.

All Gen6+ platforms had similar results. (Skylake shown)
total instructions in shared programs: 15437170 -> 15437001 (<.01%)
instructions in affected programs: 21536 -> 21367 (-0.78%)
helped: 43
HURT: 0
helped stats (abs) min: 1 max: 4 x̄: 3.93 x̃: 4
helped stats (rel) min: 0.68% max: 1.01% x̄: 0.80% x̃: 0.80%
95% mean confidence interval for instructions value: -4.07 -3.79
95% mean confidence interval for instructions %-change: -0.83% -0.77%
Instructions are helped.

total cycles in shared programs: 383007896 -> 383007378 (<.01%)
cycles in affected programs: 158640 -> 158122 (-0.33%)
helped: 38
HURT: 4
helped stats (abs) min: 1 max: 48 x̄: 13.89 x̃: 6
helped stats (rel) min: 0.03% max: 1.01% x̄: 0.33% x̃: 0.19%
HURT stats (abs)   min: 2 max: 3 x̄: 2.50 x̃: 2
HURT stats (rel)   min: 0.06% max: 0.09% x̄: 0.08% x̃: 0.08%
95% mean confidence interval for cycles value: -16.90 -7.77
95% mean confidence interval for cycles %-change: -0.39% -0.19%
Cycles are helped.

Iron Lake and GM45 had similar results. (Iron Lake shown)
total instructions in shared programs: 8213746 -> 8213745 (<.01%)
instructions in affected programs: 127 -> 126 (-0.79%)
helped: 1
HURT: 0

total cycles in shared programs: 187734146 -> 187734144 (<.01%)
cycles in affected programs: 2132 -> 2130 (-0.09%)
helped: 1
HURT: 0

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>

979b43b3

nir/algebraic: Convert some f2u to f2i · ad059202

Ian Romanick authored Feb 12, 2019



Section 5.4.1 (Conversion and Scalar Constructors) of the GLSL 4.60 spec
says:

     It is undefined to convert a negative floating-point value to an
     uint.

Assuming that (uint)some_float behaves like (uint)(int)some_float allows
some optimizations in the i965 backend to proceed.
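
As a rough C analogue of this assumption (not the NIR rule itself), the two conversions below agree for every input on which both are defined in C, that is, non-negative values below 2^31. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Direct float -> unsigned conversion, like (uint)some_float. */
static inline uint32_t f2u_direct(float f)
{
    /* Restrict to the range where both conversions are defined in C. */
    assert(f >= 0.0f && f < 2147483648.0f);
    return (uint32_t)f;
}

/* Conversion through a signed integer, like (uint)(int)some_float. */
static inline uint32_t f2u_via_int(float f)
{
    assert(f >= 0.0f && f < 2147483648.0f);
    return (uint32_t)(int32_t)f;
}
```

Negative inputs are excluded here because, as in GLSL, converting them to an unsigned type is undefined, which is what leaves room for the substitution.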

This basically undoes the small amount of damage done by
"intel/compiler: Avoid propagating inequality cmods if types are
different".

v2: Replicate part of the commit message as a comment in the code.
Suggested by Jason.

shader-db results comparing *before* "intel/compiler: Avoid propagating
inequality cmods if types are different" and after this commit:

Skylake
total cycles in shared programs: 383007996 -> 383007896 (<.01%)
cycles in affected programs: 85208 -> 85108 (-0.12%)
helped: 13
HURT: 8
helped stats (abs) min: 2 max: 26 x̄: 10.77 x̃: 6
helped stats (rel) min: 0.09% max: 0.65% x̄: 0.28% x̃: 0.14%
HURT stats (abs)   min: 2 max: 12 x̄: 5.00 x̃: 3
HURT stats (rel)   min: 0.04% max: 0.32% x̄: 0.12% x̃: 0.07%
95% mean confidence interval for cycles value: -9.31 -0.21
95% mean confidence interval for cycles %-change: -0.24% <.01%
Cycles are helped.

Broadwell
total cycles in shared programs: 415251194 -> 415251370 (<.01%)
cycles in affected programs: 83750 -> 83926 (0.21%)
helped: 7
HURT: 13
helped stats (abs) min: 10 max: 12 x̄: 11.43 x̃: 12
helped stats (rel) min: 0.30% max: 0.30% x̄: 0.30% x̃: 0.30%
HURT stats (abs)   min: 2 max: 36 x̄: 19.69 x̃: 22
HURT stats (rel)   min: 0.05% max: 0.89% x̄: 0.44% x̃: 0.47%
95% mean confidence interval for cycles value: 0.76 16.84
95% mean confidence interval for cycles %-change: <.01% 0.37%
Inconclusive result (%-change mean confidence interval includes 0).

Haswell
total instructions in shared programs: 13823885 -> 13823886 (<.01%)
instructions in affected programs: 2249 -> 2250 (0.04%)
helped: 0
HURT: 1

total cycles in shared programs: 390094243 -> 390094001 (<.01%)
cycles in affected programs: 85640 -> 85398 (-0.28%)
helped: 15
HURT: 6
helped stats (abs) min: 4 max: 26 x̄: 18.53 x̃: 18
helped stats (rel) min: 0.09% max: 0.66% x̄: 0.47% x̃: 0.42%
HURT stats (abs)   min: 2 max: 14 x̄: 6.00 x̃: 2
HURT stats (rel)   min: 0.04% max: 0.37% x̄: 0.15% x̃: 0.04%
95% mean confidence interval for cycles value: -17.36 -5.69
95% mean confidence interval for cycles %-change: -0.44% -0.14%
Cycles are helped.

Ivy Bridge
total cycles in shared programs: 180986448 -> 180986552 (<.01%)
cycles in affected programs: 34835 -> 34939 (0.30%)
helped: 0
HURT: 10
HURT stats (abs)   min: 2 max: 18 x̄: 10.40 x̃: 10
HURT stats (rel)   min: 0.06% max: 0.36% x̄: 0.28% x̃: 0.30%
95% mean confidence interval for cycles value: 4.67 16.13
95% mean confidence interval for cycles %-change: 0.20% 0.35%
Cycles are HURT.

Sandy Bridge
total cycles in shared programs: 154603969 -> 154603970 (<.01%)
cycles in affected programs: 171514 -> 171515 (<.01%)
helped: 25
HURT: 14
helped stats (abs) min: 1 max: 4 x̄: 1.80 x̃: 1
helped stats (rel) min: 0.02% max: 0.10% x̄: 0.04% x̃: 0.04%
HURT stats (abs)   min: 1 max: 8 x̄: 3.29 x̃: 3
HURT stats (rel)   min: 0.03% max: 0.28% x̄: 0.10% x̃: 0.11%
95% mean confidence interval for cycles value: -0.91 0.96
95% mean confidence interval for cycles %-change: -0.02% 0.04%
Inconclusive result (value mean confidence interval includes 0).

No changes on Iron Lake or GM45.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>

ad059202

intel/compiler/test: Add unit test for mismatched signedness comparison · ac21dd4a

Matt Turner authored Feb 11, 2019

v2 (idr): Move adding the test to after adding the fix.  Reordering the
two commits prevents possible headaches for git-bisect with scripts that
always do 'ninja check'.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109404


Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>

ac21dd4a

intel/compiler: Avoid propagating inequality cmods if types are different · 2dff9a66

Matt Turner authored Feb 11, 2019

v2: Fix silly bug in logic.  s/||/&&/

All but one of the affected shaders is in an Unreal4 demo.  The other is
in Tomb Raider.  All of the cases that Ian investigated appear to be
sequences like the following

    if (int(uint(some_float)) < 0) /* other relations too */
        ...

At least in Tomb Raider, it's not obvious that this sequence came from
the original shader.

In some of the Unreal demos, the shader contains code like

    if (int(uint(textureLod(...))) > 0)
        ...

which explicitly generates the offending sequence.
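
Why the type matters for inequality comparisons can be shown with a small C sketch: the same 32 bits compare differently depending on whether they are read as signed or unsigned. The helper names are hypothetical, and this only models the comparison, not the i965 cmod propagation pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret 32 bits as a signed value, as int(uint(x)) does in GLSL. */
static inline int32_t bits_as_signed(uint32_t bits)
{
    int32_t s;
    memcpy(&s, &bits, sizeof s);
    return s;
}

/* "a < b" evaluated with a signed interpretation of the bits. */
static inline bool less_than_signed(uint32_t a, uint32_t b)
{
    return bits_as_signed(a) < bits_as_signed(b);
}

/* "a < b" evaluated with an unsigned interpretation of the same bits. */
static inline bool less_than_unsigned(uint32_t a, uint32_t b)
{
    return a < b;
}
```

With the bit pattern 0x80000000, the signed comparison against zero is true while the unsigned one can never be, so a condition computed for one type cannot simply be reused for the other.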

All Gen6+ platforms had similar results (Skylake shown):
total instructions in shared programs: 15437170 -> 15437187 (<.01%)
instructions in affected programs: 4492 -> 4509 (0.38%)
helped: 0
HURT: 17
HURT stats (abs)   min: 1 max: 1 x̄: 1.00 x̃: 1
HURT stats (rel)   min: 0.05% max: 0.73% x̄: 0.66% x̃: 0.73%
95% mean confidence interval for instructions value: 1.00 1.00
95% mean confidence interval for instructions %-change: 0.57% 0.75%
Instructions are HURT.

total cycles in shared programs: 383007996 -> 383007992 (<.01%)
cycles in affected programs: 20542 -> 20538 (-0.02%)
helped: 6
HURT: 7
helped stats (abs) min: 2 max: 6 x̄: 5.33 x̃: 6
helped stats (rel) min: 0.11% max: 0.36% x̄: 0.32% x̃: 0.36%
HURT stats (abs)   min: 4 max: 4 x̄: 4.00 x̃: 4
HURT stats (rel)   min: 0.27% max: 0.27% x̄: 0.27% x̃: 0.27%
95% mean confidence interval for cycles value: -3.30 2.69
95% mean confidence interval for cycles %-change: -0.19% 0.19%
Inconclusive result (value mean confidence interval includes 0).

No changes on Iron Lake or GM45.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109404


Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Tested-by:  <nagrigoriadis@gmail.com>
Tested-by: Danylo Piliaiev <danylo.piliaiev@gmail.com>

2dff9a66

intel/compiler/test: Set devinfo->gen = 7 · e50db60d

Matt Turner authored Feb 11, 2019



We emit an FBL instruction, which only exists on Gen7 and later. This prevents
the test from segfaulting when run with TEST_DEBUG=1.

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>

e50db60d

gallium/auxiliary/vl: Add video compositor compute shader render · 9364d66c

James Zhu authored Feb 01, 2019 and

Leo Liu committed Feb 15, 2019

Add compute shader initialization, assign and cleanup in vl_compositor API.
Set video compositor compute shader render as default when the pipe supports it.

Signed-off-by: James Zhu <James.Zhu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>

9364d66c

gallium/auxiliary/vl: Add compute shader to support video compositor render · f6ac0b5d

James Zhu authored Feb 01, 2019 and

Leo Liu committed Feb 15, 2019



Add compute shader to support video compositor render.

Signed-off-by: James Zhu <James.Zhu@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>

f6ac0b5d

gallium/auxiliary/vl: Rename csc_matrix and increase its size. · 299e2bc0

James Zhu authored Feb 01, 2019 and

Leo Liu committed Feb 15, 2019



Rename csc_matrix to shader_params, and increase shader_params size
to store more constants for compute shader.

Signed-off-by: James Zhu <James.Zhu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>

299e2bc0

gallium/auxiliary/vl: Split vl_compositor graphic shaders from vl_compositor API · 7b7b5f20

James Zhu authored Feb 05, 2019 and

Leo Liu committed Feb 15, 2019



Split vl_compositor graphic shaders from vl_compositor API in order to share
vl_compositor API with vl_compositor compute shader later.

Signed-off-by: James Zhu <James.Zhu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>

7b7b5f20

gallium/auxiliary/vl: Move dirty define to header file · b34d7c5d

James Zhu authored Feb 01, 2019 and

Leo Liu committed Feb 15, 2019



Move dirty define to header file to share with compute shader.

Signed-off-by: James Zhu <James.Zhu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>

b34d7c5d

nir: remove jump from two merging jump-ending blocks · 1fb24080

Juan A. Suárez authored Feb 12, 2019



In the opt_peel_initial_if optimization, when moving the continue list to
the end of the continue block, before the jump, it can happen that the
continue list itself also ends with a jump.

This would mean that we would have two jump instructions in a row: the
first one from the continue list and the second one from the continue
block.

As inserting an instruction after a jump is not allowed (and it does not
make sense, as it will not be executed), remove the jump from the
continue block and keep the one from continue list, as it will be
executed first.
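
The rule described above can be sketched with a toy model. The `peel_*` types and functions are hypothetical and do not correspond to NIR structures; each operation only records whether it is a jump and where it came from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PEEL_MAX_OPS 32

struct peel_op {
    bool is_jump;  /* true for a jump such as "continue" */
    char origin;   /* 'B' = continue block, 'L' = continue list */
};

struct peel_seq {
    struct peel_op ops[PEEL_MAX_OPS];
    size_t count;
};

static bool peel_ends_in_jump(const struct peel_seq *s)
{
    return s->count > 0 && s->ops[s->count - 1].is_jump;
}

/* Move 'list' to the end of 'block', before the block's trailing jump.
 * If the list already ends in a jump, the block's own jump is dropped so
 * that no instruction ever follows a jump. */
static void peel_move_list_into_block(struct peel_seq *block,
                                      const struct peel_seq *list)
{
    struct peel_op block_jump = {0};
    bool had_jump = peel_ends_in_jump(block);

    assert(block->count + list->count <= PEEL_MAX_OPS);

    if (had_jump)
        block_jump = block->ops[--block->count];

    for (size_t i = 0; i < list->count; i++)
        block->ops[block->count++] = list->ops[i];

    /* Keep the block's jump only when the list did not bring its own. */
    if (had_jump && !peel_ends_in_jump(list))
        block->ops[block->count++] = block_jump;
}
```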

CC: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>

1fb24080

nir: move ALU instruction before the jump instruction · 69be9934

Juan A. Suárez authored Feb 12, 2019

opt_split_alu_of_phi moves the ALU instruction to the end of the continue block.

But if the continue block ends with a jump instruction (an explicit
"continue" instruction) then the ALU must be inserted before the jump,
as it is illegal to add instructions after the jump.
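
A minimal sketch of the insertion rule, using a hypothetical toy block (`toy_block`, `toy_block_add_alu`) rather than NIR's real block and cursor types: the new ALU instruction goes at the end of the block unless the last instruction is a jump, in which case it goes just before it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum toy_instr_type { TOY_ALU, TOY_JUMP };

#define TOY_BLOCK_MAX 16

struct toy_block {
    enum toy_instr_type instrs[TOY_BLOCK_MAX];
    size_t count;
};

/* Add an ALU instruction at the end of the block, or before the trailing
 * jump if the block ends with one. Returns the index it was placed at. */
static size_t toy_block_add_alu(struct toy_block *b)
{
    size_t pos = b->count;

    assert(b->count < TOY_BLOCK_MAX);

    if (b->count > 0 && b->instrs[b->count - 1] == TOY_JUMP)
        pos = b->count - 1;

    /* Shift the trailing jump (if any) one slot to the right. */
    memmove(&b->instrs[pos + 1], &b->instrs[pos],
            (b->count - pos) * sizeof b->instrs[0]);
    b->instrs[pos] = TOY_ALU;
    b->count++;

    return pos;
}
```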

CC: Ian Romanick <ian.d.romanick@intel.com>
Fixes: 0881e90c ("nir: Split ALU instructions in loops that read phis")
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>

69be9934

mesa: INVALID_VALUE for wrong type or format in Clear*Buffer*Data · a43596df

Andres Gomez authored Feb 12, 2019



Instead of generating a GL_INVALID_ENUM error when the type or format
is incorrect while using glClear{Named}Buffer{Sub}Data, generate
GL_INVALID_VALUE.

From page 72 (page 94 of the PDF) of the OpenGL 4.6 spec:

  " An INVALID_VALUE error is generated if type is not one of the
    types in table 8.2.

    An INVALID_VALUE error is generated if format is not one of the
    formats in table 8.3."
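
The error selection can be sketched as below. This is not Mesa's validation code: the function names are hypothetical, and the accepted formats and types are a small illustrative subset, not the full contents of tables 8.2 and 8.3. The enum values are the standard OpenGL ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int GLenum;

/* Standard OpenGL enum values. */
#define GL_NO_ERROR       0
#define GL_INVALID_ENUM   0x0500 /* what was generated before this change */
#define GL_INVALID_VALUE  0x0501
#define GL_UNSIGNED_BYTE  0x1401
#define GL_FLOAT          0x1406
#define GL_RED            0x1903
#define GL_RGBA           0x1908

static bool toy_enum_in_list(GLenum value, const GLenum *list, size_t n)
{
    assert(list != NULL);

    for (size_t i = 0; i < n; i++) {
        if (list[i] == value)
            return true;
    }
    return false;
}

/* Hypothetical check for glClear{Named}Buffer{Sub}Data arguments. */
static GLenum toy_clear_buffer_data_error(GLenum format, GLenum type)
{
    /* Illustrative subsets only; the spec tables list many more entries. */
    static const GLenum types[] = { GL_UNSIGNED_BYTE, GL_FLOAT };
    static const GLenum formats[] = { GL_RED, GL_RGBA };

    if (!toy_enum_in_list(type, types, sizeof types / sizeof types[0]))
        return GL_INVALID_VALUE;

    if (!toy_enum_in_list(format, formats, sizeof formats / sizeof formats[0]))
        return GL_INVALID_VALUE;

    return GL_NO_ERROR;
}
```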

Fixes the following test:
KHR-GL45.direct_state_access.buffers_errors

v2: correct the doxygen documentation.

Cc: Pi Tabred <servuswiegehtz@yahoo.de>
Cc: Brian Paul <brianp@vmware.com>
Signed-off-by: Andres Gomez <agomez@igalia.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>

a43596df

virgl: use virgl_transfer_inline_write even less · 67426ccd

Gurchetan Singh authored Feb 06, 2019 and

Gert Wollny committed Feb 15, 2019



We've noticed the Team Fortress 2 engine seems to do many small
calls to glBufferSubData(..). Let's pick our heuristic based on the
resource base width, not the size of a particular upload.
This will cause transfers to be batched together in the transfer
queue.

Relevant glbench microbenchmark --

Before: buffer_upload_dynamic_element_array_131072 = 131.17 mbytes_sec
After: buffer_upload_dynamic_element_array_131072 = 6828.24 mbytes_sec
Reviewed-by: Gert Wollny <gert.wollny@collabora.com>

67426ccd

virgl: use transfer queue · f0e71b10

Gurchetan Singh authored Jan 03, 2019 and

Gert Wollny committed Feb 15, 2019



This improves Unigine Valley benchmark by 3 to 10 fps (depending
on the scene).

It also improves the Team Fortress 2 benchmark from 6 fps to 13
fps (host: 20 fps).

Reviewed-by: Gert Wollny <gert.wollny@collabora.com>

f0e71b10

virgl: introduce transfer queue · 4a7857b3

Gurchetan Singh authored Dec 28, 2018 and

Gert Wollny committed Feb 15, 2019



Transfers will be placed here at unmap time instead of incurring
a VM exit. There's an attempt to deduplicate intersecting 1D transfers,
which are surprisingly common.
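
One way to picture the deduplication of intersecting 1D transfers is the sketch below. It is an assumption-laden toy, not the virgl transfer queue: the `toy_*` names are hypothetical, all ranges are assumed to belong to the same resource, and an intersecting transfer is folded into the queued one by taking the union of the two ranges.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_QUEUE_MAX 8

/* Half-open byte range [start, end) within a single buffer resource. */
struct toy_range {
    uint32_t start;
    uint32_t end;
};

struct toy_queue {
    struct toy_range items[TOY_QUEUE_MAX];
    size_t count;
};

static bool toy_ranges_intersect(struct toy_range a, struct toy_range b)
{
    return a.start < b.end && b.start < a.end;
}

/* Queue a transfer. If it intersects an already queued one, extend that
 * entry to cover both instead of queuing a second transfer. The toy does
 * not re-check whether the grown entry now touches other entries. */
static void toy_queue_add(struct toy_queue *q, struct toy_range r)
{
    assert(r.start < r.end);

    for (size_t i = 0; i < q->count; i++) {
        struct toy_range *queued = &q->items[i];

        if (toy_ranges_intersect(*queued, r)) {
            if (r.start < queued->start)
                queued->start = r.start;
            if (r.end > queued->end)
                queued->end = r.end;
            return;
        }
    }

    assert(q->count < TOY_QUEUE_MAX);
    q->items[q->count++] = r;
}
```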

This can also help with mipmapped texture upload and smaller
textures, where the majority of the time is spent in the guest
kernel / QEMU -- not virglrenderer.  This is shown by the GLbench
texture upload benchmark:

Before:
    texture_upload_rgba_teximage2d_32 = 64.23 mtexel_sec
After:
    texture_upload_rgba_teximage2d_32 = 367.44 mtexel_sec

v2: Split up list iteration functions (@gerddie)
v3: Support for optimizing glBufferSubData
Reviewed-by: Gert Wollny <gert.wollny@collabora.com>

4a7857b3

virgl: add encoder functions for new protocol · 9c493094

Gurchetan Singh authored Nov 28, 2018 and Gert Wollny committed Feb 15, 2019

Let's encode the new protocol with new helper functions.

Reviewed-by: Gert Wollny <gert.wollny@collabora.com>

9c493094

virgl: make winsys modifications for encoded transfers · 5510cc67

Gurchetan Singh authored Jan 03, 2019 and

Gert Wollny committed Feb 15, 2019



The idea is to have two command buffers:

1) One for transfers
2) One for commands, which can include transfers

At flush time, (2) will be filled.  Otherwise, (1) will be
used to submit transfers if there are enough of them.

v2: Pass size directly to cmd_buf_create (@gerddie)
Reviewed-by: Gert Wollny <gert.wollny@collabora.com>

5510cc67

virgl: add extra checks in virgl_res_needs_flush_wait · 90e96505

Gurchetan Singh authored Feb 05, 2019 and

Gert Wollny committed Feb 15, 2019



This is motivated by the following scenario:

glBufferSubData(GL_ARRAY_BUFFER, ...)
glFlush(..)
glBufferSubData(GL_ARRAY_BUFFER, ...)
glBufferSubData(GL_ARRAY_BUFFER, ...)
glBufferSubData(GL_ARRAY_BUFFER, ...)

This increases @davidriley's Team Fortress 2 apitrace from
1 fps to 6 fps and helps with the Chromium glbench
microbenchmarks:

Before:
   texture_update_rgba_texsubimage2d_2048 = 554.96 mtexel_sec
   buffer_upload_dynamic_array_12 = 0.02 mbytes_sec
   buffer_upload_dynamic_array_576 = 1.07 mbytes_sec
After:
   texture_update_rgba_texsubimage2d_2048 = 612.29 mtexel_sec
   buffer_upload_dynamic_array_12 = 2.22 mbytes_sec
   buffer_upload_dynamic_array_576 = 164.89 mbytes_sec
Reviewed-by: Gert Wollny <gert.wollny@collabora.com>

90e96505

virgl: pass virgl transfer to virgl_res_needs_flush_wait · ab6ea6e9

Gurchetan Singh authored Feb 08, 2019 and Gert Wollny committed Feb 15, 2019

Reviewed-by: Gert Wollny <gert.wollny@collabora.com>

ab6ea6e9

virgl: keep track of number of computations · d98fbd9c

Gurchetan Singh authored Feb 05, 2019 and Gert Wollny committed Feb 15, 2019

It's good to keep track of these things.

Reviewed-by: Gert Wollny <gert.wollny@collabora.com>

d98fbd9c

virgl: limit command length to 16 bits · 35515985

Gurchetan Singh authored Jan 23, 2019 and

Gert Wollny committed Feb 15, 2019



Much of our logic is based around the idea the upper 16 bits
of a command dword can encode the length of the command.

Now that the command buffer size is >= 2^16 - 1, we should check for
this.
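
The constraint can be sketched like this. The layout assumed here (length in the upper 16 bits, everything else in the lower 16) follows the description above, but the `toy_*` names are hypothetical and this is not the virgl encoder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Largest length that fits in the upper 16 bits of a header dword. */
#define TOY_MAX_CMD_LENGTH 0xFFFFu

/* Pack a command id and a length into one dword. Returns false, leaving
 * *header untouched, if either field does not fit in 16 bits. */
static bool toy_encode_header(uint32_t cmd, uint32_t length, uint32_t *header)
{
    assert(header != 0);

    if (cmd > 0xFFFFu || length > TOY_MAX_CMD_LENGTH)
        return false;

    *header = cmd | (length << 16);
    return true;
}

static uint32_t toy_header_length(uint32_t header)
{
    return header >> 16;
}
```

Without the range check, a length of 0x10000 would be shifted out of the dword entirely and decode as 0.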

v2: alignment, and only check VIRGL_ENCODE_MAX_DWORDS
Reviewed-by: Gert Wollny <gert.wollny@collabora.com>

35515985

virgl: use virgl_transfer in inline write · 503ffe46

Gurchetan Singh authored Nov 28, 2018 and

Gert Wollny committed Feb 15, 2019



Let's define a helper function and use it.

This commit also allows resources to be emitted into different command
buffers.

Like the ioctls, send 0 for layer_stride and stride.  If we actually
send the real values, there are various assumptions in virglrenderer
for non-1D buffers that may need to be modified.

Reviewed-by: Gert Wollny <gert.wollny@collabora.com>

503ffe46

virgl: add protocol for resource transfers · 0fcd48ba

Gurchetan Singh authored Nov 19, 2018 and

Gert Wollny committed Feb 15, 2019



Mostly similar to VIRGL_CCMD_RESOURCE_INLINE_WRITE.  However, this
uses the resource's already attached iovecs rather than the command
buffer to transfer the data.

v2: Used (1 << 16) not (1 << 15) [@gerddie]
Reviewed-by: Gert Wollny <gert.wollny@collabora.com>

0fcd48ba

virgl: when creating / freeing transfers, pass slab pool directly · 168c3ffc

Gurchetan Singh authored Jan 03, 2019 and Gert Wollny committed Feb 15, 2019

This will allow us to destroy transfers w/o having a pointer
to the context.

Reviewed-by: Gert Wollny <gert.wollny@collabora.com>

168c3ffc

virgl: unmap uploader at flush time · d5c2dacc

Gurchetan Singh authored Jan 07, 2019 and

Gert Wollny committed Feb 15, 2019



This should save some memory when allocating and freeing transfers.

Reviewed-by: Gert Wollny <gert.wollny@collabora.com>

d5c2dacc
