- 02 Mar, 2022 10 commits
-
Acked-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <!15213>
-
We only have a single prog_data::total_scratch for all shader variants (SIMD 8, 16, 32). Therefore we should always take the maximum of the new variant's scratch size and the existing value. We probably haven't run into this issue before because we compile in increasing SIMD size and higher SIMD sizes are more likely to spill. But for bindless shaders with return shaders, if the last return part doesn't spill, we completely ignore the scratch computed for the previous parts.

Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: mesa-stable
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Part-of: <!15193>
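
A minimal sketch of the accumulation described above (the field name and variable are illustrative assumptions, not the exact patch):

    /* Keep the maximum scratch size across all compiled SIMD variants
     * instead of overwriting it with the last variant's value.
     */
    prog_data->total_scratch = MAX2(prog_data->total_scratch,
                                    variant_total_scratch);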
-
It seems we had already implemented this feature (see commit 521e1d02 "broadcom/vc5: Add support for anisotropic filtering"), but we didn't enable the corresponding capability. Also update the maximum level of anisotropy supported.

Fixes: #4201
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <!15180>
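
As a sketch of what advertising the capability looks like on the Gallium side (the 16.0 limit and function names are assumptions for illustration; the driver's actual values may differ):

    /* Illustrative pipe_screen callbacks; requires "pipe/p_screen.h". */
    static int
    screen_get_param(struct pipe_screen *pscreen, enum pipe_cap param)
    {
       switch (param) {
       case PIPE_CAP_ANISOTROPIC_FILTER:
          return 1;      /* the feature was already implemented, just not advertised */
       default:
          return 0;
       }
    }

    static float
    screen_get_paramf(struct pipe_screen *pscreen, enum pipe_capf param)
    {
       switch (param) {
       case PIPE_CAPF_MAX_TEXTURE_ANISOTROPY:
          return 16.0f;  /* assumed maximum anisotropy level */
       default:
          return 0.0f;
       }
    }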
-
Also change the code to preserve certain metadata: control flow is not changed, so both block indices and dominance information are preserved.

Reviewed-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Part-of: <!15206>
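
For reference, preserving those two pieces of metadata in a NIR pass typically looks like the sketch below (the surrounding pass is hypothetical):

    /* After rewriting instructions without touching the CFG: */
    nir_metadata_preserve(impl, nir_metadata_block_index |
                                nir_metadata_dominance);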
-
Now that we don't sort our nodes, we can arrange them so that we can easily translate between nodes and temps without a mapping table, just by applying an offset. To do this we have a single array of nodes where we put the nodes for accumulators first, followed by the nodes for temps. With this setup we can ensure that for any given temp T, its node is always T + ACC_COUNT.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <!15168>
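
A minimal sketch of that translation (the ACC_COUNT value of 6 is an assumption for illustration):

    #include <assert.h>
    #include <stdint.h>

    #define ACC_COUNT 6   /* accumulator nodes occupy indices [0, ACC_COUNT) */

    static inline uint32_t temp_to_node(uint32_t temp) { return temp + ACC_COUNT; }

    static inline uint32_t node_to_temp(uint32_t node)
    {
       assert(node >= ACC_COUNT);
       return node - ACC_COUNT;
    }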
-
Nodes are assigned to registers in order, so initially sorting was used to ensure that nodes with smaller live ranges would be assigned first and therefore be more likely to get accumulators. However, since d81a6e5f we no longer rely on order to make decisions about accumulators and instead make policy decisions based on actual liveness, so sorting is no longer strictly relevant to this decision. Furthermore, we are not re-sorting nodes after each spill either, since that would probably require rebuilding the interference graph after each spill (the graph identifies nodes by their index).

Shader-db results show a significant improvement in instruction counts, due to more optimal accumulator assignments. The reason for this is that we use a round-robin policy for choosing the next accumulator to assign. The idea behind this is to prevent nearby temps from being assigned to the same accumulator so that QPU scheduling is more flexible, but if we sort our nodes, we are basically no longer assigning temps in program order and the round-robin policy becomes less effective.

total instructions in shared programs: 13000420 -> 12663189 (-2.59%)
instructions in affected programs: 11791267 -> 11454036 (-2.86%)
helped: 62890
HURT: 19987

total threads in shared programs: 415874 -> 415870 (<.01%)
threads in affected programs: 20 -> 16 (-20.00%)
helped: 2
HURT: 4

total uniforms in shared programs: 3711652 -> 3711624 (<.01%)
uniforms in affected programs: 43430 -> 43402 (-0.06%)
helped: 134
HURT: 173

total max-temps in shared programs: 2144876 -> 2138822 (-0.28%)
max-temps in affected programs: 123334 -> 117280 (-4.91%)
helped: 4112
HURT: 1195

total spills in shared programs: 3870 -> 3860 (-0.26%)
spills in affected programs: 1013 -> 1003 (-0.99%)
helped: 14
HURT: 12

total fills in shared programs: 5560 -> 5573 (0.23%)
fills in affected programs: 1765 -> 1778 (0.74%)
helped: 14
HURT: 17

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <!15168>
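
A sketch of the round-robin idea mentioned above (purely illustrative; the function, the accumulator count and the free-set tracking are assumptions):

    /* Pick the next free accumulator starting after the last one handed out,
     * so temps that are close together in the program tend to land on
     * different accumulators.
     */
    static int pick_accumulator(const bool acc_free[6], int *next_acc)
    {
       for (int i = 0; i < 6; i++) {
          int acc = (*next_acc + i) % 6;
          if (acc_free[acc]) {
             *next_acc = (acc + 1) % 6;
             return acc;
          }
       }
       return -1;   /* no accumulator available */
    }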
-
total instructions in shared programs: 13014428 -> 13000420 (-0.11%)
instructions in affected programs: 743624 -> 729616 (-1.88%)
helped: 1392
HURT: 611

total threads in shared programs: 415858 -> 415874 (<.01%)
threads in affected programs: 16 -> 32 (100.00%)
helped: 8
HURT: 0

total uniforms in shared programs: 3720410 -> 3711652 (-0.24%)
uniforms in affected programs: 113442 -> 104684 (-7.72%)
helped: 635
HURT: 29

total max-temps in shared programs: 2154268 -> 2144876 (-0.44%)
max-temps in affected programs: 61279 -> 51887 (-15.33%)
helped: 1124
HURT: 187

total spills in shared programs: 4002 -> 3870 (-3.30%)
spills in affected programs: 265 -> 133 (-49.81%)
helped: 6
HURT: 0

total fills in shared programs: 5788 -> 5560 (-3.94%)
fills in affected programs: 603 -> 375 (-37.81%)
helped: 6
HURT: 0

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <!15168>
-
For us they are basically uniforms too, so we want to make their lifespans short to facilitate allocating them to accumulators.

total instructions in shared programs: 13043585 -> 13015385 (-0.22%)
instructions in affected programs: 8326040 -> 8297840 (-0.34%)
helped: 24939
HURT: 19894

total threads in shared programs: 415860 -> 415858 (<.01%)
threads in affected programs: 4 -> 2 (-50.00%)
helped: 0
HURT: 1

total uniforms in shared programs: 3721953 -> 3720451 (-0.04%)
uniforms in affected programs: 96134 -> 94632 (-1.56%)
helped: 744
HURT: 435

total max-temps in shared programs: 2173431 -> 2154260 (-0.88%)
max-temps in affected programs: 264598 -> 245427 (-7.25%)
helped: 10858
HURT: 841

total spills in shared programs: 4005 -> 4010 (0.12%)
spills in affected programs: 700 -> 705 (0.71%)
helped: 5
HURT: 10

total fills in shared programs: 5801 -> 5817 (0.28%)
fills in affected programs: 1346 -> 1362 (1.19%)
helped: 6
HURT: 11

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <!15168>
-
If we are compiling with a strategy that does not allow TMU spills, we should not allow spilling anything that is not a uniform. Otherwise the RA cost/benefit algorithm may choose to spill a temp that is not a uniform, which will cause us to immediately fail the strategy and fall back to the next one, even if we could instead have chosen to spill more uniforms and compiled the program successfully with that strategy.

Some relevant shader-db stats:

total instructions in shared programs: 13040711 -> 13043585 (0.02%)
instructions in affected programs: 234238 -> 237112 (1.23%)
helped: 73
HURT: 172

total threads in shared programs: 415664 -> 415860 (0.05%)
threads in affected programs: 196 -> 392 (100.00%)
helped: 98
HURT: 0

total uniforms in shared programs: 3717266 -> 3721953 (0.13%)
uniforms in affected programs: 12831 -> 17518 (36.53%)
helped: 6
HURT: 100

total max-temps in shared programs: 2174177 -> 2173431 (-0.03%)
max-temps in affected programs: 4597 -> 3851 (-16.23%)
helped: 79
HURT: 21

total spills in shared programs: 4010 -> 4005 (-0.12%)
spills in affected programs: 55 -> 50 (-9.09%)
helped: 5
HURT: 0

total fills in shared programs: 5820 -> 5801 (-0.33%)
fills in affected programs: 186 -> 167 (-10.22%)
helped: 5
HURT: 0

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <!15168>
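
A sketch of the policy (the compiler fields and helper functions here are assumptions for illustration, not the exact patch):

    /* Only offer non-uniform temps as spill candidates when the current
     * strategy allows TMU spills; otherwise restrict spilling to uniforms.
     */
    for (uint32_t tmp = 0; tmp < c->num_temps; tmp++) {
       bool is_uniform = temp_is_uniform_load(c, tmp);   /* hypothetical helper */

       if (!is_uniform && !c->tmu_spilling_allowed)
          continue;                                       /* not a candidate */

       ra_set_node_spill_cost(c->g, temp_to_node(tmp),
                              spill_cost(c, tmp, is_uniform));
    }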
-
Our cost was 5, which matches the number of instructions we have to add for a TMU spill (a fill is 4 instructions). Uniform spills, on the other hand, add an extra instruction for each fill and remove one instruction for the spill itself; these have a cost of 1. Therefore, if we have a single spill+fill, we end up with +9 instructions for a TMU spill and +0 instructions for a uniform spill, so making the former only 5 times more costly is probably not a good idea, and that is without even considering the added latency of the TMU accesses.

Relevant shader-db changes show this causes a marginal instruction count increase in a few shaders but better thread counts and lower TMU spilling overall:

total instructions in shared programs: 13037315 -> 13040711 (0.03%)
instructions in affected programs: 370106 -> 373502 (0.92%)
helped: 187
HURT: 321

total threads in shared programs: 415090 -> 415664 (0.14%)
threads in affected programs: 574 -> 1148 (100.00%)
helped: 287
HURT: 0

total uniforms in shared programs: 3706674 -> 3717266 (0.29%)
uniforms in affected programs: 63075 -> 73667 (16.79%)
helped: 40
HURT: 395

total max-temps in shared programs: 2176080 -> 2174177 (-0.09%)
max-temps in affected programs: 15838 -> 13935 (-12.02%)
helped: 316
HURT: 34

total spills in shared programs: 4247 -> 4010 (-5.58%)
spills in affected programs: 2599 -> 2362 (-9.12%)
helped: 107
HURT: 14

total fills in shared programs: 6121 -> 5820 (-4.92%)
fills in affected programs: 3622 -> 3321 (-8.31%)
helped: 108
HURT: 13

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <!15168>
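
The arithmetic above, expressed as a small illustrative helper (a sketch, not the driver's actual cost function):

    /* Instruction-count delta for one spilled temp:
     *   TMU spill:     +5 for the spill, +4 per fill  -> 1 spill + 1 fill = +9
     *   uniform spill: -1 for the spill, +1 per fill  -> 1 spill + 1 fill = +0
     */
    static int spill_instruction_delta(bool is_tmu_spill, int num_fills)
    {
       return is_tmu_spill ? 5 + 4 * num_fills
                           : -1 + 1 * num_fills;
    }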
-
- 01 Mar, 2022 30 commits
-
The problem is that dirty_states must be 0 for any state that is NULL in "queued". This code was flagging dirty_states for such states because it was only looking at "emitted". It should have been looking at "queued".

Cc: mesa-stable@lists.freedesktop.org
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <!15209>
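
A minimal sketch of the fix described above (the structure layout, loop bound and field names are assumptions for illustration):

    /* Only flag a state dirty if it is actually set in "queued"; states that
     * are NULL there must keep dirty_states at 0.
     */
    for (unsigned i = 0; i < SI_NUM_STATES; i++) {
       if (sctx->queued.array[i])          /* previously tested emitted.array[i] */
          sctx->dirty_states |= BITFIELD_BIT(i);
    }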
-
Fixes: 946bd90a ("radeonsi: decrease the size of si_pm4_state::pm4 except for cs_preamble_state")
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <!15209>
-
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <!15209>
-
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <!15209>
-
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <!15209>
-
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <!15209>
-
Part-of: <!14388>
-
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <!14388>
-
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <!14388>
-
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <!14388>
-
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <!14388>
-
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <!14388>
-
Drivers will use this.

Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <!14388>
-
Moved from radeonsi without the vectorization, which won't be needed for now. We will lower IO in st/mesa instead of radeonsi to get the transform feedback info into store instructions.

Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <!14388>
-
This is for drivers that have separate store instructions for varyings, system value outputs (such as clip distances), and transform feedback. The flags tell the driver not to store the output to those locations. This will be used by radeonsi initially, and then maybe by a new linker.

Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <!14388>
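
A sketch of how a backend might consume such flags when lowering an output store (assuming fields named no_varying and no_sysval_output in the intrinsic's IO semantics; the emit helpers are hypothetical):

    nir_io_semantics sem = nir_intrinsic_io_semantics(intr);

    /* Skip the stores the producer marked as unnecessary. */
    if (!sem.no_varying)
       emit_varying_store(ctx, intr);         /* hypothetical driver helper */
    if (!sem.no_sysval_output)
       emit_sysval_output_store(ctx, intr);   /* hypothetical driver helper */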
-
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <!14388>
-
NIR now fully contains the pipe_stream_output_info contents in shader_info and in IO intrinsics when lower_io_variables is true. radeonsi will not use pipe_stream_output_info after this, and other drivers are free to follow suit.

Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <!14388>
-
This will allow compaction of transform feedback varyings, because with this information they are no longer tied to varying slots. It will also make transform feedback info available to all NIR passes after IO is lowered. It's meant to replace pipe_stream_output_info. Other intrinsics are not used with transform feedback.

Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <!14388>
-
gs_streams is relative to the component. Also clear the high bits.

Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <!14388>
-
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <!14388>
-
Fixes: ad9b5ac0 ("radeonsi: more fixes for si_buffer_from_winsys_buffer for GL-VK interop")
Closes: #6063
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Tested-by: Michel Dänzer <mdaenzer@redhat.com>
Part-of: <!15124>
-
For dmabuf imports, configure the primary surface without support for compression if the modifier doesn't specify it. This helps to create VkImages with memory requirements that are compatible with the buffers apps provide.

Suggested-by: Philip Langdale <philipl@overt.org>
Cc: 22.0 <mesa-stable>
Closes: #5940
Tested-by: Philip Langdale <philipl@overt.org>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Part-of: <!15181>
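
A sketch of the idea (assuming the dmabuf import path has the DRM format modifier at hand and forwards extra ISL usage flags to surface creation; variable names are illustrative):

    const struct isl_drm_modifier_info *mod_info =
       isl_drm_modifier_get_info(drm_format_mod);

    isl_surf_usage_flags_t extra_usage = 0;
    if (mod_info && mod_info->aux_usage == ISL_AUX_USAGE_NONE) {
       /* The modifier carries no compression data, so don't enable
        * compression on the primary surface either.
        */
       extra_usage |= ISL_SURF_USAGE_DISABLE_AUX_BIT;
    }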
-
Use a variable to store the anv_image_create_info struct. We'll modify it for a bug fix in the next patch.

Cc: 22.0 <mesa-stable>
Tested-by: Philip Langdale <philipl@overt.org>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Part-of: <!15181>
-
Replace the create_info parameter with isl_extra_usage_flags to more closely match the parameters of the explicit layout function.

Tested-by: Philip Langdale <philipl@overt.org>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Part-of: <!15181>
-
Minor.

Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <!15204>
-
Useful across multiple optimization tests.

Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <!15204>
-
Disable the Bifrost optimization; it's not portable.

Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <!15204>
-
It's only v6 that's missing this feature.

Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <!15204>
-
Valhall uses 16-wide warps.

Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <!15204>
-
Useful when we grow Valhall support (soon!)

Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <!15204>
-