v3d: implement TMU pipelining

Iago Toral requested to merge itoral/mesa:v3d_tmu_pipelining into master

This allows us to queue more than one TMU operation, instead of immediately emitting a thread switch and LDTMU/TMUWT after each one. That hides latency better, because the LDTMU and TMUWT instructions (which may stall) can be postponed. This provides a modest performance improvement that is usually in the 1%-3% range, although in some specific cases I have observed it go a bit over 5%.

This also improves shader-db stats almost across the board:

total instructions in shared programs: 8986697 -> 8975062 (-0.13%)
instructions in affected programs: 5249532 -> 5237897 (-0.22%)
helped: 13975
HURT: 12110
Instructions are helped.

total threads in shared programs: 233272 -> 236802 (1.51%)
threads in affected programs: 3734 -> 7264 (94.54%)
helped: 1799
HURT: 34
Threads are helped.

total uniforms in shared programs: 2670807 -> 2663911 (-0.26%)
uniforms in affected programs: 186620 -> 179724 (-3.70%)
Uniforms are helped.

total max-temps in shared programs: 1449623 -> 1424169 (-1.76%)
max-temps in affected programs: 434345 -> 408891 (-5.86%)
helped: 13592
HURT: 1224
95% mean confidence interval for max-temps %-change: -6.40% -6.15%
Max-temps are helped.

The only caveat is that pipelining extends the liveness of TMU sequences. Since we cannot emit TMU spills during an active TMU sequence, shaders with high register pressure that need TMU spilling become a lot more difficult to spill efficiently, leading to larger spill/fill counts at best, or to shaders that fail to compile at worst. To deal with this, if we fail to register allocate a shader, we disable TMU pipelining and compile the shader again. With that, the spilling behavior in shader-db isn't significantly affected any more (it should also be possible to disable pipelining completely in the presence of any spilling whatsoever if we think that is better):

total spills in shared programs: 1848 -> 1855 (0.38%)
spills in affected programs: 428 -> 435 (1.64%)
helped: 4

total fills in shared programs: 2931 -> 2937 (0.20%)
fills in affected programs: 611 -> 617 (0.98%)
helped: 5
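
As a rough illustration of the fallback described above (try with pipelining, and compile again without it if register allocation fails), here is a minimal sketch. All names, the toy pressure model and the numeric limits are hypothetical and chosen only to make the example self-contained; they are not the driver's actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins: none of these names come from the driver. */
struct shader_info {
    int reg_pressure;        /* toy measure of register pressure */
};

struct compile_result {
    bool ok;                 /* did register allocation succeed? */
    bool tmu_pipelining;     /* strategy that produced the result */
};

/* Toy register allocator. Pipelining keeps TMU sequences live for longer,
 * which is modelled here as a fixed amount of extra pressure. The values
 * 8 and 64 are arbitrary. */
static bool
try_register_allocate(const struct shader_info *s, bool tmu_pipelining)
{
    int pressure = s->reg_pressure + (tmu_pipelining ? 8 : 0);
    return pressure <= 64;
}

/* Try the preferred strategy first and fall back to the conservative one. */
static struct compile_result
compile_shader(const struct shader_info *s)
{
    static const bool strategies[] = { true, false };
    struct compile_result r = { false, false };

    assert(s);
    for (unsigned i = 0; i < sizeof(strategies) / sizeof(strategies[0]); i++) {
        r.tmu_pipelining = strategies[i];
        r.ok = try_register_allocate(s, strategies[i]);
        if (r.ok)
            break;
    }
    return r;
}
```

In this model a shader with low pressure compiles with pipelining enabled, one close to the limit only compiles on the second attempt, and one far over the limit fails under both strategies.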

The TMU can only queue so many requests. The limit depends on the number of threads used to compile the shader, but also on the size of the various FIFOs where TMU inputs and outputs are placed, so the driver needs to know whether it can queue the next TMU operation or must flush the FIFOs before emitting it. This requires knowing in advance how many register writes a new TMU operation needs (since these go straight into one of the FIFOs), so some code refactoring is required. To avoid duplicating the TMU emission logic, the series has the TMU code work in two modes: one that just counts register writes and one where the actual register writes are emitted. This applies to all forms of TMU operations: general, texturing and image load/store.
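
The count-then-emit idea can be sketched as follows. This is a simplified model under stated assumptions: it tracks a single input FIFO with a made-up capacity, ignores the thread count and the output FIFO, and every identifier (`tmu_emit_op`, `tmu_queue_op`, `FIFO_CAPACITY`, and so on) is hypothetical rather than taken from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All names and sizes are illustrative, not the driver's. */
enum tmu_mode { TMU_MODE_COUNT, TMU_MODE_EMIT };

#define FIFO_CAPACITY 16u    /* assumed FIFO size, not a hardware value */

struct tmu_state {
    uint32_t queued_writes;  /* register writes pending in the FIFO */
    uint32_t flushes;        /* number of flushes performed so far */
};

struct tmu_op {
    bool has_offset;         /* hypothetical optional parameters */
    bool has_lod;
};

/* Every register write goes through this helper in both modes, so the
 * counting pass and the emission pass cannot drift apart. */
static void
tmu_write(struct tmu_state *st, enum tmu_mode mode, uint32_t *count)
{
    (*count)++;
    if (mode == TMU_MODE_EMIT)
        st->queued_writes++;   /* stands in for emitting the instruction */
}

/* One code path for both modes; returns the number of register writes. */
static uint32_t
tmu_emit_op(struct tmu_state *st, const struct tmu_op *op, enum tmu_mode mode)
{
    uint32_t count = 0;

    tmu_write(st, mode, &count);          /* coordinate / address */
    if (op->has_offset)
        tmu_write(st, mode, &count);
    if (op->has_lod)
        tmu_write(st, mode, &count);
    return count;
}

static void
tmu_flush(struct tmu_state *st)
{
    st->queued_writes = 0;
    st->flushes++;
}

/* Count first, flush if the new operation would not fit, then emit. */
static void
tmu_queue_op(struct tmu_state *st, const struct tmu_op *op)
{
    uint32_t needed = tmu_emit_op(st, op, TMU_MODE_COUNT);

    assert(needed <= FIFO_CAPACITY);
    if (st->queued_writes + needed > FIFO_CAPACITY)
        tmu_flush(st);

    uint32_t emitted = tmu_emit_op(st, op, TMU_MODE_EMIT);
    assert(emitted == needed);
    (void)emitted;
}
```

With three writes per operation and the assumed capacity of 16, five operations can be queued back to back; the sixth triggers a flush before it is emitted.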

There are a number of scenarios where we need to flush outstanding TMU operations: for example, when a TMU operation reads the result of another one, when we cross control flow block boundaries, or when we cross a barrier or discard.
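
The flush conditions named above could be expressed as a small predicate. This is only a sketch: the event kinds cover just the examples given in the text (the real list may be longer), and the bitmask representation of pending results is an assumption made for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical event kinds, limited to the cases named in the text. */
enum instr_kind {
    INSTR_TMU_OP,
    INSTR_BLOCK_BOUNDARY,
    INSTR_BARRIER,
    INSTR_DISCARD,
};

struct instr {
    enum instr_kind kind;
    uint32_t src_mask;   /* bit i set: reads hypothetical value i */
};

/* pending_tmu_dests has bit i set when value i is the result of a queued
 * TMU operation whose result has not been loaded yet. */
static bool
needs_tmu_flush(uint32_t pending_tmu_dests, const struct instr *next)
{
    assert(next);

    if (pending_tmu_dests == 0)
        return false;    /* nothing outstanding, nothing to flush */

    switch (next->kind) {
    case INSTR_BLOCK_BOUNDARY:
    case INSTR_BARRIER:
    case INSTR_DISCARD:
        return true;
    case INSTR_TMU_OP:
        /* A TMU operation that reads the result of a pending one. */
        return (next->src_mask & pending_tmu_dests) != 0;
    }
    return false;
}
```

An independent TMU operation can therefore join the queue, while a dependent one, or any barrier, discard or block boundary, forces the outstanding operations to be flushed first.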

The series includes a few commits that are meant to be squashed to ease review.

Edited by Iago Toral
