iris: Reduce calls to flush_for_cross_batch_dependencies during GPU transcoding
Problem
I observed a potentially avoidable performance issue when profiling GPU-based ASTC transcoding in !19886 (1927832f).
I created a flamegraph using a texture upload microbenchmark (modified to upload 100 images to reduce noise). The graph shows that flush_for_cross_batch_dependencies takes up 9.2% of the CPU cycles. For reference, st_UnmapTextureImage (which should contain the above flush calls) takes up 37.4% of the cycles. The device under test is an Ice Lake laptop (with native ASTC support not advertised).
These cross-batch dependencies occur because the new transcoding path uploads the compute shader output using the 3D engine via iris_copy_region. The microbenchmark hits this frequently because it uploads many ASTC textures (each of them mipmapped) before doing a draw call.
It seems like we could avoid this issue and that doing so would be generally beneficial. Thoughts? An initial idea is added as a proposal below.
Proposal
Instead of flushing a batch when a direct dependency is detected, flush a queue of batches when a circular dependency is detected.
For example, the direct dependency method looks like:
- recording batch A
- recording batch B
- B will depend on a BO in A. flush A
- continue recording B
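As a rough illustration of the steps above, here is a toy Python model of the direct dependency method. It is not the iris code: the class and method names (DirectFlushModel, record, use_bo_from) are hypothetical, and the model assumes that every cross-batch BO use flushes the producing batch if that batch has unsubmitted work.

```python
class DirectFlushModel:
    """Toy model (hypothetical names): flush the producer on every direct dependency."""

    def __init__(self):
        self.unsubmitted = set()  # batches that currently hold recorded work
        self.flushed = []         # batch names, in the order they were flushed

    def record(self, batch):
        """Record some work into `batch`."""
        self.unsubmitted.add(batch)

    def use_bo_from(self, producer, consumer):
        """`consumer` is about to use a BO referenced by `producer`."""
        if producer != consumer and producer in self.unsubmitted:
            # Direct dependency: submit the producer right away.
            self.unsubmitted.discard(producer)
            self.flushed.append(producer)


# Three rounds of "record A, record B, B uses a BO from A".
model = DirectFlushModel()
for _ in range(3):
    model.record("A")
    model.record("B")
    model.use_bo_from(producer="A", consumer="B")
```

In this model every round costs one flush of A, so the number of flushes grows with the number of cross-batch uses.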
If using a queue data structure, the circular dependency method would look like:
- recording batch A
- recording batch B
- B will depend on a BO in A. add A, then B to queue
- continue recording B
- recording A
- A will depend on a BO in B. try to add B then A to the queue. skip adding B to the queue - it's already at the end. notice that A is in the queue, but not at the end; flush the queue.
- continue recording A
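The queue-based steps above could be modelled roughly as follows. This is a sketch in Python, not iris code, and all names (BatchQueue, add_dependency, pending, flushes) are hypothetical. It also makes two assumptions that go beyond the description: the "already at the end" check is generalized to "producer is already ahead of consumer in the queue", and a producer that shows up after its consumer was queued is handled conservatively by flushing.

```python
class BatchQueue:
    """Toy model (hypothetical names): defer flushes until an ordering conflict."""

    def __init__(self):
        self.pending = []  # batches in the order they must be submitted
        self.flushes = []  # one entry per flush: the batches submitted together

    def flush(self):
        """Submit every queued batch in order and empty the queue."""
        if self.pending:
            self.flushes.append(list(self.pending))
            self.pending.clear()

    def add_dependency(self, producer, consumer):
        """`consumer` will use a BO from `producer`. Returns True if a flush happened."""
        q = self.pending
        if producer in q and consumer in q:
            if q.index(producer) < q.index(consumer):
                return False  # required order is already in the queue
            self.flush()      # circular dependency: consumer is queued before producer
            return True
        if consumer in q:
            # Assumption: the producer is new but the consumer is already queued,
            # so appending would break the order. Flush, then start a new queue.
            self.flush()
            self.pending.extend([producer, consumer])
            return True
        if producer not in q:
            q.append(producer)
        q.append(consumer)
        return False


queue = BatchQueue()
queue.add_dependency("A", "B")  # B uses a BO from A: queue becomes [A, B]
queue.add_dependency("A", "B")  # same dependency again: nothing to do
queue.add_dependency("B", "A")  # A uses a BO from B: circular, flush [A, B]
```

In this sketch a repeated "B depends on A" costs no flush at all; only the reversed dependency forces one, which is the behavior the proposal is after.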
If actually using a queue would be slow or complex, maybe we can approximate the above behavior to solve those issues.