gallium: PBO acceleration
After patient and thoughtful guidance from @mareko, I've been deep into the depths of PBO functionality for the past week and a half. Originally the motivation was just to handle cases for zink where there was no equivalent VK format for a GL format, leading to unacceptable performance drops in apps like RPCS3 and CS:GO due to hitting software fallbacks, but one thing led to another, and I now have an implementation which handles every format in CTS without using any software fallbacks at all, even for compressed formats.
It seems to me like this could be optimized a bit and then, for drivers which support the required features, used as a full replacement for the existing PBO codepaths.
Yeah we're doing subheadings.
This is a completely separate implementation that I imagine as being enabled by drivers using a pipe cap:
PIPE_CAP_PREFER_BLIT_BASED_TEXTURE_TRANSFER is the current name of the cap, but I plan to change this to something like PIPE_CAP_TEXTURE_TRANSFER_METHOD with an enum that extends the current values (0 for CPU, 1 for blit-based, 2 for compute)
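To make the enum idea a bit more concrete, here's a rough sketch of what the cap values might look like. Only the numeric values (0, 1, 2) come from the description above; the enum name, the enumerator names, and the helper are hypothetical and not taken from Mesa.

```c
#include <stdbool.h>

/* Hypothetical names: only the values 0/1/2 are from the text. */
enum texture_transfer_method {
   TEXTURE_TRANSFER_CPU = 0,     /* software path */
   TEXTURE_TRANSFER_BLIT = 1,    /* existing blit-based path */
   TEXTURE_TRANSFER_COMPUTE = 2, /* new compute shader path */
};

/* Hypothetical helper: decide whether the compute path should be
 * tried, given the value a driver reported for the cap. */
static bool
transfer_wants_compute(int cap_value)
{
   return cap_value == TEXTURE_TRANSFER_COMPUTE;
}
```

Since 0 and 1 keep their existing meaning, a driver that already reports one of those values would presumably behave the same as before.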
Once enabled, the new mechanism determines a src and dst format, packs a vec4-sized constant buffer with params, and then fires off a launch_grid call with an ubershader that can handle every possible conversion (*some restrictions apply) in order to cut down on the number of shaders required. The shader reads from a samplerview and writes to an SSBO, and it's cached in st->pbo because these things are unbelievably slow to compile at runtime.
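As an illustration of squeezing params into a vec4-sized (16 byte) constant buffer, here's a minimal sketch. The struct, its fields, and the bit layout are all assumptions made up for this example; the actual layout used by the implementation isn't described here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical parameter set for one download; not the real layout. */
struct download_params {
   uint16_t xoffset, yoffset;
   uint16_t width, height;
   uint16_t depth;
   uint8_t alignment;  /* GL_PACK_ALIGNMENT: 1, 2, 4 or 8 */
   bool swap_bytes;    /* GL_PACK_SWAP_BYTES */
};

/* Pack everything into four 32-bit words, i.e. one vec4 of uints. */
static void
pack_download_params(uint32_t out[4], const struct download_params *p)
{
   out[0] = (uint32_t)p->xoffset | ((uint32_t)p->yoffset << 16);
   out[1] = (uint32_t)p->width | ((uint32_t)p->height << 16);
   out[2] = p->depth;
   out[3] = (uint32_t)p->alignment | ((uint32_t)p->swap_bytes << 8);
}
```

The shader side would then unpack the same fields with shifts and masks, which keeps the per-dispatch upload down to a single vec4.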
I started with glGetTextureSubImage since it was easier to learn how formats and such worked going in that direction, so that's what I have at present.
st_GetTexSubImage_shader() allocates an SSBO for the size required and then performs the format conversion in the shader, doing a memcpy afterwards to write back to the destination buffer. If there are no rowlength/imageheight/skip parameters set for pixelstore, this is always a single, direct memcpy, otherwise it's an iterated memcpy to handle the row/column skipping.
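The single-memcpy vs. iterated-memcpy split can be sketched roughly like this for the 2D case. This assumes the shader output is tightly packed and that alignment is 1; the struct and function names are hypothetical and this is not the actual Mesa code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical subset of the pixelstore pack state (2D only). */
struct pack_layout {
   int row_length;   /* GL_PACK_ROW_LENGTH, 0 means "use width" */
   int skip_pixels;  /* GL_PACK_SKIP_PIXELS */
   int skip_rows;    /* GL_PACK_SKIP_ROWS */
};

/* Copy tightly packed rows from src (the converted data) to dst
 * (the user's buffer), honoring row length and skips. */
static void
copy_packed_rows(uint8_t *dst, const uint8_t *src, int width, int height,
                 size_t bytes_per_pixel, const struct pack_layout *pack)
{
   size_t src_stride = (size_t)width * bytes_per_pixel;
   int row_pixels = pack->row_length > 0 ? pack->row_length : width;
   size_t dst_stride = (size_t)row_pixels * bytes_per_pixel;

   /* Nothing special set: one direct memcpy. */
   if (dst_stride == src_stride && !pack->skip_pixels && !pack->skip_rows) {
      memcpy(dst, src, src_stride * (size_t)height);
      return;
   }

   /* Otherwise copy row by row, stepping over the skipped region. */
   uint8_t *first = dst + (size_t)pack->skip_rows * dst_stride +
                    (size_t)pack->skip_pixels * bytes_per_pixel;
   for (int y = 0; y < height; y++)
      memcpy(first + (size_t)y * dst_stride,
             src + (size_t)y * src_stride, src_stride);
}
```

A 3D or array download would add one more loop over images using GL_PACK_IMAGE_HEIGHT and GL_PACK_SKIP_IMAGES in the same way.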
GL_PACK_SWAP_BYTES is handled by the shader, as is alignment.
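For reference, here's what those two pieces of pack state amount to, written as a CPU-side C sketch rather than shader code. The function names are made up for illustration; the shader does the equivalent work on the GPU.

```c
#include <stddef.h>
#include <stdint.h>

/* GL_PACK_ALIGNMENT (1, 2, 4 or 8): each row starts at a multiple
 * of the alignment, so the row stride gets rounded up. */
static size_t
aligned_row_stride(size_t row_bytes, size_t alignment)
{
   return (row_bytes + alignment - 1) / alignment * alignment;
}

/* GL_PACK_SWAP_BYTES: reverse the byte order of each multi-byte
 * component, shown here for 16-bit and 32-bit components. */
static uint16_t
swap_bytes16(uint16_t v)
{
   return (uint16_t)((v >> 8) | (v << 8));
}

static uint32_t
swap_bytes32(uint32_t v)
{
   return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
          ((v << 8) & 0x00ff0000u) | (v << 24);
}
```

So a 5-byte row with an alignment of 4 ends up with an 8-byte stride, and a 32-bit component 0x11223344 is written out as 0x44332211 when byte swapping is on.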
Obviously we're gonna need benchmarks for PBO performance now, so I started one today based on the perf infrastructure in piglit (e.g., drawoverhead). I call it pbobench, and it's still very humble in its current state, but it works and does the following:
- create a 1024x1024 2D texture
- populate with R32F pixel data (for convenience and size)
- perform glGetTextureSubImage on it iterating over the available format conversions and power-of-two texture sizes
The results were sort of what I expected, but they were also quite interesting in some places:
- radeonsi - zmike/mesa$1956
- radeonsi + compute - zmike/mesa$1957
- iris - zmike/mesa$1958
- iris + compute - zmike/mesa$1959
A note on the second iris result is that according to perf, >20% of time is spent just creating samplerviews (not visible at all on radeonsi), so it seems like there's maybe some overhead that can be examined and optimized there to improve results directly.
Other than that, however, what's interesting to note is that as the texture downloads get larger, the performance gap starts to diminish, and in quite a few cases, the compute shader is much, much faster on both drivers. Given that this is essentially the first shader I've ever written which wasn't just for a test case, I assume there's optimization that can be done here, but the initial perf seems quite good.
There are actually quite a few apps which use PBO downloads in hotpaths, though my personal test cases are CS:GO and RPCS3; the former does a ton of PBO uploads all the time (mostly alpha/luminance), and the latter does full-window downloads with a high degree of regularity.
I'm still cleaning this up (a lot) before I put up an MR, but anyone interested can find the current state of this in zmike/fffffffffff.