panfrost: Plan a strategy for better meta operations
Both our Gallium and Vulkan drivers need to implement a number of "blit-like" operations. There are a lot of operations to worry about, there's a different set for each API, and there are lots of potential implementations with different performance/maintainability trade-offs. So far we've been implementing things in a fairly "ad hoc" manner and it's gotten out of hand (especially on the Vulkan side). Moving forward, we need a plan.
This issue attempts to explain the problem space so we can figure out solutions.
Blits
Filtered blit operations (OpenGL: `glBlitFramebuffer`. Vulkan: `vkCmdBlitImage`). The semantics for blits can be complicated. As Mali does not have blitting hardware, blits need to be translated to draws.
In Gallium, we use `u_blitter` for the translation. This works very well, and rarely deserves more thought.
In Vulkan, we roll our own implementation in `pan_blit`. This code is complicated and non-portable (hardcoding lots of Mali-specific details). The code is only used by PanVK. Porting PanVK to Valhall is blocked on (among other things) that port.
In theory, this might get slightly better performance than `u_blitter`, as `u_blitter` uses a vertex shader, which PanVK skips by issuing a tiler job directly and supplying the varyings ahead of time. In practice, this is only a minor optimization. It also turns out to be possible with `u_blitter`, by implementing the `draw_rectangle` callback ourselves to skip the vertex shader. This depends on an internal implementation detail of `u_blitter` (namely, that its vertex shader is always a passthrough), but works okay in practice. By leveraging existing `pan_blitter` and Gallium driver code, this isn't too invasive. !17499 (closed) implements a proof-of-concept of this optimization.
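To make the callback-override idea concrete, here is a minimal, self-contained sketch. The struct and function names below are hypothetical stand-ins, not the real `u_blitter` interface (the real `draw_rectangle` hook takes many more parameters); only the mechanism of swapping the rectangle-draw path is modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for the blitter context.
 * The real struct blitter_context in Mesa is much richer. */
struct blitter_ctx {
   /* Draws a screen-aligned rectangle covering [x0,x1) x [y0,y1). */
   void (*draw_rectangle)(struct blitter_ctx *ctx,
                          int x0, int y0, int x1, int y1);
   bool used_vertex_shader; /* for illustration only */
};

/* Default path: a passthrough vertex shader plus a regular draw. */
static void draw_rectangle_default(struct blitter_ctx *ctx,
                                   int x0, int y0, int x1, int y1)
{
   (void)x0; (void)y0; (void)x1; (void)y1;
   ctx->used_vertex_shader = true; /* vertex + tiler + fragment work */
}

/* Mali-style override: because the blitter's vertex shader is always
 * a passthrough, the driver can precompute the varyings on the CPU
 * and issue the tiler job directly, skipping the vertex shader. */
static void panfrost_draw_rectangle(struct blitter_ctx *ctx,
                                    int x0, int y0, int x1, int y1)
{
   (void)x0; (void)y0; (void)x1; (void)y1;
   ctx->used_vertex_shader = false; /* tiler job issued directly */
}
```

The driver would install `panfrost_draw_rectangle` at context creation; every blit routed through the callback then avoids the vertex shader.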
@jekstrand is working on common meta code. When !17438 (closed) is finished and merged, we plan to switch PanVK to use it and drop the relevant code paths in `pan_blit`, resulting in a `u_blitter`-like situation. Like `u_blitter`, this common Vulkan implementation will allow the driver to override the "rectangle draw" call, so we can implement our Mali tricks there too.
Clear image
Clear to an arbitrary image (Vulkan: `vkCmdClearColorImage`). Nonportable implementation in PanVK. To be replaced by !17438 (closed).
Clear attachment
Clear to a bound attachment (OpenGL: `glClear`. Vulkan: `vkCmdClearAttachments`). `u_blitter` implementation in Gallium since !17112 (merged). Mostly equivalent nonportable implementation in PanVK. To be replaced by !17438 (closed).
For depth/stencil clears in particular, we could do a bit better on Valhall with "full screen draws", a new job type with the property that the hardware can skip adjacent full screen draws that share the same draw call descriptor. In other words, full screen draws must be idempotent. That requirement enables a minor optimization for attachment clears but makes full screen draws unsuitable as a general-purpose fullscreen draw primitive. It's unclear whether this actually makes a difference even for depth/stencil clears... couldn't we detect that pattern on the CPU anyway?
Fill buffer
Fill a buffer with a particular pattern (OpenGL: `glClearBufferSubData`. OpenCL: `clEnqueueFillBuffer`. Vulkan: `vkCmdFillBuffer`). This corresponds to the Gallium hook `clear_buffer`. When `clear_buffer` isn't implemented, mesa/st provides a software fallback. Clover/Rusticl do not, so we implement `clear_buffer` with `u_default_clear_buffer` (since !16044 (merged)), which is another software fallback.
Software fallbacks here have performance issues. Filling should instead be implemented as a compute shader. @mareko proposed that we implement "something like u_blitter that uses compute shaders to implement clears and blits" at the Gallium level, which may be important for OpenCL performance. Other than the vector sizes required for optimal throughput, Mali has no special performance requirements here. If/when a common compute shader implementation is merged, it will be trivial to switch the Gallium driver over.
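As a reference for what a compute-shader fill would compute, here is the fill semantics expressed on the CPU; each GPU invocation would handle one (vectorized) chunk of this loop. The function name and signature are illustrative, not the Gallium `clear_buffer` hook.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* CPU reference for a pattern fill. A compute-shader implementation
 * does the same work in parallel: each invocation computes its byte
 * offset and stores one copy of the repeating pattern there. */
static void
fill_buffer_pattern(uint8_t *buf, size_t size,
                    const uint8_t *pattern, size_t pattern_size)
{
   /* The APIs (clEnqueueFillBuffer, vkCmdFillBuffer, ...) guarantee
    * the fill range is a multiple of the pattern size. */
   for (size_t offset = 0; offset < size; offset += pattern_size)
      memcpy(buf + offset, pattern, pattern_size);
}
```

On the GPU, the per-invocation store would be widened to the vector size the hardware prefers for peak throughput.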
On the Vulkan side, PanVK implements this nonportably. To be replaced by !17438 (closed).
Copies
Various copy operations are required by the APIs (Vulkan: `vkCmdCopyImage`, `vkCmdCopyImageToBuffer`, etc. OpenGL: `glCopyImageSubData`, `glCopyBufferSubData`, etc.)
In Gallium, these get mapped to the `resource_copy_region` hook, which currently falls back to software in `util_resource_copy_region`. In addition to being slow, this conflicts with AFBC (framebuffer compression that we apply even to textures that are otherwise uncompressed). We've disabled `ARB_copy_image` in the meantime because that extension was a mistake, but I digress.
In Vulkan, these are implemented nonportably with a complicated mix of compute and fragment shaders. The manual descriptor packing should be easy enough to replace with meta operations, which will remove the main source of portability problems across Malis (and unblock basic Valhall/PanVK support). Due to AFBC, it's not clear to me whether copy operations can be implemented generically (and efficiently) on the GPU, even with !17438 (closed). Here are some considerations that make this tricky:
- Using the OpenGL meanings of the terms, textures and framebuffers support AFBC but shader images do not. As soon as a texture is bound as a shader image (for read or write), the Gallium driver triggers a decompression blit, and AFBC is lost for the rest of the resource's lifetime. PanVK doesn't support AFBC yet, but when it does, it won't be able to use AFBC for resources with writeable image usage flags. This implies that any operation writing to an image needs to happen with fragment shaders. (`vkCmdCopyImageToBuffer` should be a compute shader that uses `texelFetch` and an SSBO write, whereas `vkCmdCopyBufferToImage` should be a fragment shader that uses an SSBO read and a `gl_FragColor` write.)
- AFBC depends on the specific pixel format. It has transforms that depend on the colour channels themselves; it doesn't simply treat pixels as a "bag of bits" like the copy routines do. Copying between incompatible AFBC formats (`R8G8 <---> R5G6B5`, for example) requires some nasty bit packing in the shader, as PanVK implements. These cases should be rare and could be solved by copying through an intermediate (possibly uncompressed) "staging" resource, which is the approach turnip takes. Unlike with Gallium, they shouldn't be solved by disabling AFBC.
- Starting with Valhall, the driver can specify the physical AFBC format separately from the logical pixel format accessed by the shader. This feature allows implementing texture views efficiently, trivializing problem #2 (closed) on Valhall, provided the functionality can be piped through the driver. It is not a "get out of jail free" card if we want to continue supporting Bifrost in PanVK.
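To illustrate why incompatible-format copies need bit packing: both formats are 16 bits per texel, so a raw "bag of bits" copy just moves the 16-bit word, but a copy shader working through an `R8G8` view of `R5G6B5` data must reproduce the packing by hand. The sketch below assumes little-endian storage and the Vulkan `R5G6B5` bit layout; it is an illustration, not PanVK's actual code.

```c
#include <stdint.h>

/* R5G6B5 packing: bits [15:11] = r, [10:5] = g, [4:0] = b
 * (matching VK_FORMAT_R5G6B5_UNORM_PACK16). */
static uint16_t pack_r5g6b5(uint8_t r5, uint8_t g6, uint8_t b5)
{
   return (uint16_t)(((r5 & 0x1f) << 11) | ((g6 & 0x3f) << 5) | (b5 & 0x1f));
}

/* The same 16 bits seen through an R8G8 view (little-endian):
 * r8 is the low byte, g8 the high byte. A copy shader bound to the
 * R8G8 view has to split/merge the 5:6:5 fields across this byte
 * boundary by hand. */
static void as_r8g8(uint16_t bits, uint8_t *r8, uint8_t *g8)
{
   *r8 = (uint8_t)(bits & 0xff);
   *g8 = (uint8_t)(bits >> 8);
}
```

Note how pure red in `R5G6B5` lands entirely in the `G` channel of the `R8G8` view, which is exactly the kind of channel scrambling AFBC's channel-aware transforms cannot tolerate.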
I am hopeful that !17438 (closed) will help some of these problems for PanVK, but I suspect we'll still need some driver magic. We'll see what @jekstrand can cook up in common.
I don't know if we need to support this (efficiently or at all) in the GL driver. For now I am happy to keep this as a Vulkan-specific craziness. If we do need it, we'll cross that bridge when we get there.
Tilebuffer reloads
Mali GPUs are tilers. At the start of a render pass, the tilebuffer is always cleared. APIs sometimes require preserving framebuffer contents (OpenGL: no `glClear`. Vulkan: `loadOp = LOAD_OP_LOAD`). The driver must insert a fragment draw at the beginning of the render pass that binds the existing framebuffer contents as a texture, samples them, and writes the results into the tilebuffer.
The required pipeline is similar to "regular" blits. However, it may be dispatched specially, as it must precede all other draws. (OpenGL: the driver does not currently know whether the reload is necessary until the end of the frame, although this could be reworked. Vulkan: unnecessary.) On Midgard, this is handled by a special "inject" option in `pan_scoreboard`. On Bifrost+, it is handled by dispatching a pre-frame shader. Tilebuffer reloads are the only current users of frame shaders.
Reloads are currently implemented in `pan_blit`. This code is entirely Mali-specific (with significant portability challenges for new hardware, including v10). The implementation is shared between the APIs.
Mipmap generation
The application may supply a base image and require the driver to generate a full miptree (OpenGL: `glGenerateMipmap`. Vulkan: not available; a draft extension was abandoned due to lack of interest.)
Gallium has generic code (`util_gen_mipmap`) to translate `generate_mipmap` calls into a series of `blit` calls, one per level, which Panfrost uses. The blits are then implemented with ([vertex and] tiler and) fragment jobs via `u_blitter`.
The trouble with this approach is that each blit has a different destination, corresponding to a different FBO and hence a different `panfrost_batch`. So if we need to generate N levels of mipmaps, we need (at least) N job chains. Under our current UABI, that involves (at least) N roundtrips to the kernel. That is a lower bound; the actual performance is substantially worse due to the pervasive assumption that batches are large and that we can amortize expensive operations like memory allocation over an entire batch.
Generating mipmaps efficiently requires "merging" these batches into a single "superbatch" that writes out all levels at once. Internally, each level still has its own ([vertex and] tiler and) fragment job, but all of these jobs are chained together into one (or two) job chains.
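The superbatch shape can be sketched with a toy model (the names are hypothetical, not the real `panfrost_batch` structures): each level keeps its own job, but the jobs are linked into a single chain so only one submission is needed.

```c
#include <stddef.h>

/* Hypothetical, simplified model of the "superbatch" idea: one
 * fragment job per mip level, linked into a single hardware job
 * chain submitted with one kernel roundtrip instead of N. */
struct frag_job {
   int level;              /* mip level this fragment job writes */
   struct frag_job *next;  /* job chain link */
};

/* Link the per-level jobs into one chain; returns the head, which
 * would be handed to a single submit ioctl. */
static struct frag_job *build_superbatch(struct frag_job *jobs, int levels)
{
   for (int i = 0; i + 1 < levels; i++)
      jobs[i].next = &jobs[i + 1];
   jobs[levels - 1].next = NULL;
   return &jobs[0];
}
```

The hard part, of course, is not the chaining itself but teaching the batch-tracking code that one submission now writes N framebuffers, which is exactly the tension the two approaches below resolve differently.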
Merging generically
This "merging" can be implemented generically. In this implementation, the batch tracking logic is extended to handle these "merged" batches. Then the actual mipmap generation proceeds as usual with the common code, with appropriate overrides for batch allocation and submission. !17519 (closed) implements this idea.
The new "merging" logic complicates the already-brittle batch tracking code. I'm not comfortable with the code, despite writing it (yes, I nak'd my own patch.) It's not an aesthetic problem with the patch -- any implementation of the idea risks subtle bugs (present or future). The entire driver was written under the assumption that a batch corresponds to a single framebuffer (possibly with multiple attachments). That has implications for everything from tiler data structure lifetime management to fragment scissor box tracking. !17519 (closed) changes enough random code to get mipmap generation working on Bifrost. It's still broken under Midgard due to the different tiler data structures used.
The code there is also deceptively simple: it is not generic, the optimization it implements is invalid in general, and it only works because it's only ever used with `util_gen_mipmap` and `u_blitter`. It surely encodes internal implementation details of those modules. That means that otherwise-correct internal changes to those common components could regress Panfrost, perhaps in subtle ways that CI would miss. That's significant technical debt that the short diffstat glosses over.
The optimization would also need to be rewritten in PanVK if the mipmap generation extension is eventually approved. This is the least persuasive point, but I'm including it for completeness.
Merging in pan_blit
The other approach is to implement mipmap generation entirely ourselves, generating hardcoded fragment job chains in `pan_blit`. This avoids the generic approach's maintenance hazards and "spooky action at a distance". !17520 (closed) implements this approach by extending `pan_blit` with a mipmap generation hook. Performance on Bifrost is similar between !17519 (closed) and !17520 (closed); both are significantly faster than the generic implementation we use now.
Of course, that implementation is non-portable. It hardcodes a Bifrost draw call descriptor, it has a special Midgard-only code path (due to missing frame shaders), and it will need changes to work on Valhall. I'm not sure how many changes would be required for CSF. At least it wouldn't block future porting, because we could still fall back to the slower common code as we do today.
It also adds another user of the internal `pan_blit` infrastructure. Once we switch PanVK to the common Vulkan blitter, much of `pan_blit` will become unused, and this approach would keep another user around. I don't find that argument too persuasive, though. We already need a minimal `pan_blit` implementation for tilebuffer reloads, and generating mipmaps "looks" just like a tilebuffer reload except for the texture descriptor used.
Beyond merging
Merging the job chains is the main optimization required for efficient mipmap generation. Skipping the vertex and tiler jobs, leaving only the fragment job for each level, is a more minor optimization. Skipping the vertex job is easy, as for any blit.
On Bifrost+, the tiler job can be skipped too, by mapping the draw to a frame shader (as it covers the entire render target). This probably doesn't save much, since the required tiling work is utterly trivial.
Strictly, we can skip it on Midgard too. While Midgard doesn't support frame shaders, we can hardcode the results of the trivial tiler jobs and feed those into standalone fragment jobs, implemented here. Unfortunately, this requires the driver to have knowledge of the internal tiler data structures, which are explicitly opaque to software. The problem goes beyond not having documentation -- I'm not scared of reverse-engineering. Because these structures are explicitly opaque and hardware-managed, there are no portability guarantees whatsoever. For something this simple, that might be okay if we only care about supporting Mali-T760 and Mali-T860 on that code path. Maybe other Malis would work as well; maybe they wouldn't. Mali-T720 and Mali-T820, which lack a hierarchical tiler, likely would not. Maybe it would be okay, but it seems unwise to implement. (If I recall correctly, the Midgard DDK does not try to optimize away the tiler jobs.)
Summary
Meta sucks.