panfrost: Plan a strategy for better meta operations
Both our Gallium and Vulkan drivers need to implement a number of "blit-like" operations. There are a lot of operations to worry about, there's a different set for each API, and there are lots of potential implementations with different performance/maintainability trade-offs. So far we've been implementing things in a fairly "ad hoc" manner and it's gotten out of hand (especially on the Vulkan side). Moving forward, we need a plan.
This issue attempts to explain the problem space so we can figure out solutions.
Blits
Filtered blit operations (OpenGL: `glBlitFramebuffer`. Vulkan: `vkCmdBlitImage`). The semantics for blits can be complicated. As Mali does not have blitting hardware, blits need to be translated to draws.
In Gallium, we use `u_blitter` for the translation. This works very well, and rarely deserves more thought.
In Vulkan, we roll our own implementation in `pan_blit`. This code is complicated and non-portable (hardcoding lots of Mali-specific details). The code is only used by PanVK. Porting PanVK to Valhall is blocked on (among other things) that port.
In theory, this might get slightly better performance than `u_blitter`, as `u_blitter` uses a vertex shader, which PanVK skips by issuing a tiler job directly and supplying the varyings ahead of time. In practice, this is only a minor optimization. It also turns out to be possible with `u_blitter`, by implementing the `draw_rectangle` callback ourselves to skip the vertex shader. This depends on an internal implementation detail of `u_blitter` (namely, that its vertex shader is always a passthrough), but works okay in practice. By leveraging existing `pan_blitter` and Gallium driver code, this isn't too invasive. !17499 (closed) implements a proof-of-concept of this optimization.
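To make the callback-override idea concrete, here is a minimal, self-contained sketch. The struct and function names below are hypothetical stand-ins, not the real `u_blitter` interface (the real `draw_rectangle` hook takes many more parameters); only the mechanism of swapping the rectangle-draw path is modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for the blitter context.
 * The real struct blitter_context in Mesa is much richer. */
struct blitter_ctx {
   /* Draws a screen-aligned rectangle covering [x0,x1) x [y0,y1). */
   void (*draw_rectangle)(struct blitter_ctx *ctx,
                          int x0, int y0, int x1, int y1);
   bool used_vertex_shader; /* for illustration only */
};

/* Default path: a passthrough vertex shader plus a regular draw. */
static void draw_rectangle_default(struct blitter_ctx *ctx,
                                   int x0, int y0, int x1, int y1)
{
   (void)x0; (void)y0; (void)x1; (void)y1;
   ctx->used_vertex_shader = true; /* vertex + tiler + fragment work */
}

/* Mali-style override: because the blitter's vertex shader is always
 * a passthrough, the driver can precompute the varyings on the CPU
 * and issue the tiler job directly, skipping the vertex shader. */
static void panfrost_draw_rectangle(struct blitter_ctx *ctx,
                                    int x0, int y0, int x1, int y1)
{
   (void)x0; (void)y0; (void)x1; (void)y1;
   ctx->used_vertex_shader = false; /* tiler job issued directly */
}
```

The driver would install `panfrost_draw_rectangle` at context creation; every blit routed through the callback then avoids the vertex shader.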
@jekstrand is working on common meta code. When !17438 (closed) is finished and merged, we plan to switch PanVK to use it and drop the relevant code paths in `pan_blit`, resulting in a `u_blitter`-like situation. Like `u_blitter`, this common Vulkan implementation will allow the driver to override the "rectangle draw" call, so we can implement our Mali tricks there too.
Clear image
Clear to an arbitrary image (Vulkan: `vkCmdClearColorImage`). Nonportable implementation in PanVK. To be replaced by !17438 (closed).
Clear attachment
Clear to a bound attachment (OpenGL: `glClear`. Vulkan: `vkCmdClearAttachments`). `u_blitter` implementation in Gallium since !17112 (merged). Mostly equivalent nonportable implementation in PanVK. To be replaced by !17438 (closed).
For depth/stencil clears in particular, we could do a bit better on Valhall with "full screen draws", a new job type with the property that the hardware can skip adjacent full screen draws that share the same draw call descriptor. In other words, full screen draws must be idempotent. That requirement enables a minor optimization for attachment clears but makes full screen draws unsuitable as a general-purpose fullscreen draw primitive. It's unclear whether this actually makes a difference even for depth/stencil clears... couldn't we detect that pattern on the CPU anyway?
Fill buffer
Fill a buffer with a particular pattern (OpenGL: `glClearBufferSubData`. OpenCL: `clEnqueueFillBuffer`. Vulkan: `vkCmdFillBuffer`). This corresponds to the Gallium hook `clear_buffer`. When `clear_buffer` isn't implemented, mesa/st provides a software fallback. Clover/Rusticl do not, so we implement `clear_buffer` with `u_default_clear_buffer` (since !16044 (merged)), which is another software fallback.
Software fallbacks here have performance issues. Filling should instead be implemented as a compute shader. @mareko proposed that we implement "something like u_blitter that uses compute shaders to implement clears and blits" at the Gallium level, which may be important for OpenCL performance. Other than the vector sizes required for optimal throughput, Mali has no special performance requirements here. If/when a common compute shader implementation is merged, it will be trivial to switch the Gallium driver over.
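As a reference for what a compute-shader fill would compute, here is the fill semantics expressed on the CPU; each GPU invocation would handle one (vectorized) chunk of this loop. The function name and signature are illustrative, not the Gallium `clear_buffer` hook.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* CPU reference for a pattern fill. A compute-shader implementation
 * does the same work in parallel: each invocation computes its byte
 * offset and stores one copy of the repeating pattern there. */
static void
fill_buffer_pattern(uint8_t *buf, size_t size,
                    const uint8_t *pattern, size_t pattern_size)
{
   /* The APIs (clEnqueueFillBuffer, vkCmdFillBuffer, ...) guarantee
    * the fill range is a multiple of the pattern size. */
   for (size_t offset = 0; offset < size; offset += pattern_size)
      memcpy(buf + offset, pattern, pattern_size);
}
```

On the GPU, the per-invocation store would be widened to the vector size the hardware prefers for peak throughput.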
On the Vulkan side, PanVK implements this nonportably. To be replaced by !17438 (closed).
Copies
Various copy operations are required by the APIs (Vulkan: `vkCmdCopyImage`, `vkCmdCopyImageToBuffer`, etc. OpenGL: `glCopyImageSubData`, `glCopyBufferSubData`, etc.)
In Gallium, these get mapped to the `resource_copy_region` hook, which currently falls back to software in `util_resource_copy_region`. In addition to being slow, this conflicts with AFBC (framebuffer compression that we apply even to textures that are otherwise uncompressed). We've disabled `ARB_copy_image` in the meantime because that extension was a mistake, but I digress.
In Vulkan, these are implemented nonportably with a complicated mix of compute and fragment shaders. The manual descriptor packing should be easy enough to replace with meta operations, which will remove the main source of portability problems across Malis (and unblock basic Valhall/PanVK support). Due to AFBC, it's not clear to me whether copy operations can be implemented generically (and efficiently) on the GPU, even with !17438 (closed). Here are some considerations that make this tricky:
- Using the OpenGL meanings of the terms, textures and framebuffers support AFBC but shader images do not. As soon as a texture is bound as a shader image (for read or write), the Gallium driver triggers a decompression blit, and AFBC is lost for the rest of the resource's lifetime. PanVK doesn't support AFBC yet, but when it does, it won't be able to use AFBC for resources with writeable image usage flags. This implies that any operation writing to an image needs to happen with fragment shaders. (`vkCmdCopyImageToBuffer` should be a compute shader that uses `texelFetch` and an SSBO write, whereas `vkCmdCopyBufferToImage` should be a fragment shader that uses an SSBO read and a `gl_FragColor` write.)
- AFBC depends on the specific pixel format. It has transforms that depend on the colour channels themselves; it doesn't simply treat pixels as a "bag of bits" like the copy routines do. Copying between incompatible AFBC formats (`R8G8 <---> R5G6B5`, for example) requires some nasty bit packing in the shader, as PanVK implements. These cases should be rare and could be solved by copying through an intermediate (possibly uncompressed) "staging" resource, which is the approach turnip takes. Unlike with Gallium, they shouldn't be solved by disabling AFBC.
- Starting with Valhall, the driver can specify the physical AFBC format separately from the logical pixel format accessed by the shader. This feature allows implementing texture views efficiently, trivializing problem #2 (closed) on Valhall, provided the functionality can be piped through the driver. It is not a "get out of jail free" card if we want to continue supporting Bifrost in PanVK.
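To illustrate why incompatible-format copies need bit packing: both formats are 16 bits per texel, so a raw "bag of bits" copy just moves the 16-bit word, but a copy shader working through an `R8G8` view of `R5G6B5` data must reproduce the packing by hand. The sketch below assumes little-endian storage and the Vulkan `R5G6B5` bit layout; it is an illustration, not PanVK's actual code.

```c
#include <stdint.h>

/* R5G6B5 packing: bits [15:11] = r, [10:5] = g, [4:0] = b
 * (matching VK_FORMAT_R5G6B5_UNORM_PACK16). */
static uint16_t pack_r5g6b5(uint8_t r5, uint8_t g6, uint8_t b5)
{
   return (uint16_t)(((r5 & 0x1f) << 11) | ((g6 & 0x3f) << 5) | (b5 & 0x1f));
}

/* The same 16 bits seen through an R8G8 view (little-endian):
 * r8 is the low byte, g8 the high byte. A copy shader bound to the
 * R8G8 view has to split/merge the 5:6:5 fields across this byte
 * boundary by hand. */
static void as_r8g8(uint16_t bits, uint8_t *r8, uint8_t *g8)
{
   *r8 = (uint8_t)(bits & 0xff);
   *g8 = (uint8_t)(bits >> 8);
}
```

Note how pure red in `R5G6B5` lands entirely in the `G` channel of the `R8G8` view, which is exactly the kind of channel scrambling AFBC's channel-aware transforms cannot tolerate.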
I am hopeful that !17438 (closed) will help some of these problems for PanVK, but I suspect we'll still need some driver magic. We'll see what @jekstrand can cook up in common.
I don't know if we need to support this (efficiently or at all) in the GL driver. For now I am happy to keep this as a Vulkan-specific craziness. If we do need it, we'll cross that bridge when we get there.
Tilebuffer reloads
Mali GPUs are tilers. At the start of a render pass, the tilebuffer is always cleared. APIs sometimes require preserving framebuffer contents (OpenGL: no `glClear`. Vulkan: `loadOp = LOAD_OP_LOAD`). The driver must insert a fragment draw at the beginning of the render pass that binds the existing framebuffer contents as a texture, samples them, and writes the results into the tilebuffer.
The required pipeline is similar to "regular" blits. However, it may be dispatched specially, as it must precede all other draws. (OpenGL: the driver does not currently know whether the reload is necessary until the end of the frame, although this could be reworked. Vulkan: unnecessary.) On Midgard, this is handled by a special "inject" option in `pan_scoreboard`. On Bifrost+, it is handled by dispatching a pre-frame shader. Tilebuffer reloads are the only current users of frame shaders.
Reloads are currently implemented in `pan_blit`. This code is entirely Mali-specific (with significant portability challenges for new hardware, including v10). The implementation is shared between the APIs.
Mipmap generation
The application may supply a base image and require the driver to generate a full miptree (OpenGL: `glGenerateMipmap`. Vulkan: not available; a draft extension was abandoned due to lack of interest.)
Gallium has generic code (`util_gen_mipmap`) to translate `generate_mipmap` calls into a series of `blit` calls, one per level, which Panfrost uses. The blits are then implemented with ([vertex and] tiler and) fragment jobs via `u_blitter`.
The trouble with this approach is that each blit has a different destination, corresponding to a different FBO and hence a different `panfrost_batch`. So if we need to generate N levels of mipmaps, we need (at least) N job chains. Under our current UABI, that involves (at least) N roundtrips to the kernel. That is a lower bound; the actual performance is substantially worse due to the pervasive assumption that batches are large and that we can amortize expensive operations like memory allocation over an entire batch.
Generating mipmaps efficiently requires "merging" these batches into a single "superbatch" that writes out all levels at once. Internally, each level still has its own ([vertex and] tiler and) fragment job, but all of these jobs are chained together into one (or two) job chains.
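The superbatch shape can be sketched with a toy model (the names are hypothetical, not the real `panfrost_batch` structures): each level keeps its own job, but the jobs are linked into a single chain so only one submission is needed.

```c
#include <stddef.h>

/* Hypothetical, simplified model of the "superbatch" idea: one
 * fragment job per mip level, linked into a single hardware job
 * chain submitted with one kernel roundtrip instead of N. */
struct frag_job {
   int level;              /* mip level this fragment job writes */
   struct frag_job *next;  /* job chain link */
};

/* Link the per-level jobs into one chain; returns the head, which
 * would be handed to a single submit ioctl. */
static struct frag_job *build_superbatch(struct frag_job *jobs, int levels)
{
   for (int i = 0; i + 1 < levels; i++)
      jobs[i].next = &jobs[i + 1];
   jobs[levels - 1].next = NULL;
   return &jobs[0];
}
```

The hard part, of course, is not the chaining itself but teaching the batch-tracking code that one submission now writes N framebuffers, which is exactly the tension the two approaches below resolve differently.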
Merging generically
This "merging" can be implemented generically. In this implementation, the batch tracking logic is extended to handle these "merged" batches. Then the actual mipmap generation proceeds as usual with the common code, with appropriate overrides for batch allocation and submission. !17519 (closed) implements this idea.
The new "merging" logic complicates the already-brittle batch tracking code. I'm not comfortable with the code, despite writing it (yes, I nak'd my own patch.) It's not an aesthetic problem with the patch -- any implementation of the idea risks subtle bugs (present or future). The entire driver was written under the assumption that a batch corresponds to a single framebuffer (possibly with multiple attachments). That has implications for everything from tiler data structure lifetime management to fragment scissor box tracking. !17519 (closed) changes enough random code to get mipmap generation working on Bifrost. It's still broken under Midgard due to the different tiler data structures used.
The code there is also deceptively simple: it is not generic, the optimization it implements is invalid in general, and it only works because it's only ever used with `util_gen_mipmap` and `u_blitter`. It surely encodes internal implementation details of those modules. That means that otherwise-correct internal changes to those common components could regress Panfrost, perhaps in subtle ways that CI would miss. That's significant technical debt that the short diffstat glosses over.
The optimization would also need to be rewritten in PanVK if the mipmap generation extension is eventually approved. This is the least persuasive point, but I'm including it for completeness.
Merging in pan_blit
The other approach is to implement mipmap generation entirely ourselves, generating hardcoded fragment job chains in `pan_blit`. This avoids the generic approach's maintenance hazards and "spooky action at a distance". !17520 (closed) implements this approach by extending `pan_blit` with a mipmap generation hook. Performance on Bifrost is similar between !17519 (closed) and !17520 (closed); both are significantly faster than the generic implementation we use now.
Of course, that implementation is non-portable. It hardcodes a Bifrost draw call descriptor, it has a special Midgard-only code path (due to missing frame shaders), and it will need changes to work on Valhall. I'm not sure how many changes would be required for CSF. At least it wouldn't block future porting, because we could still fall back to the slower common code as we do today.
It also adds another user of the internal `pan_blit` infrastructure. Once we switch PanVK to the common Vulkan blitter, much of `pan_blit` will become unused, and this approach would keep another user around. I don't find that argument too persuasive, though. We already need a minimal `pan_blit` implementation for tilebuffer reloads, and generating mipmaps "looks" just like a tilebuffer reload except for the texture descriptor used.
Beyond merging
Merging the job chains is the main optimization required for efficient mipmap generation. Skipping the vertex and tiler jobs, leaving only the fragment job for each level, is a more minor optimization. Skipping the vertex job is easy, as for any blit.
On Bifrost+, the tiler job can be skipped too, by mapping the draw to a frame shader (as it covers the entire render target). This probably doesn't save much, since the required tiling work is utterly trivial.
Strictly, we can skip it on Midgard too. While Midgard doesn't support frame shaders, we can hardcode the results of the trivial tiler jobs and feed those into standalone fragment jobs, implemented here. Unfortunately, this requires the driver to have knowledge of the internal tiler data structures, which are explicitly opaque to software. The problem goes beyond not having documentation -- I'm not scared of reverse-engineering. Because these structures are explicitly opaque and hardware-managed, there are no portability guarantees whatsoever. For something this simple, that might be okay if we only care about supporting Mali-T760 and Mali-T860 on that code path. Maybe other Malis would work as well; maybe they wouldn't. Mali-T720 and Mali-T820, which lack a hierarchical tiler, likely would not. Maybe it would be okay, but it seems unwise to implement. (If I recall correctly, the Midgard DDK does not try to optimize away the tiler jobs.)
Summary
Meta sucks.