Triang3l (Vitaliy Kuzmin) requested to merge Triang3l/mesa:radv_pops_cheeks_of_frogs into main Apr 01, 2023

(Totally not an April Fools' Day joke, I guess 🙃)

This merge request adds an implementation of VK_EXT_fragment_shader_interlock (Primitive Ordered Pixel Shading — POPS — in AMD hardware terms), a highly demanded feature in the emulation community (including for the Xbox 360 emulator I'm working on, Xenia), and one of the requirements of Direct3D feature level 12_1.

Thanks to all the froggies in the Linux Gaming Dev pond who have been helping me with understanding the architecture of NIR and ACO (this is my first non-tiny merge request), with testing, and especially @mareko for answering my extremely specific whack-a-mole questions about obscure edge cases! 🐸

The extension provides the functionality for performing arbitrary read–modify–writes from fragment shaders in the order primitives are submitted in, including in cases of multiple overlaps and intersecting primitives. Some of the usages include:

Programmable blending:
- Custom color blend equations (possibly including OpenGL advanced blend equations, although I'm not sure about those as late depth/stencil can't cancel memory writes from a shader);
- Non-color blending (such as screen-space decals applied to a normal G-buffer);
- Custom pixel encodings, especially in console emulation;
- Shader framebuffer fetch emulation (such as on the PlayStation Vita);
Order-independent transparency:
- No incoherence between frames that's present in unordered one-pass methods;
- No strict layer count limit between compositions that two-pass methods have;
- Can compose multiple times at any moment while drawing: objects can be sorted coarsely, and OIT with a small number of layers can be used just to resolve conflicts within objects, while still having practically unlimited layers globally;
Advanced binning, such as immediately inserting lights in a tree along the Z axis in 2.5D clustering.

Real-world software using this extension:

Translation layers over Vulkan or OpenGL:
- DXVK (Direct3D 11.3);
- VKD3D (Direct3D 12) — requires this extension to advertise feature level 12_1 support;
- Ryujinx (Nintendo Switch) — implementing fragment shader interlock itself (used in Super Mario Party) and possibly coherent attachment feedback loop or at least advanced blend equations (The Legend of Zelda: Tears of the Kingdom, Xenoblade Chronicles);
- Zink (OpenGL);
Applications using it internally:
- Play! PlayStation 2 emulator (Vulkan) — guest output-merger emulation (custom blending equations, dithering, and other things);
- Vita3K PlayStation Vita emulator (Vulkan, OpenGL) — shader framebuffer fetch emulation;
- Xenia Xbox 360 emulator (Vulkan, --gpu=vulkan --render_target_path_vulkan=fsi, or Direct3D 12, --gpu=d3d12 --render_target_path_d3d12=rov) — guest output-merger emulation (custom pixel formats);
Games:
- A Plague Tale: Requiem (Direct3D 12) — G-buffer decal blending;
- GRID 2 (Direct3D 11, though appears to be Intel's PixelSync extension rather than ROVs at least in the initial release) — foliage and cutout surface order-independent transparency, particle shadow mapping;
- Just Cause 3 (Direct3D 11) — foliage order-independent transparency;
- Super Mario Party (Nintendo Switch) — appears to be clustering of lights;
Samples:
- Nvidia's vk_order_independent_transparency (Vulkan) — has both PixelInterlock and SampleInterlock cases with 1, 4 and 8 samples;
- Intel's Adaptive Order Independent Transparency (Direct3D 11);
- Nvidia's Normal Blended Decal Sample (OpenGL);
- Christoph Neuhauser's Flow and Stress Line Visualization (Vulkan) and OIT Rendering Tool (OpenGL).

Implementation overview

This merge request implements the feature ONLY ON THE ACO PATH, not on LLVM. For this reason, it doesn't include a GL_ARB_fragment_shader_interlock implementation as RadeonSI currently doesn't support ACO. The upstream LLVM code currently doesn't have any facilities for POPS, and even has some assertions preventing the use of the concepts involved. Because this is a complex feature involving control flow and scheduling changes, I don't want to create an interface in LLVM that would diverge too much from what AMDGPU-PRO uses internally. Instead, I want to wait until AMD submits their interface to the LLVM upstream, and create the necessary NIR-to-LLVM conversion using it later.

The implementation is based on four sources:

Disassembly of shaders using Direct3D 12 Rasterizer-Ordered Views from the Radeon GPU Analyzer;
Instruction set architecture documentation, including the (slightly mismatching when it comes to the bit layout, but conceptually correct) documentation of the "POPS collision wave ID" argument SGPR;
AMD Platform Abstraction Layer for part of register setup;
Chat with Marek Olšák regarding non-obvious details of the setup.

Most of the shader logic behind how this feature works is explained in docs/drivers/amd/hw/pops.rst.

ACO changes

On the ACO side, I wanted to provide balance between the use of pseudo-instructions for simplicity and actual usage of the overall compiler infrastructure, especially since the code involves a loop. Most of the wait itself, however, is done purely in instruction selection, with the exception of one instruction that accesses src_pops_exiting_wave_id itself, and one pseudo-instruction that's not lowered to any instructions and used purely as a location marker for other areas of the compiler.

The Vega–RDNA2 overlapped wave awaiting code is pretty alien to the usual scheduling architecture on multiple levels:

It's just a bunch of free-standing mostly-SALU instructions, essentially dead code.
The code rechecks the value of the same operand, src_pops_exiting_wave_id, so it may be tempting to assume that its value stays the same — while in reality it's volatile.

What the wait code does is:

s_setreg(HW_REG_POPS_PACKER, POPS_EN(1) | PACKER_ID(packer_sgpr));
while ((src_pops_exiting_wave_id + remap_wave_id_to_monotonic_offset_sgpr) <=
       newest_overlapped_wave_id_sgpr) {
   s_sleep(gfx_level >= GFX10 ? 0xFFFF : 3);
}

(Note that the comparison and the sleep are not atomic, but it's what AMD's own compiler emits, so this probably should be fine — I guess the wakeup signal stays set until the next sleep? Though I don't know how it works in the hardware, of course.)

To make src_pops_exiting_wave_id volatile, I created a new pseudo-instruction that's lowered to s_add_i32 with src_pops_exiting_wave_id being one of the operands, which is explicitly excluded from value numbering. Other instructions involved in the wait don't have result definitions, only side effects, so they can't be eliminated.
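Why the read must not be merged with earlier reads can be modeled on the CPU. The sketch below is only an illustration: the trace structure, the function names and the sleep cap are hypothetical, the IDs are treated as plain 32-bit unsigned integers, and no hardware wrap-around handling is modeled. It compares a loop that re-reads the ID on every iteration with one that reads it once, which is roughly what value numbering would turn the first one into.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for src_pops_exiting_wave_id: every read returns the
 * next value of a pre-recorded trace, so the value changes between reads. */
struct exiting_wave_id_trace {
   const uint32_t *values;
   size_t count;
   size_t next;
};

static uint32_t read_exiting_wave_id(struct exiting_wave_id_trace *trace)
{
   assert(trace->count > 0);
   /* Once the trace is exhausted, keep returning its last value. */
   size_t i = trace->next < trace->count ? trace->next++ : trace->count - 1;
   return trace->values[i];
}

/* Re-reads the ID on every iteration, like the wait loop in the text.
 * Returns the number of sleeps, capped at max_sleeps (illustration only). */
static unsigned wait_for_overlapped_waves(struct exiting_wave_id_trace *trace,
                                          uint32_t remap_offset,
                                          uint32_t newest_overlapped_wave_id,
                                          unsigned max_sleeps)
{
   unsigned sleeps = 0;
   while (sleeps < max_sleeps &&
          read_exiting_wave_id(trace) + remap_offset <= newest_overlapped_wave_id)
      sleeps++; /* stands in for s_sleep */
   return sleeps;
}

/* Deliberately wrong variant: reads the ID once and reuses the result, as if
 * all reads had been merged into one. It only stops at the cap. */
static unsigned wait_with_single_read(struct exiting_wave_id_trace *trace,
                                      uint32_t remap_offset,
                                      uint32_t newest_overlapped_wave_id,
                                      unsigned max_sleeps)
{
   const uint32_t id = read_exiting_wave_id(trace);
   unsigned sleeps = 0;
   while (sleeps < max_sleeps && id + remap_offset <= newest_overlapped_wave_id)
      sleeps++;
   return sleeps;
}
```

With the trace 5, 5, 6, 7, 8, an offset of 0 and a newest overlapped wave ID of 7, the first function sleeps 4 times and exits, while the second one never sees the ID advance and runs into the cap.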

A new rule was added to the scheduling so instructions involved in waiting for the completion of overlapped waves can't be reordered upwards, because we want the critical section to be as short as possible, and thus the wait should be close to it.

To make sure memory accesses are not moved before the wait, a new p_pops_gfx9_overlapped_wave_wait_done pseudo-instruction was added, which is treated as a queue family scope acquire barrier (implementing the acquire semantics of OpBeginInvocationInterlockEXT across fragments defined by the Vulkan specification).

The end of the critical section on Vega–RDNA2 is indicated by the p_pops_gfx9_ordered_section_done pseudo-instruction, which is a queue family scope release barrier (representing the release barrier between fragments implied by OpEndInvocationInterlockEXT). It's not reorderable downwards, to make sure we exit the critical section as early as possible.

MSG_ORDERED_PS_DONE must be sent on every execution path after the packer has been set (sending it when the packer is never set is okay, but not necessary). It's always sent from the top control flow level, from a location in a top-level block that is after all p_pops_gfx9_ordered_section_done instructions. This was chosen for simplicity, because ACO shaders are single-return (with the exception of the early exit if all fragments in the wave are discarded), and because ACO doesn't seem to have sufficient post-dominance metadata (although I'm not sure). Top-level blocks are also indexed in execution order, which makes it easy to see where each point in the shader is located relative to that location.

There's one exception to the single-return rule, and that is the discard early exit. If it's taken before MSG_ORDERED_PS_DONE is sent, a slightly different version of the discard early exit block is used, which does s_sendmsg(MSG_ORDERED_PS_DONE) before exiting.
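The choice of the block that sends MSG_ORDERED_PS_DONE can be sketched as a search over blocks indexed in execution order. This is a simplified model, not ACO code: struct block_info and find_ordered_ps_done_block are hypothetical names, and each block is reduced to the two properties the description relies on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a block; not ACO's actual data structure. */
struct block_info {
   bool is_top_level;             /* block is at the top control flow level */
   bool has_ordered_section_done; /* contains p_pops_gfx9_ordered_section_done */
};

/* Blocks are assumed to be indexed in execution order. Returns the index of
 * the first top-level block at or after the last block that ends an ordered
 * section, or -1 if no block ends one or no such top-level block exists.
 * When the returned block itself ends an ordered section, the message would
 * go after that instruction within the block. */
static int find_ordered_ps_done_block(const struct block_info *blocks, size_t count)
{
   assert(blocks != NULL || count == 0);

   int last_done = -1;
   for (size_t i = 0; i < count; i++) {
      if (blocks[i].has_ordered_section_done)
         last_done = (int)i;
   }
   if (last_done < 0)
      return -1;

   for (size_t i = (size_t)last_done; i < count; i++) {
      if (blocks[i].is_top_level)
         return (int)i;
   }
   return -1;
}
```

For a shader whose ordered section ends inside a branch (block 1 of 4, with blocks 0 and 3 at the top level), the search returns block 3, the merge point that every path reaches.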

The s_sendmsg(MSG_ORDERED_PS_DONE) forces waiting for vmcnt(0) & vscnt(0) so overlapping waves are not resumed while there are still memory accesses in flight. During the normal shader flow, the waits are inserted if there are outstanding requests. In the discard early exit block, the s_waitcnt is placed unconditionally, because the counters are not properly tracked there.

If there are SMEM loads from buffers or global memory, lgkmcnt(0) is also awaited before exiting the critical section. Naturally, those shouldn't be happening, as fragment shader interlock synchronizes only within a single screen pixel, so the address will not be the same for different lanes in a wave, as that would lead to different pixels accessing the same memory, a case not synchronized by fragment shader interlock. But nothing (other than sanity if it's not lost) can stop an application from, for example, placing a memory access in an address scalarization loop (and ending up with 64 iterations of it executed) — this makes no sense, but this should be handled safely by this implementation.

On RDNA 3, things are slightly simpler. The s_wait_event export_ready is inserted for begin_invocation_interlock just like the sendc on Intel, and s_wait_event export_ready is treated as a queue family scope acquire barrier.

Because on RDNA 3, the critical section is exited by performing an export with the done flag, in many cases it's not possible to unlock the mutex in the middle of the shader. If the shader doesn't write any outputs — a common case of fragment shader interlock usage — it might have been possible to place the done export earlier, after the end of the critical section. However, that's out of scope of this merge request for simplicity — that would need some wider architectural changes, which may be done later in a separate merge request if anything would really benefit from that (which is unlikely, as in shaders using fragment shader interlock but not outputs, usually the whole purpose is just to write something to the rasterizer-ordered resources, and some unordered access afterwards would probably be uncommon, though it may be used if accessing something indirectly while only ordering access to indices/pointers). But anyway, if the shader uses POPS, now a queue family release barrier is placed before the exports, and the done export also waits for vmcnt(0) & vscnt(0).

Context setup

The main settings are in the DB_SHADER_CONTROL register. There, POPS is enabled. But in addition, there are very important differences from PAL here — though I haven't seen any issues with them so far in multiple scenarios with different architecture revisions, sample counts, and the presence of framebuffer attachments, so I think those are safe — and I absolutely don't want to remove one of them unless it's known to be fatally broken.

The difference from PAL is that for SampleInterlock, I'm setting POPS_OVERLAP_NUM_SAMPLES (or OVERRIDE_INTRINSIC_RATE on RDNA 3) to the number of rasterization samples (more precisely MSAA_EXPOSED_SAMPLES, the number of bits in gl_SampleMaskIn, which shaders use to see whether they're allowed to access the data for each of the samples) — unlike PAL, which doesn't set that field at all (leaving it 1x, resulting in PixelInterlock behavior).
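A minimal sketch of how the field value follows from the interlock mode, assuming the log2 encoding implied by the "POPS_OVERLAP_NUM_SAMPLES = log2(1)" notation used in this description. The enum and function names are hypothetical and are not RADV or PAL identifiers, and the RDNA 3 OVERRIDE_INTRINSIC_RATE encoding is not modeled.

```c
#include <assert.h>

/* Hypothetical names for the two interlock granularities discussed here. */
enum interlock_mode { INTERLOCK_PIXEL, INTERLOCK_SAMPLE };

/* log2 of a power-of-two sample count (1, 2, 4, 8, ...). */
static unsigned log2_of_sample_count(unsigned samples)
{
   assert(samples != 0 && (samples & (samples - 1)) == 0);
   unsigned result = 0;
   while (samples > 1) {
      samples >>= 1;
      result++;
   }
   return result;
}

/* SampleInterlock: overlap is tracked per rasterization sample.
 * PixelInterlock: the field stays at log2(1) = 0, so any two fragments
 * touching the same pixel are treated as overlapping. */
static unsigned pops_overlap_num_samples(enum interlock_mode mode,
                                         unsigned rasterization_samples)
{
   if (mode == INTERLOCK_SAMPLE)
      return log2_of_sample_count(rasterization_samples);
   return 0;
}
```

For 4x MSAA this gives 2 with SampleInterlock and 0 with PixelInterlock, the latter being the value PAL effectively leaves in place for both modes.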

With PixelInterlock (POPS_OVERLAP_NUM_SAMPLES = log2(1)), when multisampling is used, adjacent primitives cover the same pixels along their common edge, even though they don't have any sample coverage in common, and that results in fragment shader invocations being interlocked there too. This results in abysmal performance. In nvpro-samples/vk_order_independent_transparency, with the default parameters of objects, I'm getting the following frame times on the RX Vega 10 (Raven Ridge):

No MSAA: 26 ms;
4x sample shading and SampleInterlock: 71 ms;
4x pixel shading and PixelInterlock: 875 ms.

On the RX 7900 XT (RDNA 3), the situation isn't much better:

No MSAA: 1.0 ms;
4x sample shading and SampleInterlock: 2.8 ms;
4x pixel shading and PixelInterlock: 29.0 ms.
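To put the two sets of measurements side by side, the frame times can be turned into slowdown factors. The helper below is just the division, written out so the ratios can be rechecked; the function name is made up for this example.

```c
#include <assert.h>

/* Ratio of a measured frame time to a baseline frame time, both in ms. */
static double slowdown(double baseline_ms, double measured_ms)
{
   assert(baseline_ms > 0.0);
   return measured_ms / baseline_ms;
}

/* PixelInterlock relative to SampleInterlock, both at 4x MSAA:
 *   RX Vega 10: slowdown(71.0, 875.0) is roughly 12.3
 *   RX 7900 XT: slowdown(2.8, 29.0)   is roughly 10.4
 * SampleInterlock at 4x relative to no MSAA:
 *   RX Vega 10: slowdown(26.0, 71.0)  is roughly 2.7
 *   RX 7900 XT: slowdown(1.0, 2.8)    is 2.8
 */
```

On both GPUs, PixelInterlock with 4x MSAA is an order of magnitude slower than SampleInterlock, while SampleInterlock itself costs less than 3x over the non-multisampled case.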

(Reproducibility note: The numbers are slightly outdated here, they became around 12% smaller after removing intrawave collision handling, but the ratios are still the same.)

A massive advantage of Vulkan and OpenGL fragment shader interlock over Direct3D, GL_INTEL_fragment_shader_ordering and Metal here is that it allows you to request SampleInterlock with pixel-rate shading (or vice versa) explicitly. The Xenia emulator (which does per-sample programmable blending and pixel-rate shading) uses SampleInterlock on Vulkan, and on AMD, it's much faster on this Vulkan implementation than on Direct3D (when it doesn't crash on Direct3D) thanks to the ability to use SampleInterlock.

Also, unlike PAL, I'm enabling EXEC_IF_OVERLAPPED. I don't know what it does, and it makes no performance difference in the cases that I've tried, but they had the majority of the shader in the critical section, so that's possibly why. But from its name, it somewhat suggests that it allows fragment shaders for overlapping invocations to run even if the overlapped ones haven't exited yet, which is largely the point of that wait loop? I have no idea. This is something that I need AMD's information on. But if that's the case, yes, I want to keep it too unless it's totally broken, so if the shader just has a small critical section in the end (for programmable blending, order-independent transparency list appending), I totally want the independent part (which may be huge and include things like lighting) to run in parallel whenever it's possible.

The POPS_DRAIN_PS_ON_OVERLAP hang workaround (PAL waMiscPopsMissedOverlap) for Vega 1.0 and Raven Ridge is now enabled conditionally, similarly to how that's done in PAL (previously RadeonSI and RADV somewhat reserved it for the future by always enabling it, but the former doesn't provide POPS currently, and the latter handles it properly now), for 8x+ AA. It reduces performance by around 1.8x when 8x+ MSAA is used, but it's necessary for avoiding a hang found by AMD.
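The condition described above can be sketched as a small predicate. The enum and function names are hypothetical (they are not the identifiers used in RADV or PAL), and the sketch assumes that the chip and the sample count are the only inputs, which is all this description states.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical chip identifiers for illustration only. */
enum example_chip {
   EXAMPLE_CHIP_VEGA_1_0,
   EXAMPLE_CHIP_RAVEN_RIDGE,
   EXAMPLE_CHIP_OTHER,
};

/* True when the POPS_DRAIN_PS_ON_OVERLAP workaround would be enabled:
 * only on the two affected chips, and only with 8x or higher AA. */
static bool needs_pops_drain_ps_on_overlap(enum example_chip chip, unsigned samples)
{
   assert(samples >= 1);
   const bool affected_chip =
      chip == EXAMPLE_CHIP_VEGA_1_0 || chip == EXAMPLE_CHIP_RAVEN_RIDGE;
   return affected_chip && samples >= 8;
}
```

Keeping the check this narrow means the roughly 1.8x cost is paid only in the configurations where the hang can occur.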

(Moved to !22375 (merged)) To match PAL, I also changed what DB_Z_INFO::NUM_SAMPLES is set to on RDNA 3 if no depth/stencil buffer is bound: previously it was max(color samples, depth/stencil samples), or 1x without attachments, and now it's MSAA_EXPOSED_SAMPLES (matching the rasterization sample count). This also affects occlusion queries (possibly determining the number of samples counted per pixel) and variable shading rate. I'm not entirely sure how correct that is across all the edge cases of how multisampling is exposed in Vulkan, both in the no-attachment and color-only cases, plus the Bresenham line drawing workaround, which forces the rasterization sample count to 1: I'm not sure how occlusion queries and POPS, for example, should count samples there, or what gl_SampleMaskIn should receive. But without this change, all CTS fragment shader interlock tests using MSAA fail. I haven't touched the same parameter in RadeonSI, but it may need tweaking as well, though I'm not entirely sure how target-independent rasterization is supposed to work in OpenGL.

Using POPS also enables a dummy 32_R color export if the shader doesn't need any real exports. Without that, POPS just hangs on RDNA 3, and doesn't synchronize invocations properly (failing CTS tests) on RDNA 2 (tested by bluestang) and likely RDNA 1.

I also tried setting PA_SC_BINNER_CNTL_0::BIN_MAPPING_MODE to BIN_MAP_MODE_POPS when the context uses POPS, hoping for a better rasterization pattern and higher performance. However, it made no difference for me, so I decided not to touch the binner settings at all (especially because I don't know when it's safe to change them). PAL also doesn't use BIN_MAP_MODE_POPS.

With variable shading rate, POPS forces the shading rate to 1x1, so fragmentShadingRateWithFragmentShaderInterlock and fragmentShaderShadingRateInterlock are both unsupported.

This implementation makes no distinction between the ordered and unordered (mutex-only) interlock modes exposed by Vulkan and OpenGL, always providing the ordered variant. Initially, I wanted to permit out-of-order rasterization for unordered interlock, but during the development, automatic out-of-order rasterization enabling was removed from RADV completely. I'm not sure, however, how compatible POPS is with out-of-order rasterization in general. But since POPS is primarily a Depth Block feature, I think it should treat out-of-order rasterization largely the same as the usual depth/stencil test does? But I have absolutely no idea.

Test results

The implementation was tested in the following scenarios:

RX Vega 10 (Vega Raven Ridge), by me:
- dEQP-VK.fragment_shader_interlock.*;
- nvpro-samples/vk_order_independent_transparency;
- Xenia;
- A custom test for SampleInterlock with variableMultisampleRate (pixel shader code).
RX 6800 XT (RDNA 2 Navi 2.1), by bluestang:
- dEQP-VK.fragment_shader_interlock.*;
RX 7900 XT (RDNA 3 Navi 3.1), by me:
- dEQP-VK.fragment_shader_interlock.*;
- Some other dEQP must-pass tests;
- nvpro-samples/vk_order_independent_transparency;
- Xenia.

On all the tested configurations, the test results for dEQP-VK.fragment_shader_interlock.* are:

Passed: 384 (PixelInterlock and SampleInterlock);
Failed: 0;
Not supported: 192 (ShadingRateInterlock).

Edited Jun 21, 2023 by Triang3l (Vitaliy Kuzmin)


radv: Fragment shader interlock implementation
