nir: Lower AMD specific shader I/O, merge shaders and handle NGG
This is the grand plan for unifying some code that is currently duplicated between three shader compiler backends: RadeonSI/LLVM, RADV/LLVM and RADV/ACO. This plan will help us de-duplicate a lot of code, and also enable us to implement new features, such as NGG primitive culling, in NIR without specifically catering to them in the compiler backends; in contrast, currently each backend must be tailored to support each feature or edge case.
The following milestones list all the features that I'd like to cover. Each milestone should go into a separate MR series which implements the listed functionality and switches ACO over to it. The other backends can then be integrated in separate merge requests.
I implemented most of these features in ACO, so I feel that it will be straightforward for me to rewrite them to NIR and adapt the ACO code base. I'd appreciate some help from the team with integrating the new NIR code with the other backends.
Pre-requisites:

- Unified way to handle shader arguments - already done by Connor.
- Shader argument intrinsics - already done by Rhys in his descriptor lowering branch.
Milestone 1/A: Lower AMD-specific shader I/O in NIR

- Current status: each backend implements the NIR I/O intrinsics on its own (mostly the same way, but with subtle differences).
- Proposal: where applicable, lower NIR I/O intrinsics to memory accesses: either shared memory or VRAM.
- Benefits, notes:
  - We can get rid of all the ACO and LLVM backend code which deals with this I/O.
  - Lowering to VRAM is not a requirement, but it is pretty easy to do and helps us get rid of the rest of the backend-specific I/O code.
  - This also brings the LLVM backend up to par with ACO in some optimizations that are currently only implemented in ACO.
Tasks:

- Implement NIR I/O lowering passes (in progress):
  - LS outputs -> shared memory store
  - HS inputs -> shared memory load
  - HS outputs, including tess factors -> shared memory access and VRAM store
  - TES inputs -> VRAM load
  - ES outputs -> shared memory store (GFX9+) / VRAM store (GFX8 and older)
  - GS inputs -> shared memory load (GFX9+) / VRAM load (GFX8 and older)
- Integrate the new NIR passes into the backends:
  - RADV/ACO
  - RADV/LLVM
  - RadeonSI/LLVM
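Most of the mappings above boil down to plain address arithmetic: each LS (vertex shader) invocation stores its outputs to LDS at an offset derived from its invocation index, and the HS loads them back from the same location. The following is a minimal sketch of that arithmetic; the function name, the vec4 slot layout and the per-vertex stride are illustrative assumptions, not Mesa's actual code.

```c
#include <stdint.h>

/* One vec4 output slot occupies 16 bytes (assumed layout). */
#define SLOT_SIZE 16u

/* Byte offset in LDS at which invocation `invocation_id` stores the output
 * in slot `driver_location`, given `num_slots` output slots per vertex.
 * Hypothetical helper for illustration only. */
static uint32_t
ls_output_lds_offset(uint32_t invocation_id, uint32_t driver_location,
                     uint32_t num_slots)
{
   uint32_t vertex_stride = num_slots * SLOT_SIZE;
   return invocation_id * vertex_stride + driver_location * SLOT_SIZE;
}
```

A NIR lowering pass would emit exactly this kind of computation and replace the `store_output` / `load_per_vertex_input` intrinsics with shared memory accesses at the resulting offset.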
Milestone 1/B (optional): De-duplicate some code regarding occupancy calculations

- Move the workgroup size calculation from ACO to RADV, or even to AC.
- Move the tcs_num_patches calculation to RADV; remove it from ACO and from radv_nir_to_llvm.
- Try to share the LS-HS occupancy calculation (get_tcs_num_patches) between RADV and RadeonSI.
- Try to share the NGG occupancy calculation between RADV and RadeonSI.
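To make the shape of these calculations concrete: the number of tess patches per LS-HS workgroup is limited both by how many patches fit into the available LDS and by a fixed hardware cap. Here is a rough sketch of that logic, assuming made-up names and limits (the real get_tcs_num_patches considers more constraints):

```c
#include <stdint.h>

#define MAX_LDS_SIZE (32u * 1024u)  /* assumed LDS budget per workgroup */
#define MAX_PATCHES 40u             /* assumed HW cap on patches per TG */

static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/* Illustrative occupancy calculation: how many patches fit in one LS-HS
 * threadgroup, given their LDS footprint. Names are hypothetical. */
static uint32_t
calc_tcs_num_patches(uint32_t input_vertices_per_patch,
                     uint32_t lds_bytes_per_input_vertex,
                     uint32_t lds_bytes_per_patch_outputs)
{
   uint32_t lds_per_patch =
      input_vertices_per_patch * lds_bytes_per_input_vertex +
      lds_bytes_per_patch_outputs;

   uint32_t num_patches = MAX_LDS_SIZE / lds_per_patch;
   return min_u32(num_patches, MAX_PATCHES);
}
```

Since both RADV and RadeonSI need the same answer, this is exactly the kind of function that belongs in shared AC code.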
Milestone 2: Merged shaders in NIR

I decided to scrap this idea, since it seems to be more trouble than it's worth.

- Current status: all 3 backends take 2 NIR shaders as input and bolt these shaders together.
- Proposal: make NIR aware of the fact that a shader can contain pieces of 2 different stages.
Milestone 3: Lower NGG in NIR

- Current status: we don't have a direct equivalent of NGG capabilities in NIR; we pretend NGG works like the traditional geometry pipeline, and only map the traditional model to NGG in the backend compilers.
- Proposal: define new intrinsics for NGG features (basically, mesh-shader-like features), then lower VS, TES and GS to these.
- Benefits, notes:
  - This will de-duplicate a lot of complicated logic that currently exists in all backends.
  - It brings us one step closer to supporting actual mesh shaders, because it gives us a base to which we can also lower mesh shaders in the future. (Notable exception: we don't care about generic per-primitive outputs for now, among a few other things.)
Tasks:

- Introduce the necessary concepts to NIR.
- Notes about how NGG hardware works:
  - It is basically a strict subset of a normal mesh shader. The point of the lowering passes is to transform the VS/TES/GS (and later MS) model into NGG terms.
  - The shader must know the number of exported vertices and primitives, and allocate space for them, before exporting them.
  - Each active lane must export 0 or 1 primitives and specify the vertex indices (a vertex index is the ID of the lane in the threadgroup which will export that vertex).
  - Each active lane must export exactly 1 vertex.
  - If we don't do exactly what the HW expects, it will express its discontent by hanging.
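To illustrate the per-lane primitive export described above, here is a sketch of how the primitive export argument could be packed: each vertex index (the lane ID within the threadgroup that exports that vertex) occupies a 10-bit field, and a high bit marks a "null" (not exported) primitive. This layout reflects my understanding of the GFX10 format; treat the exact bit positions as an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Pack up to 3 vertex indices into an NGG primitive export argument.
 * Assumed layout: index i in bits [10*i, 10*i+8], bit 31 = null prim. */
static uint32_t
pack_ngg_prim_export(const uint32_t vtx_indices[3], unsigned num_vertices,
                     bool is_null)
{
   uint32_t arg = is_null ? (1u << 31) : 0;
   for (unsigned i = 0; i < num_vertices; i++)
      arg |= (vtx_indices[i] & 0x1ffu) << (10u * i);
   return arg;
}
```

An NIR NGG lowering pass would emit this packing (as ALU instructions) in front of the primitive export intrinsic, instead of each backend open-coding it.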
- NIR NGG lowering passes:
  - NGG VS and TES
  - NGG GS
- Integrate the new NIR passes into the backends:
  - RADV/ACO
  - RADV/LLVM
  - RadeonSI/LLVM
Milestone 4/A: NGG shader-based primitive culling in NIR

With all of the above in place, it becomes possible to implement primitive culling in NIR, without any need for the backends to be aware of it.
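As a taste of what such a pass would emit, here is a minimal sketch of one of the culling tests: back-face culling via the sign of the triangle's signed area in screen space. A real pass would also do small-primitive and frustum culling and honor the API's winding/facing state; this is only an illustration with an assumed counter-clockwise front-face convention.

```c
#include <stdbool.h>

/* Back-face test: 2D cross product of the triangle's edge vectors in
 * screen space. Positive determinant = counter-clockwise = front-facing
 * (assumed convention). */
static bool
is_front_facing(float x0, float y0, float x1, float y1, float x2, float y2)
{
   float det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0);
   return det > 0.0f;
}
```

In NGG terms, a lane whose primitive fails this test would simply export a null primitive (and the pass would repack the surviving vertices).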
Milestone 4/B: NGG streamout (aka transform feedback) in NIR

This is somewhat orthogonal to, but loosely connected with, the above topics. Since streamout basically just means writing data to VRAM, we can now also implement it in a NIR pass.
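The core of such a pass is again address arithmetic: each captured output is stored to the transform feedback buffer at an offset derived from how much the buffer is already filled and the declared per-vertex stride. A hypothetical sketch (names are illustrative):

```c
#include <stdint.h>

/* Byte offset at which a streamout-in-NIR pass would store one captured
 * component of one vertex. Illustrative helper, not Mesa code. */
static uint32_t
streamout_store_offset(uint32_t buffer_filled_size, /* bytes already written */
                       uint32_t vertex_index,       /* among emitted vertices */
                       uint32_t stride,             /* bytes per vertex */
                       uint32_t component_offset)   /* within the vertex */
{
   return buffer_filled_size + vertex_index * stride + component_offset;
}
```

The pass would then emit plain VRAM stores at these offsets, plus the bookkeeping (atomically advancing the filled size), with no backend involvement.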