Cunning plan for lowered I/O

I'd like to start moving all of NIR towards nir_io_semantics and away from the legacy driver_location and nir_intrinsic_base(). This is a bit of a big project and is going to need buy-in from various driver maintainers. A bunch of drivers are already ready but there's still a handful that will need non-trivial work.

Motivation

Right now, we have basically 4 kinds of I/O in NIR and it's at least one too many:

Variables: Everything is done with variables and load/store_deref
Semantic I/O: This is load/store_input/output where nir_io_semantics is used and everything is in units of API locations.
Generic lowered I/O: The driver decides on locations and type_size() but it still uses load/store_input/output.
Driver I/O: Lowered I/O using driver-specific intrinsics

@mareko has been doing a bunch of I/O reworks lately and improvements to the linking helpers. All of that assumes semantic I/O. Meanwhile, many drivers aren't using nir_io_semantics and are instead doing something weird and custom. The real irony is that what they're doing is usually pretty equivalent to nir_io_semantics, just different in some random detail. Because semantic I/O and generic lowered I/O use the same intrinsics, it's a constant guessing game as to which one is in play in any given NIR pass.
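To make the guessing game concrete, here is a minimal self-contained sketch. Everything in it is a hypothetical stand-in (the `toy_` struct, its fields, and the helper), not the real NIR definitions. The only things taken from the text are that both flavors share one intrinsic, that semantic I/O is keyed on API locations, and that generic lowered I/O is keyed on a driver-chosen base.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a load_input intrinsic. */
struct toy_load_input {
   int base;            /* driver-chosen location (generic lowered I/O) */
   unsigned location;   /* API location (semantic I/O) */
   bool has_semantics;  /* toy flag, assumed for illustration only */
};

/* A linking-style pass wants the API location. With semantic I/O it can
 * read it directly. With generic lowered I/O only the driver knows how
 * base maps back to an API location, so the pass has to give up. */
static bool
toy_api_location(const struct toy_load_input *load, unsigned *out)
{
   if (!load->has_semantics)
      return false;

   *out = load->location;
   return true;
}
```

The real intrinsics have no equivalent of `has_semantics`, which is exactly the problem: a pass looking at load/store_input/output cannot tell which interpretation applies without knowing which driver it is running for.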

Meanwhile, Vulkan has been moving more and more in the direction of explicit I/O at the API level. There has even been chatter about trying to give cross-stage vertex I/O that same treatment. There's no extension for this, and I wouldn't be able to talk about it here if there were, but my reading of the tea leaves says it might be coming. In light of that, I'd like NIR to be ready. I'd also like to have a good feeling of what "the NIR plan" is so that I can speak to it if and when any discussions of that nature do come up in earnest.

Thirdly, and most importantly, this is an area in which Mesa has needlessly diverged. There is a whole lot of "sounded like a good idea at the time" so I'm not going to point fingers. I resisted nir_io_semantics for a long time so I'm definitely to blame for some of it. However, at this point it's pretty clear that we have two or three paths and they all work but we really don't need them all. The more passes that get added for optimizing and otherwise dealing with I/O, the more painful things become.

Why not `driver_location`?

I think that's better answered by first trying to answer the question, "Why driver_location?" When @cwabbott0 and I first brought up NIR, the idea was that hardware has some sort of I/O space and that driver_location would be a location in that hardware I/O space. Drivers get variables with locations and would assign driver_location and then let nir_lower_io() give them these load/store_input/output intrinsics which are "nicer" than variables. It wasn't a fundamentally terrible plan.

Then someone made gallium call nir_lower_io() and that plan got totally shot to hell.

Not that I'm actually complaining. I was pretty annoyed by it for a while but I've since come to the conclusion that nir_io_semantics really is the only sane way to do any of this. That or variables, and variables suck.

So why not driver_location? Didn't I say it was an okay plan? Well, yes but also no. The problem isn't really with driver_location but with load/store_input/output themselves. When I originally brought up NAK, I went all in on driver_location. NVIDIA hardware is basically the perfect hardware for the driver_location model. It has a unified I/O space where everything except a handful of system values lives. Everything is addressed in bytes. There are no special instructions for misc. values. It's perfect. The problem is that load/store_input/output are just not what the back-end wants to consume. They're way too clunky. I ended up adding ald/ast/ipa_nv intrinsics which map better to what the hardware wants and a custom NIR lowering pass to lower load/store_input/output to those. At that point, whether I do the mapping to HW locations via driver_location or directly in my lowering pass doesn't really matter. driver_location gains me nothing.
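As a rough illustration of why driver_location gains nothing once a driver has its own lowering pass, here is a sketch of mapping an API location straight to a byte address in a unified I/O space. The constants and the `toy_attr_addr()` helper are hypothetical. They are not real NVIDIA attribute addresses and not NAK code, just an assumed layout with 16-byte vec4 slots.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout constants, NOT real hardware addresses. */
#define TOY_GENERIC_BASE   0x80u  /* byte offset where generic slots start */
#define TOY_SLOT_SIZE      16u    /* one vec4 of 32-bit components */
#define TOY_FIRST_GENERIC  32u    /* first generic API location in this toy */

/* Map (API location, component) directly to a byte address. A driver
 * lowering pass can do this from the semantic location alone, so no
 * intermediate driver_location is needed. */
static uint32_t
toy_attr_addr(unsigned location, unsigned component)
{
   assert(location >= TOY_FIRST_GENERIC);
   assert(component < 4);

   return TOY_GENERIC_BASE +
          (location - TOY_FIRST_GENERIC) * TOY_SLOT_SIZE +
          component * 4u;
}
```

Under these assumptions, the lowering pass would call something like `toy_attr_addr()` when it emits its hardware-shaped intrinsic, and the mapping lives in one place in the driver.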

So I converted NAK to nir_io_semantics last week.

The other place the driver_location plan fell apart is that we originally intended it to be used for all sorts of things. Vertex I/O, uniforms, shared, and anything else where the driver needs a location. The majority of those have been moved over to nir_lower_explicit_io at this point. The only things left actually using nir_lower_io are GL uniforms (which are immediately moved to cbuf0 with gallium), vertex I/O, and maybe shared in a few GL edge cases. This great generic system we created is neither particularly generic nor is it all that great.

The plan

So what I'd like to do is to unify all of NIR on nir_io_semantics for vertex I/O and figure out something for uniforms. I'm honestly not sure how uniforms are still a thing in the gallium world but I see nir_lower_io paired with nir_var_uniform in the code and haven't done a detailed enough analysis to figure out why.

To that end, I think we need to do roughly the following:

Add a nir_lower_io_semantics() helper (please, someone help me come up with a better name) which sets driver_location = -1 on everything and always runs on nir_var_shader_in | nir_var_shader_out with a fixed vec4 type_size callback.
One by one convert drivers to nir_lower_io_semantics(). Because driver_location = -1, they'll have to use nir_io_semantics and ignore nir_intrinsic_base() on all load/store_input/output intrinsics. I'm fairly sure ACO, AMD LLVM, NAK, panfrost, asahi, and a few others are probably already good to go and should be trivial.
Something, something nir_var_uniform. I think 90% of this can be done by overloading location and making nir_lower_uniforms_to_ubo(). If we do still want nir_lower_io for uniforms for some drivers, we can add a nir_lower_uniform_io() helper which restricts things down to just what we want for those cases.
Delete the BASE, RANGE, and COMPONENT const indices from load/store_input/output
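The first two steps could look roughly like the sketch below. It is a standalone toy model, not Mesa code: `toy_var`, `toy_semantics`, and the `toy_` functions are hypothetical, and nir_lower_io_semantics() itself is only the name proposed above, not an existing helper. The type size is simplified to one vec4 slot per array element and ignores things like 64-bit types.

```c
#include <stddef.h>

/* Hypothetical stand-in for a NIR shader_in/shader_out variable. */
struct toy_var {
   unsigned location;     /* API location, left untouched */
   int driver_location;   /* legacy field the plan wants to retire */
   unsigned array_len;    /* 1 for a plain scalar or vector */
};

/* Hypothetical stand-in for the semantics attached to an intrinsic. */
struct toy_semantics {
   unsigned location;
   unsigned num_slots;
};

/* Fixed vec4 type_size: each scalar or vector takes one slot and an
 * array takes one slot per element (simplifying assumption). */
static unsigned
toy_type_size_vec4(const struct toy_var *var)
{
   return var->array_len;
}

/* Poison driver_location so that any code still reading it (or the
 * intrinsic base derived from it) gets an obviously bogus value. */
static void
toy_lower_io_semantics(struct toy_var *vars, size_t count)
{
   for (size_t i = 0; i < count; i++)
      vars[i].driver_location = -1;
}

/* What a converted driver would consume instead of the base. */
static struct toy_semantics
toy_semantics_for(const struct toy_var *var)
{
   struct toy_semantics sem = {
      .location = var->location,
      .num_slots = toy_type_size_vec4(var),
   };
   return sem;
}
```

The point of the -1 is that a driver which has not been converted fails loudly instead of silently reading a stale location.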

I 100% recognize that the above plan is probably incomplete. As we go, we can add to it as we figure out what needs to be done in more detail.

Edited Apr 03, 2024 by Faith Ekstrand
