Build a system for pattern-matching derefs
The idea here is to build something sort-of like nir_opt_algebraic
only for derefs. Now that NIR supports casts, there are a lot of nasty cases that can come up and we don't have a good way to solve them. All NIR can do today is throw up its hands and walk away the moment it sees an interesting deref chain. We do have some optimizations today for removing trivial casts or up/down-casts of structs but they're entirely hand-rolled and very painful to write.
This is going to be particularly important in the OpenCL world as LLVM seems to be a big fan of creating these interesting use-cases even when they didn't exist in the original kernel code. For example, consider the following (this is a real example from an OpenCL shader):
uint64_t x = 0x100000002;
uint32_t y = ((uint32_t[2])&x)[0];
In the original kernel source, x
was a uint32_t[2]
but LLVM "helpfully" turned it into a 64-bit value so it could write the whole thing with one instruction. In this case, we'd like to be able to turn this into something we can eventually constant-fold. One option would be to modify the second line like so:
uint64_t x = 0x100000002;
uint32_t y = unpack_64_2x32_split_x(x);
Technically, this new second line reads more data than necessary and that might be a problem if there were any danger of faulting. However, x
is a local variable of type uint64_t
so we know the memory is there.
Another example which I haven't actually seen come up in a shader but I've written it several times in C would be this:
T *x = foo();
T *y = (T *)(&((char *)x)[i * sizeof(T)]);
or
T *x = foo();
T *y = (T *)((uintptr_t)x + i * sizeof(T));
which we would rather turn into
T *x = foo();
T *y = &x[i];
We have to be a bit careful with the second of these examples because integer math may roll-over in unexpected ways and it's possible that the roll-over was intended in some sense. However, there may still be cases in which we can do such a transformation legally, especially if the math in question is decorated no_signed_wrap
.
My feeling is that what we really need for this isn't a more powerful nir_opt_copy_prop_vars
but rather a peep-hole optimization on derefs or, more specifically, a giant pile of such peep-hole optimizations. This leads me down the road of wondering if we can come up with a meta-language for it similar to nir_opt_algebraic only for derefs. Such a language would need to support a number of things:
- All types of derefs
- Some sort of generic types
- Some limited ALU so we can see and generate at least add, subtract, multiply, and divide.
- Likely, it also needs to be able to detect shifts which implement multiply or divide.
- Loads and stores and the ability to replace a load or store with one in which the data is wrapped in some ALU.
I don't have any sort of sense yet for what the meta-language should look like. The idea is still forming. I'm happy to take suggestions or for someone else to drive it if they think they know how to do it. For now, it'll sit on the wishlist of things I give a brain cell or two to every once in a while.