GP complex instruction results cannot be spilled/moved
I couldn't get exp2/log2 to work, so I started to reverse-engineer a bit of what these magic complex opcodes are doing.
Overall, it's actually quite similar to what's described here for nvidia.
complex2 multiplies the inputs, sometimes adding a strange offset as well (I couldn't figure out why), so the way the blob uses it, it's effectively squaring the input. Each of the complex opcodes looks up polynomial coefficients in a different table, and
complex1 computes the rest of the polynomial and does the output exponent correction. I suspect that the table entries are more than 32 bits, and that the two different
complex1 sources actually receive two different parts of the table entry.
preexp2 and postlog2 convert to/from a fixed-point format, which makes doing the exponent correction easier (again similar to nvidia). I suspect there are similar shenanigans going on with
preexp2 since in my tests it sometimes would return identical values for two different inputs, hence probably different uses of
preexp2 are getting different values to compensate for 32 bits not being enough. I haven't gotten the details nailed down, but I don't think we really have to.
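To make the general shape of this concrete, here is a minimal sketch of a table-driven log2: look up polynomial coefficients by the top mantissa bits, evaluate a short polynomial, then correct for the exponent. This is only an illustration of the technique. The table size, the quadratic degree, the Taylor-derived coefficients, and all names (`TABLE_BITS`, `LOG2_TABLE`, `table_log2`) are assumptions made for the example, not the hardware's actual tables or formats.

```python
import math

TABLE_BITS = 6  # assumed table size (64 entries), purely illustrative


def build_log2_table():
    """One (x0, c0, c1, c2) entry per mantissa interval in [1, 2)."""
    entries = 1 << TABLE_BITS
    table = []
    for i in range(entries):
        x0 = 1.0 + (i + 0.5) / entries       # midpoint of the interval
        c0 = math.log2(x0)                   # value at the midpoint
        c1 = 1.0 / (x0 * math.log(2.0))      # first derivative
        c2 = -0.5 / (x0 * x0 * math.log(2.0))  # second derivative / 2
        table.append((x0, c0, c1, c2))
    return table


LOG2_TABLE = build_log2_table()


def table_log2(x):
    """Approximate log2(x) for finite x > 0."""
    m, e = math.frexp(x)   # x = m * 2**e with 0.5 <= m < 1
    m *= 2.0               # renormalize so that 1 <= m < 2
    e -= 1
    index = int((m - 1.0) * (1 << TABLE_BITS))  # table lookup by top bits
    x0, c0, c1, c2 = LOG2_TABLE[index]
    d = m - x0
    # Polynomial in d (the d*d term is where a squaring step comes in),
    # followed by the exponent correction.
    return e + c0 + d * (c1 + d * c2)
```

In this toy version everything is done in ordinary floating point; the point above is that the real opcodes apparently pass intermediate values around in formats that are not ordinary floats.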
Now, from this description, it should be clear that
preexp2 and the table-lookup opcodes are doing something quite weird. There's the further issue that
complex1 produces something that isn't supposed to be interpreted as a floating-point value in log2 mode; it's a fixed-point value that's supposed to be post-processed by
postlog2. So sometimes it produces what would be an "invalid" floating-point value that would never be produced otherwise, i.e. either a denormalized value or a NaN with a non-standard payload. These get flushed to 0 and the standard NaN respectively when you try to do anything floating-point-y, and since a move in the add or mul slots is just adding -0 or multiplying by 1 respectively, a move between complex1 and
postlog2 will break things. And of course, the same issue exists with a move between
preexp2 and anything, and a LUT opcode and anything. And
preexp2 and LUT opcodes are already magically producing multiple values anyways.
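A small model of why such a move is lossy, operating on raw 32-bit patterns. This is a sketch under assumptions: the canonical NaN pattern `0x7FC00000`, the choice to keep the sign bit when flushing a denormal, and the function name `fp_move` are all made up for the example and have not been checked against the hardware.

```python
CANONICAL_NAN = 0x7FC00000  # assumed "standard NaN" bit pattern


def fp_move(bits):
    """Model a move done as x + (-0) or x * 1 on a 32-bit pattern.

    Assumes denormals are flushed to (signed) zero and every NaN is
    replaced by one canonical NaN, as described in the text above.
    """
    sign = bits & 0x80000000
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x007FFFFF
    if exponent == 0 and mantissa != 0:
        return sign            # denormal: payload bits are lost
    if exponent == 0xFF and mantissa != 0:
        return CANONICAL_NAN   # NaN: non-standard payload is lost
    return bits                # normal values, zeros and infinities survive
```

For example, a fixed-point result whose bits happen to be `0x00000123` looks like a denormal and comes out of `fp_move` as `0x00000000`, while `0x3F800000` (1.0) passes through unchanged.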
So, there are a few nodes we absolutely can't insert a move after: preexp2, the *_impl table-lookup opcodes, and
complex1 when consumed by postlog2.
Technically we can for
complex2, but since
complex2 sometimes has
preexp2 as a source it sometimes has to be scheduled right before
complex1. All in all, we almost always have to make sure that these instructions occur in the same exact sequence they do in the blob.
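The "no move after these nodes" rule can be written down as a simple check over a toy IR. This is only a sketch: the node representation (a dict mapping a node name to an `(op, sources)` pair), the `"mov"` opcode name, and the helper `illegal_moves` are hypothetical and do not correspond to the compiler's real data structures. It also only looks at a move's direct source, so chains of moves are not followed.

```python
def illegal_moves(nodes):
    """Return the names of move nodes that copy a value that must not be moved.

    `nodes` maps name -> (op, [source names]); this layout is hypothetical.
    """
    bad = []
    for name, (op, sources) in nodes.items():
        if op != "mov":
            continue
        source_op = nodes[sources[0]][0]
        if source_op == "preexp2" or source_op.endswith("_impl"):
            # preexp2 and the table-lookup results can never be moved.
            bad.append(name)
        elif source_op == "complex1":
            # A complex1 result is only fragile when postlog2 consumes it.
            feeds_postlog2 = any(
                user_op == "postlog2" and name in user_sources
                for user_op, user_sources in nodes.values()
            )
            if feeds_postlog2:
                bad.append(name)
    return bad
```

A graph such as `x -> log2_impl -> mov -> complex1 -> postlog2` would be flagged at the `mov`, whereas a move placed after `postlog2` would not.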
Some of these nodes we can easily guarantee to succeed if we schedule them first, namely
preexp2 (it's always a max node when scheduled that doesn't increase register pressure) and
*_impl (it's in the complex slot, hence unaffected by max-node reservations). We're not so lucky with
complex2, but I think we can add some extra reservation logic so that when we schedule
complex1 we reserve an extra next-max slot to be used by
complex2. The biggest problem is guaranteeing
complex1 can succeed, which seems quite difficult. Maybe a better way would be to first try to schedule it, and then if it doesn't succeed, turn the
postlog2 into a move, put
postlog2 back on the ready list, and carry on to try again.