GP complex instruction results cannot be spilled/moved
I couldn't get exp2/log2 to work, so I started reverse-engineering what these magic complex opcodes are doing.
Overall, it's actually quite similar to what's described here for nvidia. complex2 multiplies the inputs, sometimes combined with adding a strange offset (I couldn't figure out why), so the way the blob uses it, it's effectively squaring the input. Each of the complex opcodes looks up polynomial coefficients in a different table, and complex1 computes the rest of the polynomial and does the output exponent correction. I suspect that the table entries are more than 32 bits, and that the two different complex1 sources actually receive two different parts of the table entry. preexp2 and postlog2 convert to/from a fixed-point format, which makes doing the exponent correction easier (again similar to nvidia). I suspect there are similar shenanigans going on with preexp2, since in my tests it would sometimes return identical values for two different inputs; different uses of preexp2 are probably getting different values to compensate for 32 bits not being enough. I haven't gotten the details nailed down, but I don't think we really have to.
Now, from this description, it should be clear that preexp2 and the table-lookup opcodes are doing something quite weird. There's the further issue that in log2 mode, complex1 produces something that isn't supposed to be interpreted as a floating-point value: it's a fixed-point value that's supposed to be post-processed by postlog2. So sometimes it produces what would be an "invalid" floating-point value that would never be produced otherwise, i.e. either a denormalized value or a NaN with a non-standard payload. These get flushed to 0 and the standard NaN respectively when you try to do anything floating-point-y, and since a move in the add or mul slots is just adding -0 or multiplying by 1 respectively, a move between complex1 and postlog2 will break things. And of course, the same issue exists with a move between preexp2 and anything, and between a LUT opcode and anything. And preexp2 and the LUT opcodes are already magically producing multiple values anyway.
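To make the move problem concrete, here is a rough C model of what a floating-point "move" does to a raw 32-bit pattern. It is only a sketch: `fp_move_bits` is a made-up name, and both keeping the sign when flushing and using 0x7fc00000 as the "standard NaN" are my assumptions, not something I've checked on hardware.

```c
#include <stdint.h>

/* Hypothetical model of a "move" done as x + (-0) or x * 1 on hardware
 * that flushes denormals to zero and canonicalizes NaNs.
 * Input and output are raw IEEE 754 single-precision bit patterns. */
static uint32_t fp_move_bits(uint32_t bits)
{
    uint32_t sign = bits & 0x80000000u;
    uint32_t exponent = (bits >> 23) & 0xffu;
    uint32_t mantissa = bits & 0x007fffffu;

    if (exponent == 0 && mantissa != 0)
        return sign;            /* denormal: flushed to zero (sign kept, an assumption) */
    if (exponent == 0xff && mantissa != 0)
        return 0x7fc00000u;     /* any NaN: replaced by one canonical NaN (assumed encoding) */
    return bits;                /* normal numbers, zeros and infinities survive */
}
```

Any fixed-point intermediate whose bits happen to look like a denormal or a NaN comes out of this "move" changed, which is exactly why it can't sit between complex1 and postlog2.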
So, there are a few nodes we absolutely can't insert a move after:
- preexp2
- *_impl
- complex1 when consumed by postlog2
Technically we can for complex2, but since complex2 sometimes has preexp2 as a source, it sometimes has to be scheduled right before complex1. All in all, we almost always have to make sure that these instructions occur in exactly the same sequence they do in the blob.
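The rule above could be written as a small predicate, roughly like this. Everything here is illustrative: the enum, the `node` struct with a single `consumer` pointer, and `can_insert_move_after` are hypothetical names for this sketch, not the compiler's real data structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical opcode set; OP_LUT_IMPL stands in for all the *_impl opcodes. */
enum op { OP_OTHER, OP_PREEXP2, OP_POSTLOG2, OP_COMPLEX1, OP_COMPLEX2, OP_LUT_IMPL };

/* Simplified node with at most one consumer (an assumption of this sketch). */
struct node {
    enum op op;
    const struct node *consumer;
};

/* Returns false for results that must not pass through a move. */
static bool can_insert_move_after(const struct node *n)
{
    switch (n->op) {
    case OP_PREEXP2:
    case OP_LUT_IMPL:
        return false;
    case OP_COMPLEX1:
        /* Only the log2 path hands a fixed-point value to postlog2. */
        return !(n->consumer != NULL && n->consumer->op == OP_POSTLOG2);
    default:
        /* complex2 lands here: a move is technically fine, but its
         * placement relative to complex1 still has to be enforced elsewhere. */
        return true;
    }
}
```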
Some of these nodes we can easily guarantee to succeed if we schedule them first, namely preexp2 (when scheduled, it's always a max node that doesn't increase register pressure) and *_impl (it's in the complex slot, hence unaffected by max-node reservations). We're not so lucky with complex2, but I think we can add some extra reservation logic so that when we schedule complex1 we reserve an extra next-max slot to be used by complex2. The biggest problem is guaranteeing that complex1 can succeed, which seems quite difficult. Maybe a better way would be to first try to schedule it, and then if it doesn't succeed, turn the postlog2 into a move, put postlog2 back on the ready list, and carry on to try again.
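A toy version of that fallback might look like the following. This is just a sketch of the idea under heavy simplification: `instr` with a bare `free_slots` counter, the `node` fields, and `place_complex1_or_retry` are all invented for illustration, and real placement would depend on much more than a free-slot count.

```c
#include <stdbool.h>

enum op { OP_MOV, OP_POSTLOG2, OP_COMPLEX1 };

struct node {
    enum op op;
    bool placed;
    bool on_ready_list;
};

/* Toy instruction: placement succeeds only while a slot is free. */
struct instr {
    int free_slots;
};

static bool try_place(struct instr *instr, struct node *n)
{
    if (instr->free_slots == 0)
        return false;
    instr->free_slots--;
    n->placed = true;
    return true;
}

/* Try to place complex1. On failure, demote the already placed postlog2
 * to a move and queue a fresh postlog2 so the scheduler can try again. */
static bool place_complex1_or_retry(struct instr *instr,
                                    struct node *complex1,
                                    struct node *placed_postlog2,
                                    struct node *fresh_postlog2)
{
    if (try_place(instr, complex1))
        return true;

    placed_postlog2->op = OP_MOV;          /* the old slot now just forwards the value */
    fresh_postlog2->op = OP_POSTLOG2;      /* the real postlog2 gets rescheduled */
    fresh_postlog2->placed = false;
    fresh_postlog2->on_ready_list = true;
    return false;
}
```

The point of this shape is that the move ends up after postlog2, where the value is an ordinary float again, instead of between complex1 and postlog2.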