GP complex instruction results cannot be spilled/moved

I couldn't get exp2/log2 to work, so I started reverse-engineering what these magic complex opcodes are actually doing.

Overall, it's actually quite similar to what's described here for NVIDIA. complex2 multiplies its inputs, sometimes adding a strange offset as well (I couldn't figure out why), so the way the blob uses it, it effectively squares the input. Each of the complex opcodes looks up polynomial coefficients in a different table, and complex1 evaluates the rest of the polynomial and applies the output exponent correction. I suspect that the table entries are wider than 32 bits, and that complex1's two sources actually receive two different parts of the same table entry.

preexp2 and postlog2 convert to and from a fixed-point format, which makes the exponent correction easier (again similar to NVIDIA). I suspect similar shenanigans are going on with preexp2, since in my tests it would sometimes return identical values for two different inputs; probably different uses of preexp2 get different values to compensate for 32 bits not being enough. I haven't nailed down the details, but I don't think we really need to.
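To make the general shape concrete, here is a sketch of the generic table-driven exp2 technique (range reduction, coefficient lookup by the top fraction bits, a short polynomial, then exponent correction). It is only an illustration under assumptions: the 64-entry table, the quadratic Taylor coefficients, and all names (`exp2_sketch`, `build_table`, `pow2i`) are hypothetical, and the comments only mark where preexp2, complex2, and complex1 would roughly fit according to the guesses above, not what the Mali GP tables actually contain.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TABLE_BITS 6                 /* assumed: 64 intervals over [0, 1) */
#define TABLE_SIZE (1 << TABLE_BITS)
#define LN2 0.6931471805599453

/* Quadratic coefficients for 2^(f0 + d), with d in [0, 1/TABLE_SIZE]. */
struct coeffs { float c0, c1, c2; };
static struct coeffs table[TABLE_SIZE];
static int table_built;

/* e^y by its Taylor series; plenty accurate for 0 <= y < ln 2. */
static double exp_series(double y)
{
    double term = 1.0, sum = 1.0;
    for (int k = 1; k < 25; k++) {
        term *= y / k;
        sum += term;
    }
    return sum;
}

/* Hypothetical table: Taylor expansion of 2^(f0 + d) around d = 0. */
static void build_table(void)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        double base = exp_series(LN2 * i / TABLE_SIZE); /* 2^(i / TABLE_SIZE) */
        table[i].c0 = (float)base;
        table[i].c1 = (float)(base * LN2);
        table[i].c2 = (float)(base * LN2 * LN2 / 2.0);
    }
    table_built = 1;
}

/* 2^n as a float for -126 <= n <= 127, built by writing the exponent field. */
static float pow2i(int n)
{
    uint32_t bits = (uint32_t)(n + 127) << 23;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Table-driven 2^x for -126 <= x < 127; NaN/inf/overflow are not handled. */
static float exp2_sketch(float x)
{
    if (!table_built)
        build_table();

    /* Range reduction (roughly preexp2's role): x = n + f, 0 <= f < 1. */
    int n = (int)x;
    if ((float)n > x)
        n--;                               /* floor for negative x */
    float f = x - (float)n;

    /* The top TABLE_BITS bits of f select the coefficients (the lookup). */
    int idx = (int)(f * TABLE_SIZE);
    if (idx >= TABLE_SIZE)
        idx = TABLE_SIZE - 1;              /* f can round up to 1.0f */
    float d = f - (float)idx / TABLE_SIZE;

    /* Polynomial; d * d is the kind of product complex2 seems to form. */
    struct coeffs c = table[idx];
    float p = c.c0 + c.c1 * d + c.c2 * (d * d);

    /* Exponent correction (complex1's final step, in this guess). */
    return p * pow2i(n);
}
```

The integer part never goes through the polynomial at all; it only touches the exponent field at the end. That is why a fixed-point intermediate format makes the exponent correction cheap.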

From this description, it should be clear that preexp2 and the table-lookup opcodes are doing something quite weird. There's a further issue: in log2 mode, complex1's result isn't meant to be interpreted as a floating-point value at all. It's a fixed-point value that postlog2 is supposed to post-process. So complex1 sometimes produces what would be an "invalid" floating-point value that would never be produced otherwise, i.e. either a denormal or a NaN with a non-standard payload. As soon as you do anything floating-point-y with such a value, it gets flushed to 0 or to the standard NaN respectively. Since a move in the add slot is just adding -0 and a move in the mul slot is just multiplying by 1, a move between complex1 and postlog2 will break things. Of course, the same issue exists with a move between preexp2 and anything, or between a LUT opcode and anything. And preexp2 and the LUT opcodes are already magically producing multiple values anyway.
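The failure mode can be modeled on raw bit patterns. The sketch below is a hypothetical model, not a description of the hardware: `gp_mov_model` stands in for a move done as `x + (-0)` or `x * 1` on a unit that flushes denormals and canonicalizes NaNs, and both the canonical NaN value `0x7fc00000` and keeping the sign bit on flush are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define CANONICAL_NAN 0x7fc00000u   /* assumed "standard NaN" */

/* Hypothetical model of a GP move on a flush-to-zero, NaN-canonicalizing unit.
 * Input and output are raw 32-bit patterns, not interpreted values. */
static uint32_t gp_mov_model(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xffu;
    uint32_t mantissa = bits & 0x7fffffu;

    if (exponent == 0 && mantissa != 0)      /* denormal: flushed to zero */
        return bits & 0x80000000u;           /* sign kept (assumption) */
    if (exponent == 0xffu && mantissa != 0)  /* NaN: payload discarded */
        return CANONICAL_NAN;
    return bits;  /* normal numbers, zeros and infinities pass through */
}

/* A raw value is only safe to route through a move if the model leaves it
 * bit-for-bit unchanged. */
static int survives_mov(uint32_t bits)
{
    return gp_mov_model(bits) == bits;
}
```

For example, a fixed-point intermediate whose pattern happens to be `0x00012345` reads as a denormal and comes out of the move as 0, while an ordinary float like `0x3f800000` (1.0) is untouched. Whether a move is harmless therefore depends on the bit pattern, which is exactly what complex1 in log2 mode, preexp2 and the LUT opcodes don't guarantee.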

So, there are a few nodes we absolutely can't insert a move after:

- preexp2
- *_impl
- complex1 when consumed by postlog2
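The list above boils down to a predicate on (producer, consumer) pairs. This is a hypothetical sketch: the enum and `can_insert_move` are illustrative names based on the opcodes discussed here, not the driver's actual identifiers, and other `*_impl` opcodes would presumably join the exp2/log2 ones in the "never" case.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical opcode list, named after the opcodes discussed above. */
enum gp_op {
    OP_PREEXP2,
    OP_POSTLOG2,
    OP_EXP2_IMPL,
    OP_LOG2_IMPL,
    OP_COMPLEX1,
    OP_COMPLEX2,
    OP_ADD,
};

/* True if a move may be placed between 'producer' and 'consumer'. */
static bool can_insert_move(enum gp_op producer, enum gp_op consumer)
{
    switch (producer) {
    case OP_PREEXP2:        /* never: result isn't a plain float */
    case OP_EXP2_IMPL:      /* never: LUT results aren't plain floats */
    case OP_LOG2_IMPL:
        return false;
    case OP_COMPLEX1:       /* only when the result is a real float */
        return consumer != OP_POSTLOG2;
    default:                /* includes complex2, which is technically fine */
        return true;
    }
}
```

complex2 falls into the default case on purpose. As noted below, it usually has to stay adjacent anyway, but that is a scheduling consequence rather than a property of its result.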

Technically we can insert a move after complex2, but since complex2 sometimes has preexp2 as a source, it sometimes has to be scheduled right before complex1. All in all, we almost always have to make sure that these instructions occur in exactly the same sequence as they do in the blob.

Some of these nodes we can easily guarantee will succeed if we schedule them first, namely preexp2 (it's always a max node that doesn't increase register pressure when scheduled) and *_impl (it's in the complex slot, so it's unaffected by max-node reservations). We're not so lucky with complex2, but I think we can add some extra reservation logic so that when we schedule complex1, we reserve an extra next-max slot for complex2. The biggest problem is guaranteeing that complex1 can succeed, which seems quite difficult. A better approach might be to first try to schedule it, and if that fails, turn the postlog2 into a move, put postlog2 back on the ready list, and carry on, trying again.
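The proposed fallback can be illustrated with a toy model. Everything here is hypothetical: it assumes a bottom-up list scheduler (which the fallback implies, since postlog2 is placed before complex1), reduces "did complex1 fit?" to a precomputed `complex1_fits` array, and simplifies each retry to costing exactly one move. The moves are safe because they forward postlog2's result, which is an ordinary float.

```c
#include <assert.h>
#include <stdbool.h>

struct placement {
    int postlog2_instr;  /* -1 if no legal placement was found */
    int complex1_instr;  /* postlog2_instr + 1 when placed */
    int moves_inserted;  /* movs forwarding postlog2's result to its user */
};

/* Instruction 0 is the last instruction; higher indices are earlier ones.
 * complex1_fits[i] stands in for "the real scheduler could place complex1
 * in instruction i". complex1 must sit directly before postlog2, because
 * no move may separate them. */
static struct placement schedule_log2(const bool *complex1_fits, int num_instrs)
{
    struct placement p = { -1, -1, 0 };

    for (int post = 0; post + 1 < num_instrs; post++) {
        if (complex1_fits[post + 1]) {
            p.postlog2_instr = post;
            p.complex1_instr = post + 1;
            return p;
        }
        /* complex1 didn't fit: postlog2's slot becomes a move and postlog2
         * goes back on the ready list for the next earlier instruction. */
        p.moves_inserted++;
    }
    p.moves_inserted = 0;  /* nothing placed, so nothing to forward */
    return p;
}
```

For instance, with `complex1_fits = {false, false, false, true}` and four instructions, postlog2 ends up in instruction 2, complex1 in instruction 3, and two moves carry the result down to where postlog2 was first placed.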

Edited Jul 27, 2019 by Connor Abbott
