Draft: r300: utilize alternate temporary registers in vertex shaders (CXBX-R fixes)
This series aims to fix the main issues with the CXBX-R emulator with Wine-d3d. There is a small unrolling tweak, but the main part is making use of the alternate temporary register memory to fix the "run out of registers" VP errors. Unfortunately, even after this series it doesn't work properly: there is still heavy vertex corruption, but that remaining issue is the well-known uniform problem of Wine. The uniform problems should hopefully go away if I manage to make nine work with r300.
Alternate temporary register memory (ATRM) is memory designed for pair scheduling, as it is the only memory that the math part of a dual instruction can write. However, we don't support pair scheduling for now, so we can simply use ATRM registers in place of normal temporary registers, with some limitations.
Similarly to input or constant memory, we can't use two different sources from ATRM in a single instruction. Additionally, while a shader can use up to 20 ATRM registers, there are just 20 in total, so if we use them too eagerly we can lose NUM_CNTRLS (i.e., vertex processing concurrency) completely. If we use them with care, we can actually increase NUM_CNTRLS by reducing the number of normal temporary registers needed. There is no change in this regard for RV530 with my shader-db (we have 128 temp registers in total, so only using more than 25 would reduce NUM_CNTRLS, as 5 is the maximum); however, R300 and R400 only have 72, so it might help a bit there.
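To make the two constraints above concrete, here is a minimal sketch in C. It is not the code from this series: the `reg_file` enum, `src_reg` struct and both function names are hypothetical, and the concurrency model (each concurrently processed vertex needs its own copy of the registers, capped at 5, with the same cap assumed for R300/R400) is only my reading of the numbers quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical register files, for illustration only. */
enum reg_file { FILE_TEMP, FILE_ATEMP, FILE_INPUT, FILE_CONST };

struct src_reg {
    enum reg_file file;
    unsigned index;
};

/* Returns true if the sources read at most one distinct ATRM register.
 * Reading the same ATRM register in several source slots is allowed. */
static bool atemp_sources_ok(const struct src_reg *srcs, unsigned count)
{
    bool seen = false;
    unsigned seen_index = 0;

    for (unsigned i = 0; i < count; i++) {
        if (srcs[i].file != FILE_ATEMP)
            continue;
        if (seen && srcs[i].index != seen_index)
            return false; /* second, different ATRM source */
        seen = true;
        seen_index = srcs[i].index;
    }
    return true;
}

#define MAX_NUM_CNTRLS 5  /* maximum quoted in the text */
#define TOTAL_ATEMPS   20 /* ATRM registers available in total */

static unsigned min_u(unsigned a, unsigned b)
{
    return a < b ? a : b;
}

/* Assumed model: NUM_CNTRLS is limited by how many copies of the
 * shader's temps and atemps fit into the respective memories. */
static unsigned num_cntrls(unsigned total_temps, unsigned temps_used,
                           unsigned atemps_used)
{
    assert(temps_used <= total_temps && atemps_used <= TOTAL_ATEMPS);

    unsigned n = MAX_NUM_CNTRLS;
    if (temps_used)
        n = min_u(n, total_temps / temps_used);
    if (atemps_used)
        n = min_u(n, TOTAL_ATEMPS / atemps_used);
    return n;
}
```

Under this model, a shader with 25 temps keeps NUM_CNTRLS at 5 with 128 temp registers but drops to 2 with 72, and a shader that uses all 20 atemps is limited to 1 regardless of its normal temp count.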
Also, if we ever support pair scheduling, we could still use the logic from this patch for reusing input registers when we run out of normal temporaries (because the same single-input-read limitation would apply there as well). Arguably, the current vertex register allocator is really stupid, so I can't rule out that we could solve the issues without ATRM just by making it smarter. However, I did not want to go this way, since I'm still hoping we will have pair scheduling one day, and then it would need a rewrite anyway. So this is supposed to be a sort of medium-term solution.
This is a draft right now for several reasons:
- I would be happy for any feedback regarding the whole concept.
- There is a small increase in register usage (~2% in shader-db). This happens when an atemp we allocated previously is no longer live and could in theory be reused, but can't be because of a conflict with another atemp, so we have to allocate another normal temp register (which we wouldn't have to do if we had used a normal temp in the first place). I still need to tweak the logic that chooses between temp and atemp a bit.
- The conflict (interference) tracking of atemps is really stupid: there is just a 2D bool array where the conflicts are saved, so there should probably be a smarter structure to save some space.
- Proper benchmarking is missing. I ran Sanctuary and Lightsmark and there might be something like a 0.5% fps slowdown, but with such a small change it will need many more runs to be sure; it could just be noise.
- A dEQP run is still missing (piglit is happy though).
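Regarding the conflict tracking item, one possible space saving is sketched below: since interference is symmetric and a register never conflicts with itself, only the lower triangle needs storing, one bit per pair. This is just an illustration, not what the series does; `MAX_REGS`, `struct conflict_set` and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_REGS   128 /* hypothetical upper bound on tracked registers */
#define PAIR_COUNT (MAX_REGS * (MAX_REGS - 1) / 2)

/* One bit per unordered register pair (lower triangle, no diagonal). */
struct conflict_set {
    uint32_t bits[(PAIR_COUNT + 31) / 32];
};

/* Maps an unordered pair {a, b}, a != b, to its bit position. */
static unsigned pair_index(unsigned a, unsigned b)
{
    assert(a != b && a < MAX_REGS && b < MAX_REGS);
    if (a < b) {
        unsigned t = a;
        a = b;
        b = t;
    }
    /* Row a of the lower triangle starts at a * (a - 1) / 2. */
    return a * (a - 1) / 2 + b;
}

static void conflict_init(struct conflict_set *s)
{
    memset(s->bits, 0, sizeof(s->bits));
}

static void conflict_add(struct conflict_set *s, unsigned a, unsigned b)
{
    unsigned i = pair_index(a, b);
    s->bits[i / 32] |= UINT32_C(1) << (i % 32);
}

static bool conflict_test(const struct conflict_set *s, unsigned a, unsigned b)
{
    unsigned i = pair_index(a, b);
    return (s->bits[i / 32] >> (i % 32)) & 1;
}
```

With 128 registers this takes 1016 bytes, compared to 16 KiB for a 128x128 bool array with one-byte bools. Whether that matters in practice depends on how many registers actually need tracking.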