WIP: tu: FSDT support for A7XX
This merge request adds support in turnip for A7XX's CP_FIXED_STRIDE_DRAW_TABLE. Correctness has so far been tested with specific CTS tests (a full CTS run is incoming), Vulkan demos, and game traces/games on A750.
CP_FIXED_STRIDE_DRAW_TABLE provides the hardware with a table of draws that all have exactly the same size, which allows a draw to be skipped without corrupting state when it is found to be dead (e.g. a draw that falls outside the current bin).
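To illustrate why a fixed stride makes skipping cheap, here is a minimal host-side sketch in C. It is only a conceptual model: `struct draw_table`, `draw_table_entry`, `draw_table_walk` and the `visible` array are hypothetical names invented for this example, and nothing here reflects the real packet layout or how the CP actually decides that a draw is dead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a fixed-stride table, NOT the real packet layout. */
struct draw_table {
   const uint32_t *dwords; /* count * stride dwords, stored back to back */
   uint32_t stride;        /* dwords per entry, identical for every entry */
   uint32_t count;         /* number of entries in the table */
};

/* Entry i always starts at i * stride, so locating it never requires
 * parsing (or executing) any of the entries that come before it.
 */
static const uint32_t *
draw_table_entry(const struct draw_table *t, uint32_t i)
{
   assert(i < t->count);
   return t->dwords + (size_t)i * t->stride;
}

/* Walk the table for one bin. visible[i] is an assumed per-draw liveness
 * flag. For every live entry, its first dword is recorded in executed[] as
 * a stand-in for executing the draw. Returns the number of live entries.
 */
static uint32_t
draw_table_walk(const struct draw_table *t, const bool *visible,
                uint32_t *executed)
{
   uint32_t n = 0;

   for (uint32_t i = 0; i < t->count; i++) {
      if (!visible[i])
         continue; /* dead draw: step over it by a known, fixed amount */
      executed[n++] = draw_table_entry(t, i)[0];
   }

   return n;
}
```

In this model a skipped entry costs nothing beyond the liveness check, which is only safe if each surviving entry does not depend on state set by an entry that was skipped.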
In this implementation we end the draw table on a few different occasions:
- on barriers inside a render pass.
- on indirect draws, as the proprietary driver does (they are broken in every case where drawCount != 1).
- when switching draw types, to avoid having to pad draws with NOPs. This case doesn't seem to happen particularly often, and changing this behaviour would be trivial in the future.
- on other commands that can be executed inside a render pass and expect previous draws to have completed, such as xfb-related commands and attachment clears.
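The rules above can be sketched as a small tracker in C. This is an illustration only and is not code from this MR: `enum cmd_kind`, `struct table_tracker`, `tracker_end` and `tracker_record` are hypothetical names, and the sketch assumes that an indirect draw is emitted outside of any table once the current table has been ended.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical command kinds, named only for this sketch. */
enum cmd_kind {
   CMD_DRAW,
   CMD_DRAW_INDEXED,
   CMD_DRAW_INDIRECT,
   CMD_BARRIER,
   CMD_XFB,
   CMD_CLEAR_ATTACHMENTS,
};

struct table_tracker {
   bool open;             /* is a draw table currently being built? */
   enum cmd_kind type;    /* draw type of the open table */
   uint32_t tables_ended; /* number of tables closed so far */
};

/* Close the open table, if any. Safe to call when no table is open. */
static void
tracker_end(struct table_tracker *tr)
{
   if (tr->open) {
      tr->open = false;
      tr->tables_ended++;
   }
}

/* Apply the table-ending rules to one recorded command. */
static void
tracker_record(struct table_tracker *tr, enum cmd_kind cmd)
{
   switch (cmd) {
   case CMD_DRAW:
   case CMD_DRAW_INDEXED:
      /* Switching draw types ends the table rather than padding with NOPs. */
      if (tr->open && tr->type != cmd)
         tracker_end(tr);
      tr->open = true;
      tr->type = cmd;
      break;
   default:
      /* Indirect draws, barriers, xfb commands and attachment clears all
       * end the table. Assumption: the indirect draw itself is emitted
       * outside of any table.
       */
      tracker_end(tr);
      break;
   }
}
```

With this model, the sequence draw, draw, indexed draw, barrier, draw, indirect draw, draw ends three tables while recording and leaves a fourth open to be closed at the end of the render pass.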
Draw states have been consolidated and some state has been moved into draw states to ease full state emission.
As the feature stands now, it significantly increases the overhead of draws in vkoverhead. The bottleneck is entirely in the number of draw states emitted per draw. The proprietary driver does not have this issue because it uses far fewer draw states: for vkoverhead the tipping point appears to be at 25 draw states on a Snapdragon 8 Gen 3 board, and the proprietary driver uses fewer than 20 states to our 32. Tests on apps reveal no slowdown so far, and no improvement from an experimental hack that reduces draw states to a number that brings the vkoverhead results back in line with main.
Performance generally matches that of main, with no significant changes, but the number of draws actually executed is consistently reduced by ~50%. Investigation into performance benefits is still ongoing, as the proprietary driver also uses FSDT in almost every situation.
Results are exactly the same with some experimental LRZ-enabling patches (with and without conservative LRZ).