tu: Work around a quirk of CCU flushing
This bug has been observed in two places with TU_DEBUG=sysmem:
- Sporadically when dEQP-VK.renderpass.suballocation.formats.d16_unorm.* is run in a series.
- Consistently with dEQP-VK.renderpass.suballocation.subpass_dependencies.late_fragment_tests.render_size_128_128.subpass_count_3.d24_unorm_s8_uint.
In both cases, a blit/draw is followed by a CCU_FLUSH_* event and then by another blit/draw that reads the color/depth attachment under a different cache domain. Even though the flushes are emitted correctly, the read is corrupted. This has been observed with both color and depth flushes, and with and without UBWC.

In the second test, the subsequent blit/draw is in the next subpass. When forced to use sysmem, the blob flushes the CCU at the end of each subpass and emits a WFI + write to RB_CCU_CNTL between subpasses. This suggests that the correct workaround is to WFI after the flush, so add a WFI after every CCU_FLUSH_*.
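
The rule "WFI after every CCU_FLUSH_*" can be illustrated with a minimal, self-contained sketch. This is not the driver code: the `mock_*` types, the `MOCK_*` event names, and the array-backed command stream are all hypothetical stand-ins, used only to show where the WFI lands relative to the flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event names for illustration; not the driver's real enums. */
enum mock_event {
   MOCK_DRAW,
   MOCK_CCU_FLUSH_COLOR,
   MOCK_CCU_FLUSH_DEPTH,
   MOCK_WFI,
};

#define MOCK_CS_MAX 64

/* Hypothetical command stream: just a fixed-size list of recorded events. */
struct mock_cs {
   enum mock_event events[MOCK_CS_MAX];
   size_t count;
};

static void mock_cs_push(struct mock_cs *cs, enum mock_event ev)
{
   assert(cs->count < MOCK_CS_MAX);
   cs->events[cs->count++] = ev;
}

static bool mock_is_ccu_flush(enum mock_event ev)
{
   return ev == MOCK_CCU_FLUSH_COLOR || ev == MOCK_CCU_FLUSH_DEPTH;
}

/* Record an event; every CCU flush is immediately followed by a WFI. */
static void mock_emit_event(struct mock_cs *cs, enum mock_event ev)
{
   mock_cs_push(cs, ev);
   if (mock_is_ccu_flush(ev))
      mock_cs_push(cs, MOCK_WFI);
}
```

Placing the WFI inside the single event-emitting helper, rather than at each call site, means no flush can be emitted without it; whether the real change is structured this way is an assumption of this sketch.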