Samuel implemented something similar a bit earlier, but this:
- Does a refactoring so that transitions also benefit from not having to flush the L2 cache.
- Switches CP DMA to use L2 on GFX9+, so that we can avoid invalidating L2 after TRANSFER_WRITE.
- Avoids doing any L2 cache flushes on GFX10 (when the TCC isn't harvested).
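
As a rough illustration of the conditions in the list above, here is a minimal sketch in C. It is not the driver code: the `gpu_info` struct, the `gfx_level` enum, and all three helper functions are hypothetical names invented for this example, and the model is deliberately simplified to the two rules stated above (CP DMA goes through L2 on GFX9+, and L2 flushes can be skipped on GFX10 when the TCC isn't harvested).

```c
#include <stdbool.h>

/* Hypothetical, simplified model; none of these names come from the driver. */
enum gfx_level { GFX8 = 8, GFX9 = 9, GFX10 = 10 };

struct gpu_info {
    enum gfx_level gfx_level;
    bool tcc_harvested; /* true if some L2 (TCC) blocks are disabled */
};

/* Assumption from the description: CP DMA is routed through L2 on GFX9+. */
static bool cp_dma_uses_l2(const struct gpu_info *info)
{
    return info->gfx_level >= GFX9;
}

/* If CP DMA writes go through L2, L2 already holds the new data, so no
 * L2 invalidation is needed after a TRANSFER_WRITE done by CP DMA. */
static bool transfer_write_needs_l2_inv(const struct gpu_info *info)
{
    return !cp_dma_uses_l2(info);
}

/* Assumption from the description: L2 flushes can be skipped entirely on
 * GFX10, but only when the TCC isn't harvested. */
static bool can_skip_l2_flush(const struct gpu_info *info)
{
    return info->gfx_level == GFX10 && !info->tcc_harvested;
}
```

In this sketch a barrier or layout transition would consult `can_skip_l2_flush()` before emitting an L2 flush, which is the kind of shared decision point the refactoring is meant to give transitions as well.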
Tested that the refactoring doesn't change performance at all on GFX10, and that the L2 cache flush elimination gains ~2% performance on Basemark.