ac,radeonsi: clear rework, compute/cpdma flags rework, copy shader optimizations, etc. (BIG MR)

Marek Olšák requested to merge mareko/mesa:si-clear-retile-compute into master

This MR continues in !10003 (merged).

Below is the first half.

  • Explicit DCC/CMASK clears are parallelized.
  • HTILE is enabled for all levels where it's possible (not just level 0).
  • Sync flags for CP DMA and internal compute are reworked. All callers can now specify when they want to sync (e.g. before and/or after the operation).
  • The maximum variable compute shader workgroup size is decreased from 1024 to 512 threads to optimize user SGPR usage in internal shaders (so that the size can be packed in 10 bits per channel).
  • Some internal compute shaders are optimized.
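
To illustrate the 10-bits-per-channel packing mentioned above, here is a minimal sketch, not the actual radeonsi code. It assumes the three workgroup dimensions are packed into a single 32-bit value (3 x 10 = 30 bits); the helper and macro names are hypothetical. A size of 512 fits in a 10-bit field (maximum 1023), whereas 1024 would need 11 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration; not taken from the radeonsi source. */
#define BLOCK_SIZE_BITS 10u
#define BLOCK_SIZE_MASK ((1u << BLOCK_SIZE_BITS) - 1u) /* 0x3ff = 1023 */

/* Pack the x/y/z workgroup size into one 32-bit word, 10 bits per channel. */
static uint32_t pack_block_size(uint32_t x, uint32_t y, uint32_t z)
{
   /* Each channel must fit in 10 bits; 512 does, 1024 does not. */
   assert(x <= BLOCK_SIZE_MASK && y <= BLOCK_SIZE_MASK && z <= BLOCK_SIZE_MASK);
   return x | (y << BLOCK_SIZE_BITS) | (z << (2u * BLOCK_SIZE_BITS));
}

/* Recover the three channels from the packed word. */
static void unpack_block_size(uint32_t packed, uint32_t out[3])
{
   out[0] = packed & BLOCK_SIZE_MASK;
   out[1] = (packed >> BLOCK_SIZE_BITS) & BLOCK_SIZE_MASK;
   out[2] = (packed >> (2u * BLOCK_SIZE_BITS)) & BLOCK_SIZE_MASK;
}
```

Under this assumption the whole size costs one user SGPR instead of three, which is presumably the saving the bullet refers to.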

Tested with piglit, glcts, and deqp on:

  • gfx6-7
  • gfx8 (Polaris11)
  • gfx9 (Vega10)
  • gfx10 (Navi14)
  • gfx10.3 (Sienna)
Edited by Marek Olšák

