freedreno/a6xx: CPU and CP overhead optimizations
A combination of tweaks to reduce both CP overhead and CPU overhead. CP overhead is especially important for high draw counts when not in bypass mode, i.e. where the CP can potentially need to execute each state change per-bin. Mostly from spending some time looking at gfxbench gl_driver2, which does a ton of tiny draws with frequent state changes.
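To illustrate why per-bin execution makes CP overhead matter so much in the GMEM case, here is a rough back-of-the-envelope sketch. It is not driver code: `worst_case_cp_state_packets()` and its parameters are hypothetical names for illustration, and it assumes the worst case where every state change is re-executed for every bin.

```c
#include <stdint.h>

/* Hypothetical cost model, not taken from the driver: estimates how many
 * state-change packets the CP processes for one render pass in the worst
 * case.
 */
static uint64_t
worst_case_cp_state_packets(uint32_t state_changes, uint32_t num_bins,
                            int bypass)
{
   /* In bypass mode the draw commands are executed once.  In GMEM mode
    * they can potentially be executed once per bin.
    */
   uint32_t passes = bypass ? 1 : num_bins;

   return (uint64_t)state_changes * passes;
}
```

With, say, 1000 state changes and 16 bins, this worst case gives 16000 packets for GMEM versus 1000 for bypass, which is why shaving CP work per state change pays off more as the draw count goes up.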
The overall improvements are:
- GMEM case (CP limited): 10%
- bypass case (CPU limited): 50%
Of course, we still need to figure out how to do something like #2798 (closed) to detect that, despite the high number of draw calls, we would be better off using sysmem.