llvmpipe: optimise compute shader launch and coroutine interactions
The CL CTS conversions test is horrible; this takes about 1/4 of the execution time off it. It still takes days.
The two big things were that the kernel launch for 49000x1x1 workgroups was very suboptimal, so it was hitting a mutex a lot,
and that the coroutine stuff wasn't as optimised as I thought it was.