ac,radeonsi: add optimized clear/copy_buffer compute shader into AMD common code, supporting unaligned copies
This is a substantial improvement of the clear/copy_buffer compute shader in radeonsi, which is also moved to src/amd/common.
This adds support for unaligned buffer clears and copies while maintaining the same performance as aligned clears and copies. The optimal alignment for buffer offsets is 256, not 4.
More chip-specific tuning will follow, but this is already optimal for Navi31.