aco/gfx10+: work around non uniform ds_append wave64 result
In wave64 on hw with native wave32, ds_append seems to be split into a load for the low half and an atomic for the high half, and other LDS instructions can be scheduled between the two. This means the result of the low half is unusable because it might be out of date.
I was only able to reproduce this issue in WGP mode, but be conservative and apply the workaround in CU mode too.
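The hazard described above can be illustrated with a small toy model. This is only a sketch: the `Lds` class, `split_ds_append`, and the `interleaved` callback are hypothetical names invented for illustration, and the load/atomic split is the behaviour this description suspects, not documented hardware semantics. It does not show the workaround itself, only why the low-half value cannot be trusted.

```python
# Toy model only: the "load + atomic" split is the suspected behaviour,
# not documented hardware semantics. All names here are hypothetical.

class Lds:
    """A single LDS counter shared by all waves."""

    def __init__(self, value=0):
        self.counter = value

    def load(self):
        # Plain read, no modification.
        return self.counter

    def atomic_add(self, n):
        # Returns the pre-op value, like an append counter would.
        old = self.counter
        self.counter += n
        return old


def split_ds_append(lds, active_lanes, interleaved=None):
    """Model a wave64 ds_append executed as two wave32 halves.

    Returns (low_half_result, high_half_result).
    """
    low = lds.load()                     # low half: plain load
    if interleaved is not None:
        interleaved(lds)                 # another LDS op lands in between
    high = lds.atomic_add(active_lanes)  # high half: the actual atomic
    return low, high


lds = Lds()
# Nothing scheduled in between: both halves agree.
print(split_ds_append(lds, 64))  # (0, 0)
# Another wave appends 64 entries between the two halves.
print(split_ds_append(lds, 64, lambda m: m.atomic_add(64)))  # (64, 128)
```

In the second call the low half reports 64 while the atomic actually reserved the range starting at 128, so in this model only the high-half value matches what the atomic did.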
Closes: #11921
Fixes: 45e93580 ("aco: implement nir_shared_append/consume_amd")
Edited by Georg Lehmann