aco: some extract() and other optimizations for 16bit
The general idea of this MR is to better re-use packed vec2fp16 values as v1 SSA, and access them via swizzles and SDWA. This can avoid lots of copies, especially when more aggressive vectorization is in place.
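
To illustrate the idea outside the compiler, here is a minimal standalone sketch of keeping two 16-bit values in one 32-bit dword and reading either half in place. It is not ACO code: `Half`, `pack_2x16`, `extract_half` and `swizzle_2x16` are hypothetical names made up for this example, and the selector only loosely mirrors the notion of a word select.

```cpp
#include <cstdint>

// Hypothetical selector for this sketch only; it loosely mirrors the idea of
// picking the low or high 16-bit word of a 32-bit register.
enum class Half : unsigned { Lo = 0, Hi = 1 };

// Pack two raw 16-bit values (e.g. fp16 bit patterns) into one 32-bit dword.
constexpr uint32_t pack_2x16(uint16_t lo, uint16_t hi) {
    return static_cast<uint32_t>(lo) | (static_cast<uint32_t>(hi) << 16);
}

// Read one half straight from the packed dword, so the consumer does not
// need a separately stored, unpacked copy of the component.
constexpr uint16_t extract_half(uint32_t packed, Half sel) {
    return static_cast<uint16_t>(packed >> (16u * static_cast<unsigned>(sel)));
}

// Swizzle: build a packed value whose halves are each chosen from the source.
constexpr uint32_t swizzle_2x16(uint32_t packed, Half new_lo, Half new_hi) {
    return pack_2x16(extract_half(packed, new_lo), extract_half(packed, new_hi));
}
```

With the fp16 bit patterns for 1.0 (`0x3C00`) and 2.0 (`0x4000`) packed together, `extract_half(v, Half::Hi)` reads the second component directly, and `swizzle_2x16(v, Half::Hi, Half::Lo)` swaps the two components. In hardware the selection would be expressed on the consuming instruction rather than as separate shifts, which is where the saved copies come from.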