aco: Add faster code path to store_lds for consecutive write mask.
This makes it more likely that split_store_data hits its fast path for count == 1.
As a result, slightly better code (without SDWA) is generated from u2u8 + store_shared in NGG culling shaders.
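
For illustration only, here is a minimal standalone sketch of what detecting a consecutive write mask can look like. It is not the code from this change: the names is_consecutive_mask, Run and consecutive_run are hypothetical, and the sketch assumes a 32-bit mask with one bit per written element.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not ACO's actual code.
// Returns true if the set bits of `mask` form exactly one unbroken run,
// e.g. 0b0110 (elements 1 and 2), but not 0b0101.
inline bool is_consecutive_mask(uint32_t mask)
{
    if (mask == 0)
        return false;
    // Isolate the lowest set bit.
    uint32_t lowest = mask & (~mask + 1u);
    // Adding it carries through the whole lowest run of ones. If no
    // originally set bit survives, that run was the only one.
    return ((mask + lowest) & mask) == 0;
}

// Hypothetical description of a run: first written element and length.
struct Run {
    unsigned start;
    unsigned count;
};

// Returns the position and length of the run in a consecutive mask.
inline Run consecutive_run(uint32_t mask)
{
    assert(is_consecutive_mask(mask));
    unsigned start = 0;
    while (((mask >> start) & 1u) == 0)
        ++start;
    unsigned count = 0;
    while (start + count < 32 && ((mask >> (start + count)) & 1u) != 0)
        ++count;
    return {start, count};
}
```

With a mask like 0b0110 the sketch reports a single run starting at element 1 with length 2, which is the kind of case that can be handled as one store instead of being split per element.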