intel/fs: don't forget the stride in generate_shuffle()
In generate_shuffle(), byte-sized registers end up with a destination stride of 2. We don't take that stride into account when selecting the group offset for the last MOV operation, so the data is moved to the wrong place and the last few channels are left untouched. Take the destination stride into consideration so we don't miss the last channels.
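To illustrate the failure mode, here is a minimal standalone sketch. It is not the Mesa code: `strided_copy`, `kUntouched`, and the `apply_stride` flag are hypothetical names made up for this example, and the model simply assumes that logical channel `c` lives at physical element `c * stride` and that the copy is done one group of channels at a time.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sentinel marking destination elements that were never written.
constexpr int kUntouched = -1;

// Hypothetical model of a grouped copy into a strided destination.
// Logical channel c is assumed to live at physical element c * stride.
// With apply_stride == false the group base offset ignores the stride,
// which is the kind of mistake described above.
std::vector<int> strided_copy(const std::vector<int>& src,
                              std::size_t stride,
                              std::size_t group_width,
                              bool apply_stride)
{
    std::vector<int> dst(src.size() * stride, kUntouched);

    for (std::size_t group = 0; group < src.size(); group += group_width) {
        // Base offset of this group inside the destination.
        const std::size_t base = apply_stride ? group * stride : group;

        for (std::size_t i = 0; i < group_width && group + i < src.size(); ++i) {
            dst[base + i * stride] = src[group + i];
        }
    }
    return dst;
}
```

With 8 channels, a stride of 2, and groups of 4, the variant that ignores the stride writes the second group on top of part of the first one and never reaches the elements belonging to channels 6 and 7. The variant that scales the group offset by the stride places every channel where it is expected.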
Cc: @jekstrand
PS: I haven't received our CI results for this yet, although I did run local tests. So please don't merge yet.