gallium/auxiliary: Reduce conversions in u_vbuf_get_minmax_index_mapped

Icecream95 requested to merge icecream95/mesa:optimize-minmax into master

With this patch, GCC generates vectorized code that does the comparisons with the min and max values without converting the indices to 32-bit first.

This optimization makes u_vbuf_get_minmax_index_mapped almost twice as fast on ARM NEON, and should also speed up vectorized code on other platforms.

Without vectorization, the function is still a percent or two faster, but slightly larger.
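
For illustration, the change amounts to something like the following. This is a minimal sketch and not the actual Mesa code: the names minmax_u16_wide and minmax_u16_narrow are made up for the example, and primitive restart handling is left out. The only difference between the two functions is the type of the running min and max.

```c
#include <stddef.h>

/* Hypothetical "before" shape: the accumulators are 32-bit, so every
 * 16-bit index has to be widened before it can be compared. */
static void
minmax_u16_wide(const unsigned short *indices, size_t count,
                unsigned *out_min, unsigned *out_max)
{
   unsigned min = ~0u;
   unsigned max = 0;

   for (size_t i = 0; i < count; i++) {
      if (indices[i] > max) max = indices[i];
      if (indices[i] < min) min = indices[i];
   }

   *out_min = min;
   *out_max = max;
}

/* Hypothetical "after" shape: the accumulators have the same width as
 * the indices, and the result is widened once, after the loop. */
static void
minmax_u16_narrow(const unsigned short *indices, size_t count,
                  unsigned *out_min, unsigned *out_max)
{
   unsigned short min = ~((unsigned short)0);
   unsigned short max = 0;

   for (size_t i = 0; i < count; i++) {
      if (indices[i] > max) max = indices[i];
      if (indices[i] < min) min = indices[i];
   }

   *out_min = min;
   *out_max = max;
}
```

Both variants return the same values. Keeping the accumulators at the index width is what gives the compiler the option of doing the comparisons in 16-bit lanes.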

Perf data from running vblank_mode=0 neverball -l data/map-paxed3/hypnos.sol for one minute, with Mesa compiled with CFLAGS='-mfpu=neon' and --buildtype=release.

Before:

     9.79%  neverball  [.] panfrost_flush_all_batches
     8.77%  neverball  [.] u_vbuf_get_minmax_index_mapped
     5.08%  neverball  [.] memcpy
     3.41%  neverball  [.] __udivsi3

After:

    11.69%  neverball  [.] panfrost_flush_all_batches
     5.20%  neverball  [.] memcpy
     4.78%  neverball  [.] u_vbuf_get_minmax_index_mapped
     3.78%  neverball  [.] __udivsi3

The old code used ~0u instead of -1, so I used a similar ~((unsigned short)0) for the smaller datatypes, though it would probably be clearer (if not as correct) to just use -1.
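
As a quick check of the two spellings, here is a small sketch (the helper names ushort_all_ones and ushort_minus_one are invented for this example and are not part of the patch). It assumes int is wider than short, as on the usual targets.

```c
#include <limits.h>

/* (unsigned short)0 is promoted to int before ~ is applied, so the
 * expression has the int value -1. Converting that back to unsigned
 * short wraps it to USHRT_MAX. */
static unsigned short
ushort_all_ones(void)
{
   unsigned short v = ~((unsigned short)0);
   return v;
}

/* Assigning -1 directly goes through the same conversion. */
static unsigned short
ushort_minus_one(void)
{
   unsigned short v = -1;
   return v;
}
```

Under that assumption both helpers return USHRT_MAX, so the choice between the two forms is about readability and about matching the existing ~0u.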

Here is the inner loop for the 4th case (ushort, no primitive_restart), before and after the patch.

Before:

<+1156>:  vld1.16   {d18-d19}, [r0]!   ; Load 16-bit values
<+1160>:  vmovl.u16 q10, d18           ; Conversion to 32-bit here
<+1164>:  cmp       r0, lr
<+1168>:  vmovl.u16 q9, d19            ;  and here
<+1172>:  vmax.u32  q12, q10, q9
<+1176>:  vmin.u32  q9, q10, q9
<+1180>:  vmax.u32  q8, q8, q12
<+1184>:  vmin.u32  q11, q11, q9
<+1188>:  bne       <u_vbuf_get_minmax_index_mapped+1156>

After:

<+1180>:  vld1.16   {d16-d17}, [r0]!   ; Load 16-bit values
<+1184>:  cmp       r0, r12
<+1188>:  vmax.u16  q9, q9, q8         ; Max and min directly on 16-bit lanes
<+1192>:  vmin.u16  q10, q10, q8
<+1196>:  bne       <u_vbuf_get_minmax_index_mapped+1180>

