With this patch, GCC generates vectorized code that does the comparisons with the min and max values without converting the indices to 32-bit first.
This optimization makes u_vbuf_get_minmax_index_mapped almost twice as fast on ARM NEON, and should speed up vectorized code on other platforms.
Without vectorization, the function is still a percent or two faster, but slightly larger.
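Below is a minimal sketch of the scan pattern involved (illustrative only, not
the actual Mesa source; the function name is made up). The point is that the
running min/max stay in unsigned short, so GCC can use vmin.u16/vmax.u16 on
the loaded vectors directly; with 32-bit accumulators it has to widen each
loaded vector first, as in the "Before" disassembly further down.

#include <stdint.h>
#include <stddef.h>

static void scan_minmax_u16(const uint16_t *indices, size_t count,
                            unsigned *out_min, unsigned *out_max)
{
   uint16_t min = ~((uint16_t)0); /* all-ones: largest 16-bit value */
   uint16_t max = 0;

   for (size_t i = 0; i < count; i++) {
      if (indices[i] < min)
         min = indices[i];
      if (indices[i] > max)
         max = indices[i];
   }

   /* Widen to 32 bits only once, at the end. */
   *out_min = min;
   *out_max = max;
}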
Perf data from running vblank_mode=0 neverball -l data/map-paxed3/hypnos.sol
for one minute. Mesa was compiled with CFLAGS='-mfpu=neon' and
--buildtype=release.
Before:
9.79% neverball rockchip_dri.so [.] panfrost_flush_all_batches
8.77% neverball rockchip_dri.so [.] u_vbuf_get_minmax_index_mapped
5.08% neverball libc-2.29.so [.] memcpy
3.41% neverball libgcc_s.so.1 [.] __udivsi3
After:
11.69% neverball rockchip_dri.so [.] panfrost_flush_all_batches
5.20% neverball libc-2.29.so [.] memcpy
4.78% neverball rockchip_dri.so [.] u_vbuf_get_minmax_index_mapped
3.78% neverball libgcc_s.so.1 [.] __udivsi3
The old code used ~0u instead of -1, so I used a similar ~((unsigned short)0)
for the smaller datatypes, though it would probably be clearer (if not as
correct) to just use -1.
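As a quick standalone illustration (not part of the patch), both spellings end
up as the same all-ones value once stored in an unsigned short:

#include <assert.h>
#include <limits.h>

int main(void)
{
   unsigned short a = ~((unsigned short)0); /* promoted to int, ~0 == -1,
                                               converted back to 0xFFFF */
   unsigned short b = -1;                   /* -1 converted to unsigned short
                                               also wraps to USHRT_MAX */
   assert(a == b && a == USHRT_MAX);
   return 0;
}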
Here is the inner loop for the 4th case (ushort, no primitive_restart):
Before:
<+1156>: vld1.16 {d18-d19}, [r0]! ; Load 16-bit values
<+1160>: vmovl.u16 q10, d18 ; Conversion to 32-bit here
<+1164>: cmp r0, lr
<+1168>: vmovl.u16 q9, d19 ; and here
<+1172>: vmax.u32 q12, q10, q9
<+1176>: vmin.u32 q9, q10, q9
<+1180>: vmax.u32 q8, q8, q12
<+1184>: vmin.u32 q11, q11, q9
<+1188>: bne <u_vbuf_get_minmax_index_mapped+1156>
After:
<+1180>: vld1.16 {d16-d17}, [r0]! ; Load 16-bit values
<+1184>: cmp r0, r12
<+1188>: vmax.u16 q9, q9, q8 ; 16-bit max directly, no widening
<+1192>: vmin.u16 q10, q10, q8 ; 16-bit min directly
<+1196>: bne <u_vbuf_get_minmax_index_mapped+1180>