vc4: Use NEON to speed up utile loads on Pi2.
We had a lot of memcpy call overhead because gpu_stride wasn't being inlined. But if you split out the stride==8 and stride==16 cases like this code does while still using memcpy, you'd no longer have glibc's NEON memcpy applied, at which point we'd be doing 16 uncached reads instead of 64/(NEON memcpy granularity), for about a 30% performance hit.

By hand-writing the assembly, we can get a whole cacheline loaded at a time. Unfortunately, NEON intrinsics turned out to be unusable: they didn't have the vldm instruction available.

Note that, for now, the NEON code is only enabled when building for ARMv7 (Pi 2+). We may want to do runtime detection for the Raspbian case in the future.

Improves 1024x1024 GetTexImage by 208.256% +/- 7.07029% (n=10).
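For readers without the diff at hand, the stride split described above can be sketched in portable C roughly as follows. This is not the code from this commit: the function names (`load_utile_generic`, `load_rows8`, `load_rows16`, `load_utile_split`) and the `UTILE_SIZE` constant are hypothetical, and the 64-byte utile size is an assumption taken from the "64/" and "whole cacheline" wording in the message. The sketch only shows the dispatch on stride; the commit replaces the per-row copies in the specialized paths with hand-written NEON assembly on ARMv7, which is deliberately not reproduced here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: one utile is 64 contiguous bytes on the GPU side. */
#define UTILE_SIZE 64u

/* Generic path: the row length is only known at run time, so every
 * row becomes a real memcpy() call with a variable size. */
static void load_utile_generic(uint8_t *cpu, uint32_t cpu_stride,
                               const uint8_t *gpu, uint32_t gpu_stride)
{
    assert(gpu_stride != 0 && UTILE_SIZE % gpu_stride == 0);
    for (uint32_t off = 0; off < UTILE_SIZE; off += gpu_stride) {
        memcpy(cpu, gpu + off, gpu_stride);
        cpu += cpu_stride;
    }
}

/* Specialized path for 8-byte rows: the copy size is a compile-time
 * constant, so the compiler can expand it inline instead of calling
 * into the C library. */
static inline void load_rows8(uint8_t *cpu, uint32_t cpu_stride,
                              const uint8_t *gpu)
{
    for (uint32_t off = 0; off < UTILE_SIZE; off += 8) {
        memcpy(cpu, gpu + off, 8);
        cpu += cpu_stride;
    }
}

/* Specialized path for 16-byte rows, same idea as above. */
static inline void load_rows16(uint8_t *cpu, uint32_t cpu_stride,
                               const uint8_t *gpu)
{
    for (uint32_t off = 0; off < UTILE_SIZE; off += 16) {
        memcpy(cpu, gpu + off, 16);
        cpu += cpu_stride;
    }
}

/* Hypothetical entry point: dispatch on the two strides named in the
 * commit message, falling back to the generic loop otherwise. */
static void load_utile_split(uint8_t *cpu, uint32_t cpu_stride,
                             const uint8_t *gpu, uint32_t gpu_stride)
{
    if (gpu_stride == 8)
        load_rows8(cpu, cpu_stride, gpu);
    else if (gpu_stride == 16)
        load_rows16(cpu, cpu_stride, gpu);
    else
        load_utile_generic(cpu, cpu_stride, gpu, gpu_stride);
}
```

Both entry points write the same bytes to the destination; the difference is only in how the row copies are issued. As the message notes, small constant-size copies like these no longer go through glibc's NEON memcpy, which is why the commit goes one step further and loads the whole utile with a single `vldm` before storing it out row by row.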
Files changed:
- src/gallium/drivers/vc4/Makefile.am: 6 additions, 0 deletions
- src/gallium/drivers/vc4/vc4_tiling.h: 42 additions, 6 deletions
- src/gallium/drivers/vc4/vc4_tiling_lt.c: 67 additions, 12 deletions