util: NEON optimization for format unpack (deqp perf fix)
Since our freedreno runner farm is a fixed size but we keep wanting to test more stuff, I did a bit of looking to see if we had low-hanging fruit for making deqp finish faster. It turns out b8g8r8a8_unorm reads are 5-10% of the profile, and the bigger bus transactions you get from using SIMD are a real win there, though not nearly as large as one might hope.
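For illustration, a minimal sketch of the kind of row unpack described above, assuming the destination is tightly packed RGBA8 and that b8g8r8a8_unorm stores bytes in memory as B, G, R, A. The function name and the HAVE_NEON_SKETCH macro are hypothetical and are not the names used in the actual change; the vld4q_u8/vst4q_u8 approach is one possible way to get the wide loads and stores, not necessarily what the patch does. Without NEON the scalar loop handles every pixel, so the sketch builds anywhere.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON_SKETCH 1 /* hypothetical macro, for this sketch only */
#endif

/* Hypothetical helper: convert one row of B8G8R8A8 pixels to R8G8B8A8.
 * width is in pixels; src and dst each hold width * 4 bytes.
 */
static void
unpack_bgra8_to_rgba8_row(uint8_t *dst, const uint8_t *src, size_t width)
{
   size_t i = 0;

#ifdef HAVE_NEON_SKETCH
   /* 16 pixels (64 bytes) per iteration.  vld4q_u8 deinterleaves the
    * four channels into separate registers, so the swizzle is just a
    * swap of the B and R registers before the interleaving store.
    */
   for (; i + 16 <= width; i += 16) {
      uint8x16x4_t px = vld4q_u8(src + i * 4);
      uint8x16_t b = px.val[0];
      px.val[0] = px.val[2];
      px.val[2] = b;
      vst4q_u8(dst + i * 4, px);
   }
#endif

   /* Scalar path: the tail of the row on NEON, the whole row otherwise. */
   for (; i < width; i++) {
      const uint8_t *s = src + i * 4;
      uint8_t *d = dst + i * 4;
      uint8_t b = s[0], g = s[1], r = s[2], a = s[3];

      d[0] = r;
      d[1] = g;
      d[2] = b;
      d[3] = a;
   }
}
```

The same intrinsics exist on armv7 with NEON, so a sketch like this would not be aarch64-specific.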
- Should we bake the generic and optimized tables together using call_once()?
- Hook it up on armv7 too.
- SSE version? (could help BXT since it's !LLC, and we're going to have BXT in CI soon)
- Does piglit have any hot unpack functions?
- Do apps have any hot pack functions for texture upload?
- Fix softpipe texturing regression from the unpack row change.