    vc4: Use NEON to speed up utile loads on Pi2. · 4d300242
    Emma Anholt authored
    We had a lot of memcpy call overhead because gpu_stride wasn't being
    inlined.  But if you split out the stride==8 and stride==16 cases like
    this code does while still using memcpy, you'd no longer get glibc's
    NEON memcpy, at which point we'd be doing 16 uncached reads instead of
    64/(NEON memcpy granularity), for about a 30% performance hit.  By
    hand-writing the assembly, we can load a whole cacheline at a time.
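
    As a rough illustration of the approach described above (not the
    actual patch), the sketch below copies one 64-byte utile into a linear
    CPU buffer.  The function name load_utile_sketch, the UTILE_SIZE
    macro, the parameter names and the preprocessor guard are assumptions
    made for this example.  On an ARMv7 NEON build the stride==8 case
    uses a single vldm to pull in the whole utile; everywhere else it
    falls back to fixed-size memcpy calls, which is the slower variant
    the message compares against.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A utile is taken to be 64 bytes of GPU memory, laid out as rows of
 * gpu_stride bytes (8 or 16).  These names are illustrative only. */
#define UTILE_SIZE 64

/* Hypothetical helper: copy one utile from GPU memory into a linear CPU
 * image whose rows are cpu_stride bytes apart. */
static void
load_utile_sketch(void *cpu, const void *gpu,
                  uint32_t cpu_stride, uint32_t gpu_stride)
{
        assert(gpu_stride == 8 || gpu_stride == 16);

#if defined(__ARM_ARCH_7A__) && defined(__ARM_NEON)
        if (gpu_stride == 8) {
                __asm__ volatile (
                        /* Load the whole 64-byte utile into d0-d7 at once. */
                        "vldm %[gpu], {q0, q1, q2, q3}\n"
                        /* Store one 8-byte row per instruction, advancing
                         * the destination pointer by cpu_stride each time. */
                        "vst1.8 {d0}, [%[cpu]], %[cpu_stride]\n"
                        "vst1.8 {d1}, [%[cpu]], %[cpu_stride]\n"
                        "vst1.8 {d2}, [%[cpu]], %[cpu_stride]\n"
                        "vst1.8 {d3}, [%[cpu]], %[cpu_stride]\n"
                        "vst1.8 {d4}, [%[cpu]], %[cpu_stride]\n"
                        "vst1.8 {d5}, [%[cpu]], %[cpu_stride]\n"
                        "vst1.8 {d6}, [%[cpu]], %[cpu_stride]\n"
                        "vst1.8 {d7}, [%[cpu]], %[cpu_stride]\n"
                        : [cpu] "+r"(cpu)
                        : [gpu] "r"(gpu), [cpu_stride] "r"(cpu_stride)
                        : "q0", "q1", "q2", "q3", "memory");
                return;
        }
        /* The stride==16 case is left to the fallback in this sketch. */
#endif

        /* Portable fallback: constant-size memcpy per row, so the compiler
         * can inline each copy instead of calling out to libc. */
        const uint8_t *src = gpu;
        uint8_t *dst = cpu;

        if (gpu_stride == 8) {
                for (uint32_t row = 0; row < UTILE_SIZE / 8; row++)
                        memcpy(dst + row * cpu_stride, src + row * 8, 8);
        } else {
                for (uint32_t row = 0; row < UTILE_SIZE / 16; row++)
                        memcpy(dst + row * cpu_stride, src + row * 16, 16);
        }
}
```

    Branching on the two known strides keeps every copy size a
    compile-time constant, which is what lets the fallback be inlined.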
    
    Unfortunately, NEON intrinsics turned out to be unusable -- they
    didn't have the vldm instruction available.
    
    Note that, for now, the NEON code is only enabled when building for ARMv7
    (Pi 2+).  In the future, we may want to do runtime detection for the
    Raspbian case.
    
    Improves 1024x1024 GetTexImage by 208.256% +/- 7.07029% (n=10).