gallium/auxiliary: Reduce conversions in u_vbuf_get_minmax_index_mapped

Merged Icecream95 requested to merge icecream95/mesa:optimize-minmax into master Dec 11, 2019

With this patch, GCC generates vectorized code that does the comparisons with the min and max values without converting the indices to 32-bit first.

This optimization makes u_vbuf_get_minmax_index_mapped almost twice as fast on ARM NEON, and should also speed up vectorized code on other platforms.

Without vectorization, the function is still a percent or two faster, but slightly larger.
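
To illustrate the idea, here is a minimal sketch of the two approaches for 16-bit indices. It is not the code from this patch: the function names and signatures are hypothetical, and it assumes a non-empty index array. The first version keeps 32-bit accumulators, so each index has to be widened before it is compared; the second keeps the accumulators in the element type and widens only once, after the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not Mesa's code): 32-bit accumulators force each
 * 16-bit index to be widened before the comparisons. */
static void
minmax_u16_widened(const uint16_t *indices, size_t count,
                   unsigned *out_min, unsigned *out_max)
{
   assert(indices != NULL && count > 0);

   unsigned min = ~0u;
   unsigned max = 0;

   for (size_t i = 0; i < count; i++) {
      if (indices[i] > max) max = indices[i];
      if (indices[i] < min) min = indices[i];
   }

   *out_min = min;
   *out_max = max;
}

/* Hypothetical helper (not Mesa's code): accumulators have the element
 * type, so the loop can stay in 16-bit lanes when vectorized. The results
 * are widened once, after the loop. */
static void
minmax_u16_narrow(const uint16_t *indices, size_t count,
                  unsigned *out_min, unsigned *out_max)
{
   assert(indices != NULL && count > 0);

   uint16_t min = UINT16_MAX;
   uint16_t max = 0;

   for (size_t i = 0; i < count; i++) {
      if (indices[i] > max) max = indices[i];
      if (indices[i] < min) min = indices[i];
   }

   *out_min = min;
   *out_max = max;
}
```

For any non-empty input both helpers return the same values; whether a given compiler actually vectorizes either loop depends on the compiler version and flags.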

Perf data from running vblank_mode=0 neverball -l data/map-paxed3/hypnos.sol for one minute, with Mesa compiled with CFLAGS='-mfpu=neon' and --buildtype=release:

Before:

     9.79%  neverball       rockchip_dri.so            [.] panfrost_flush_all_batches
     8.77%  neverball       rockchip_dri.so            [.] u_vbuf_get_minmax_index_mapped
     5.08%  neverball       libc-2.29.so               [.] memcpy
     3.41%  neverball       libgcc_s.so.1              [.] __udivsi3

After:

    11.69%  neverball       rockchip_dri.so            [.] panfrost_flush_all_batches
     5.20%  neverball       libc-2.29.so               [.] memcpy
     4.78%  neverball       rockchip_dri.so            [.] u_vbuf_get_minmax_index_mapped
     3.78%  neverball       libgcc_s.so.1              [.] __udivsi3

The old code used ~0u rather than -1 as the initial minimum, so I used the analogous ~((unsigned short)0) for the smaller data types, though it would probably be clearer (if not as correct) to just use -1.
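
As a small sketch of the initial-minimum values discussed above (the helper names are hypothetical and not part of the patch): in C, ~((unsigned short)0) is evaluated after integer promotion, and the result is converted back to the narrow type when it is stored, which yields the all-ones value of that type. Storing -1 in an unsigned type yields the same value.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helpers: each returns the "largest possible" starting value
 * for a running minimum of the given index type. */
static unsigned       init_min_uint(void)   { return ~0u; }
static unsigned short init_min_ushort(void) { return ~((unsigned short)0); }
static unsigned char  init_min_ubyte(void)  { return ~((unsigned char)0); }

/* The -1 spelling: conversion to an unsigned type wraps to its maximum. */
static unsigned short init_min_ushort_alt(void) { return -1; }

/* Sanity check that every spelling gives the maximum of its type. */
static void
check_sentinels(void)
{
   assert(init_min_uint() == UINT_MAX);
   assert(init_min_ushort() == USHRT_MAX);
   assert(init_min_ubyte() == UCHAR_MAX);
   assert(init_min_ushort_alt() == USHRT_MAX);
}
```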

Here is the inner loop for the 4th case (ushort, no primitive_restart):

Before:

<+1156>:  vld1.16   {d18-d19}, [r0]!   ; Load 16-bit values
<+1160>:  vmovl.u16 q10, d18           ; Conversion to 32-bit here
<+1164>:  cmp       r0, lr
<+1168>:  vmovl.u16 q9, d19            ;  and here
<+1172>:  vmax.u32  q12, q10, q9
<+1176>:  vmin.u32  q9, q10, q9
<+1180>:  vmax.u32  q8, q8, q12
<+1184>:  vmin.u32  q11, q11, q9
<+1188>:  bne       <u_vbuf_get_minmax_index_mapped+1156>

After:

<+1180>:  vld1.16   {d16-d17}, [r0]!
<+1184>:  cmp       r0, r12
<+1188>:  vmax.u16  q9, q9, q8
<+1192>:  vmin.u16  q10, q10, q8
<+1196>:  bne       <u_vbuf_get_minmax_index_mapped+1180>
Source branch: optimize-minmax