Skip to content
Snippets Groups Projects
  1. Jun 05, 2019
  2. Jan 04, 2019
  3. May 12, 2018
  4. Feb 07, 2018
    • Clement Courbet's avatar
      lib: optimize cpumask_next_and() · 0ade34c3
      Clement Courbet authored
      We've measured that we spend ~0.6% of sys cpu time in cpumask_next_and().
      It's essentially a joined iteration in search for a non-zero bit, which is
      currently implemented as a lookup join (find a nonzero bit on the lhs,
      lookup the rhs to see if it's set there).
      
      Implement a direct join (find a nonzero bit on the incrementally built
      join).  Also add generic bitmap benchmarks in the new `test_find_bit`
      module for new function (see `find_next_and_bit` in [2] and [3] below).
      
      For cpumask_next_and, direct benchmarking shows that it's 1.17x to 14x
      faster with a geometric mean of 2.1 on 32 CPUs [1].  No impact on memory
      usage.  Note that on Arm, the new pure-C implementation still outperforms
      the old one that uses a mix of C and asm (`find_next_bit`) [3].
      
      [1] Approximate benchmark code:
      
      ```
        unsigned long src1p[nr_cpumask_longs] = {pattern1};
        unsigned long src2p[nr_cpumask_longs] = {pattern2};
        for (/*a bunch of repetitions*/) {
          for (int n = -1; n <= nr_cpu_ids; ++n) {
            asm volatile("" : "+rm"(src1p)); // prevent any optimization
            asm volatile("" : "+rm"(src2p));
            unsigned long result = cpumask_next_and(n, src1p, src2p);
            asm volatile("" : "+rm"(result));
          }
        }
      ```
      
      Results:
      pattern1    pattern2     time_before/time_after
      0x0000ffff  0x0000ffff   1.65
      0x0000ffff  0x00005555   2.24
      0x0000ffff  0x00001111   2.94
      0x0000ffff  0x00000000   14.0
      0x00005555  0x0000ffff   1.67
      0x00005555  0x00005555   1.71
      0x00005555  0x00001111   1.90
      0x00005555  0x00000000   6.58
      0x00001111  0x0000ffff   1.46
      0x00001111  0x00005555   1.49
      0x00001111  0x00001111   1.45
      0x00001111  0x00000000   3.10
      0x00000000  0x0000ffff   1.18
      0x00000000  0x00005555   1.18
      0x00000000  0x00001111   1.17
      0x00000000  0x00000000   1.25
      -----------------------------
                     geo.mean  2.06
      
      [2] test_find_next_bit, X86 (skylake)
      
       [ 3913.477422] Start testing find_bit() with random-filled bitmap
       [ 3913.477847] find_next_bit: 160868 cycles, 16484 iterations
       [ 3913.477933] find_next_zero_bit: 169542 cycles, 16285 iterations
       [ 3913.478036] find_last_bit: 201638 cycles, 16483 iterations
       [ 3913.480214] find_first_bit: 4353244 cycles, 16484 iterations
       [ 3913.480216] Start testing find_next_and_bit() with random-filled
       bitmap
       [ 3913.481074] find_next_and_bit: 89604 cycles, 8216 iterations
       [ 3913.481075] Start testing find_bit() with sparse bitmap
       [ 3913.481078] find_next_bit: 2536 cycles, 66 iterations
       [ 3913.481252] find_next_zero_bit: 344404 cycles, 32703 iterations
       [ 3913.481255] find_last_bit: 2006 cycles, 66 iterations
       [ 3913.481265] find_first_bit: 17488 cycles, 66 iterations
       [ 3913.481266] Start testing find_next_and_bit() with sparse bitmap
       [ 3913.481272] find_next_and_bit: 764 cycles, 1 iterations
      
      [3] test_find_next_bit, arm (v7 odroid XU3).
      
      [  267.206928] Start testing find_bit() with random-filled bitmap
      [  267.214752] find_next_bit: 4474 cycles, 16419 iterations
      [  267.221850] find_next_zero_bit: 5976 cycles, 16350 iterations
      [  267.229294] find_last_bit: 4209 cycles, 16419 iterations
      [  267.279131] find_first_bit: 1032991 cycles, 16420 iterations
      [  267.286265] Start testing find_next_and_bit() with random-filled
      bitmap
      [  267.302386] find_next_and_bit: 2290 cycles, 8140 iterations
      [  267.309422] Start testing find_bit() with sparse bitmap
      [  267.316054] find_next_bit: 191 cycles, 66 iterations
      [  267.322726] find_next_zero_bit: 8758 cycles, 32703 iterations
      [  267.329803] find_last_bit: 84 cycles, 66 iterations
      [  267.336169] find_first_bit: 4118 cycles, 66 iterations
      [  267.342627] Start testing find_next_and_bit() with sparse bitmap
      [  267.356919] find_next_and_bit: 91 cycles, 1 iterations
      
      [courbet@google.com: v6]
        Link: http://lkml.kernel.org/r/20171129095715.23430-1-courbet@google.com
      [geert@linux-m68k.org: m68k/bitops: always include <asm-generic/bitops/find.h>]
        Link: http://lkml.kernel.org/r/1512556816-28627-1-git-send-email-geert@linux-m68k.org
      Link: http://lkml.kernel.org/r/20171128131334.23491-1-courbet@google.com
      
      
      Signed-off-by: default avatarClement Courbet <courbet@google.com>
      Signed-off-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Cc: Yury Norov <ynorov@caviumnetworks.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0ade34c3
    • Yury Norov's avatar
      lib/find_bit_benchmark.c: improvements · 15ff67bf
      Yury Norov authored
      As suggested in review comments:
      * printk: align numbers using whitespaces instead of tabs;
      * return error value from init() to avoid calling rmmod if testing again;
      * use ktime_get instead of get_cycles as some arches don't support it;
      
      The output in dmesg (on QEMU arm64):
      [   38.823430] Start testing find_bit() with random-filled bitmap
      [   38.845358] find_next_bit:                20138448 ns, 163968 iterations
      [   38.856217] find_next_zero_bit:           10615328 ns, 163713 iterations
      [   38.863564] find_last_bit:                 7111888 ns, 163967 iterations
      [   40.944796] find_first_bit:             2081007216 ns, 163968 iterations
      [   40.944975]
      [   40.944975] Start testing find_bit() with sparse bitmap
      [   40.945268] find_next_bit:                   73216 ns,    656 iterations
      [   40.967858] find_next_zero_bit:           22461008 ns, 327025 iterations
      [   40.968047] find_last_bit:                   62320 ns,    656 iterations
      [   40.978060] find_first_bit:                9889360 ns,    656 iterations
      
      Link: http://lkml.kernel.org/r/20171124143040.a44jvhmnaiyedg2i@yury-thinkpad
      
      
      Signed-off-by: default avatarYury Norov <ynorov@caviumnetworks.com>
      Tested-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Clement Courbet <courbet@google.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      15ff67bf
    • Yury Norov's avatar
      lib/test_find_bit.c: rename to find_bit_benchmark.c · dceeb3e7
      Yury Norov authored
      As suggested in review comments, rename test_find_bit.c to
      find_bit_benchmark.c.
      
      Link: http://lkml.kernel.org/r/20171124143040.a44jvhmnaiyedg2i@yury-thinkpad
      
      
      Signed-off-by: default avatarYury Norov <ynorov@caviumnetworks.com>
      Tested-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Clement Courbet <courbet@google.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dceeb3e7
  5. Nov 18, 2017
Loading