  8. Sep 25, 2015
    • affine-bench: remove 8e margin from COVER area · 2876d8d3
      Ben Avison authored and Pekka Paalanen committed
      
      Patch "Remove the 8e extra safety margin in COVER_CLIP analysis" reduced
      the required image area for setting the COVER flags in
      pixman.c:analyze_extent(). Do the same reduction in affine-bench.
      
      Leaving the old calculations in place would be very confusing for anyone
      reading the code.
      
      Also add a comment that explains how affine-bench wants to hit the COVER
      paths. This explains why the intricate extent calculations are copied
      from pixman.c.
      
      [Pekka: split patch, change comments, write commit message]
      Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
      Reviewed-by: Ben Avison <bavison@riscosopen.org>
      2876d8d3
    • Remove the 8e extra safety margin in COVER_CLIP analysis · 0e2e9751
      Ben Avison authored and Pekka Paalanen committed
      As discussed in
      http://lists.freedesktop.org/archives/pixman/2015-August/003905.html
      
      the 8 * pixman_fixed_e (8e) adjustment which was applied to the transformed
      coordinates is a legacy of rounding errors that used to occur in old
      versions of Pixman but no longer apply. For any affine transform,
      transforming the upper coordinate directly is now guaranteed to give the
      same result as transforming the lower coordinate and adding (size-1)
      steps of the increment in source coordinate space. No projective
      transform routines use the COVER_CLIP flags, so they cannot be affected.
      
      Proof by Siarhei Siamashka:
      
      Let's take a look at the following affine transformation matrix (with 16.16
      fixed point values) and two vectors:
      
               | a   b     c    |
      M      = | d   e     f    |
               | 0   0  0x10000 |
      
               |  x_dst  |
      P     =  |  y_dst  |
               | 0x10000 |
      
               | 0x10000 |
      ONE_X  = |    0    |
               |    0    |
      
      The current matrix multiplication code does the following calculations:
      
                   | (a * x_dst + b * y_dst + 0x8000) / 0x10000 + c |
          M * P =  | (d * x_dst + e * y_dst + 0x8000) / 0x10000 + f |
                   |                   0x10000                      |
      
      These calculations are not perfectly exact and we may get rounding
      errors, because the integer coordinates are adjusted by 0.5 (or 0x8000
      in the 16.16 fixed point format) before doing the matrix multiplication.
      For example, if the 'a' coefficient is an odd number and 'b' is zero,
      then some of the least significant bits are lost when dividing by
      0x10000.
      
      So we need to strictly prove that the following expression is always
      true even though we have to deal with rounding:
      
                                                | a |
          M * (P + ONE_X) - M * P = M * ONE_X = | d |
                                                | 0 |
      
      or
      
         ((a * (x_dst + 0x10000) + b * y_dst + 0x8000) / 0x10000 + c)
        -
         ((a * x_dst             + b * y_dst + 0x8000) / 0x10000 + c)
        =
          a
      
      It's easy to see that this is equivalent to
      
          a + ((a * x_dst + b * y_dst + 0x8000) / 0x10000 + c)
            - ((a * x_dst + b * y_dst + 0x8000) / 0x10000 + c)
        =
          a
      
      Which means that stepping exactly by one pixel horizontally in the
      destination image space (advancing 'x_dst' by 0x10000) is the same as
      changing the transformed 'x_src' coordinate in the source image space
      exactly by 'a'. The same applies to the vertical direction too.
      Repeating these steps, we can reach any pixel in the source image
      space and get exactly the same fixed point coordinates as doing
      matrix multiplications per each pixel.
      
      By the way, the older matrix multiplication implementation, which
      relied on less accurate calculations with three intermediate roundings
      "((a + 0x8000) >> 16) + ((b + 0x8000) >> 16) + ((c + 0x8000) >> 16)",
      also has the same property. However, reverting
          http://cgit.freedesktop.org/pixman/commit/?id=ed39992564beefe6b12f81e842caba11aff98a9c
      and applying this "Remove the 8e extra safety margin in COVER_CLIP
      analysis" patch makes the cover test fail. The real reason it fails
      is that the old pixman code used the "pixman_transform_point_3d()"
      function
          http://cgit.freedesktop.org/pixman/tree/pixman/pixman-matrix.c?id=pixman-0.28.2#n49
      to get the transformed coordinate of the top left corner pixel in the
      image scaling code, but at the same time used a different function,
      "pixman_transform_point()"
          http://cgit.freedesktop.org/pixman/tree/pixman/pixman-matrix.c?id=pixman-0.28.2#n82
      in the extents calculation code for setting the cover flag. These
      functions performed the intermediate rounding differently, which is
      why the 8e safety margin was needed.
      
      ** proof ends
      
      However, for COVER_CLIP_NEAREST the actual margins added were not 8e.
      Because half-way cases round down (coordinate 0 hits pixel index -1
      while coordinate e hits pixel index 0), the extra safety margins were
      actually 7e to the left and up, and 9e to the right and down. This
      patch removes the 7e and 9e margins and restores the -e adjustment
      required for NEAREST sampling in Pixman. For reference, see
      pixman/rounding.txt.
      
      For COVER_CLIP_BILINEAR, the margins were exactly 8e as there are no
      additional offsets to be restored, so simply removing the 8e additions
      is enough.
      
      Proof:
      
      All implementations must give the same numerical results as
      bits_image_fetch_pixel_nearest() / bits_image_fetch_pixel_bilinear().
      
      The former does
          int x0 = pixman_fixed_to_int (x - pixman_fixed_e);
      which maps directly to the new test for the nearest flag, when you consider
      that x0 must fall in the interval [0,width).
      
      The latter does
          x1 = x - pixman_fixed_1 / 2;
          x1 = pixman_fixed_to_int (x1);
          x2 = x1 + 1;
      When you write a COVER path, you take advantage of the assumption that
      both x1 and x2 fall in the interval [0, width).
      
      As samplers are allowed to fetch the pixel at x2 unconditionally, we
      require
          x1 >= 0
          x2 < width
      so
          x - pixman_fixed_1 / 2 >= 0
          x - pixman_fixed_1 / 2 + pixman_fixed_1 < width * pixman_fixed_1
      so
          pixman_fixed_to_int (x - pixman_fixed_1 / 2) >= 0
          pixman_fixed_to_int (x + pixman_fixed_1 / 2) < width
      which matches the source code lines for the bilinear case, once you delete
      the lines that add the 8e margin.
      
      Signed-off-by: Ben Avison <bavison@riscosopen.org>
      [Pekka: adjusted commit message, left affine-bench changes for another patch]
      [Pekka: add commit message parts from Siarhei]
      Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
      Reviewed-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
      Reviewed-by: Ben Avison <bavison@riscosopen.org>
      0e2e9751
    • pixman-general: Tighten up calculation of temporary buffer sizes · 23525b4e
      Ben Avison authored and Pekka Paalanen committed
      
      Each of the aligns can only add a maximum of 15 bytes to the space
      requirement. This permits some edge cases to use the stack buffer where
      previously the code would have concluded that a heap buffer was
      required.
      
      Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
      23525b4e
  12. Sep 17, 2015
    • armv6: enable over_n_8888 · 73e586ef
      Pekka Paalanen authored
      
      Enable the fast path added in the previous patch by moving the lookup
      table entries to their proper locations.
      
      Lowlevel-blt-bench benchmark statistics with 30 iterations, showing the
      effect of adding this one patch on top of
      "armv6: Add over_n_8888 fast path (disabled)", which was applied on
      fd595692.
      
             Before          After
            Mean StdDev     Mean StdDev   Confidence   Change
      L1    12.5   0.04     45.2   0.10    100.00%    +263.1%
      L2    11.1   0.02     43.2   0.03    100.00%    +289.3%
      M      9.4   0.00     42.4   0.02    100.00%    +351.7%
      HT     8.5   0.02     25.4   0.10    100.00%    +198.8%
      VT     8.4   0.02     22.3   0.07    100.00%    +167.0%
      R      8.2   0.02     23.1   0.09    100.00%    +183.6%
      RT     5.4   0.05     11.4   0.21    100.00%    +110.3%
      
      At most 3 outliers rejected per test per set.
      
      Iterating here means that lowlevel-blt-bench was executed 30 times, and
      the statistics above were computed from the output.
      
      Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
      73e586ef
    • armv6: Add over_n_8888 fast path (disabled) · 9eb6889b
      Ben Avison authored and Pekka Paalanen committed
      
      This new fast path is initially disabled by putting the entries in the
      lookup table after the sentinel. The compiler cannot tell the new code
      is not used, so it cannot eliminate the code. Also the lookup table size
      will include the new fast path. When the follow-up patch then enables
      the new fast path, the binary layout (alignments, size, etc.) will stay
      the same compared to the disabled case.
      
      Keeping the binary layout identical is important for benchmarking on
      Raspberry Pi 1. The addresses at which functions are loaded will have a
      significant impact on benchmark results, causing unexpected performance
      changes. Keeping all function addresses the same across the patch
      enabling a new fast path improves the reliability of benchmarks.
      
      Benchmark results are included in the patch enabling this fast path.
      
      [Pekka: disabled the fast path, commit message]
      Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
      9eb6889b
  13. Sep 16, 2015
    • test: Add cover-test v5 · 4c71f595
      Ben Avison authored and Pekka Paalanen committed
      
      This test aims to verify both numerical correctness and the honouring of
      array bounds for scaled plots (both nearest-neighbour and bilinear) at or
      close to the boundary conditions for applicability of "cover" type fast paths
      and iter fetch routines.
      
      It has a secondary purpose: by setting the env var EXACT (to any value) it
      will only test plots that are exactly on the boundary condition. This makes
      it possible to ensure that "cover" routines are being used to the maximum,
      although this requires the use of a debugger or code instrumentation to
      verify.
      
      Changes in v4:
      
        Check the fence page size and skip the test if it is too large. Since
        we need to deal with pixman_fixed_t coordinates that go beyond the
        real image width, make the page size limit 16 kB. A 32 kB or larger
        page size would cause an a8 image width to be 32k or more, which is no
        longer representable in pixman_fixed_t.
      
        Use a shorthand variable 'filter' in test_cover().
      
        Whitespace adjustments.
      
      Changes in v5:
      
        Skip if fenced memory is not supported. Do you know of any such
        platform?
      
      Signed-off-by: Ben Avison <bavison@riscosopen.org>
      [Pekka: changes in v4 and v5]
      Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
      Reviewed-by: Ben Avison <bavison@riscosopen.org>
      Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
      4c71f595
  15. Sep 03, 2015
    • test: add fence-image-self-test · 07006853
      Pekka Paalanen authored
      
      Tests that fence_malloc and fence_image_create_bits actually work: that
      out-of-bounds and out-of-row (unused stride area) accesses trigger
      SIGSEGV.
      
      If fence_malloc is a dummy (FENCE_MALLOC_ACTIVE not defined), this test
      is skipped.
      
      Changes in v2:
      
      - check FENCE_MALLOC_ACTIVE value, not whether it is defined
      - test that reading bytes near the fence pages does not cause a
        segmentation fault
      
      Changes in v3:
      
      - Do not print progress messages unless VERBOSE environment variable is
        set. Avoid spamming the terminal output of 'make check' on some
        versions of autotools.
      
      Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
      Reviewed-by: Ben Avison <bavison@riscosopen.org>
      07006853
  20. Jul 16, 2015
    • vmx: implement fast path iterator vmx_fetch_a8 · 8d9be361
      Oded Gabbay authored
      
      No changes were observed when running the Cairo trimmed benchmarks.
      
      Running "lowlevel-blt-bench src_8_8888" on POWER8, 8 cores,
      3.4GHz, RHEL 7.1 ppc64le gave the following results:
      
      reference memcpy speed = 25197.2MB/s (6299.3MP/s for 32bpp fills)
      
                      Before          After           Change
                    --------------------------------------------
      L1              965.34          3936           +307.73%
      L2              942.99          3436.29        +264.40%
      M               902.24          2757.77        +205.66%
      HT              448.46          784.99         +75.04%
      VT              430.05          819.78         +90.62%
      R               412.9           717.04         +73.66%
      RT              168.93          220.63         +30.60%
      Kops/s          1025            1303           +27.12%
      
      It was benchmarked against commit id e2d211ac from pixman/master.
      
      Siarhei Siamashka reported that on playstation3, it shows the following
      results:
      
      == before ==
      
                    src_8_8888 =  L1: 194.37  L2: 198.46  M:155.90 (148.35%)
                    HT: 59.18  VT: 36.71  R: 38.93  RT: 12.79 ( 106Kops/s)
      
      == after ==
      
                    src_8_8888 =  L1: 373.96  L2: 391.10  M:245.81 (233.88%)
                    HT: 80.81  VT: 44.33  R: 48.10  RT: 14.79 ( 122Kops/s)
      
      Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
      Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
      8d9be361
    • vmx: implement fast path iterator vmx_fetch_x8r8g8b8 · 47f74ca9
      Oded Gabbay authored
      
      It was benchmarked against commit id 2be523b2 from pixman/master.
      
      POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.
      
      Cairo trimmed benchmarks:
      
      Speedups
      ========
      t-firefox-asteroids  533.92  -> 489.94 :  1.09x
      
      Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
      Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
      47f74ca9
    • vmx: implement fast path scaled nearest vmx_8888_8888_OVER · fcbb97d4
      Oded Gabbay authored
      
      It was benchmarked against commit id 2be523b2 from pixman/master.
      
      POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.
      reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)
      
                      Before           After           Change
                    ---------------------------------------------
      L1              134.36          181.68          +35.22%
      L2              135.07          180.67          +33.76%
      M               134.6           180.51          +34.11%
      HT              121.77          128.79          +5.76%
      VT              120.49          145.07          +20.40%
      R               93.83           102.3           +9.03%
      RT              50.82           46.93           -7.65%
      Kops/s          448             422             -5.80%
      
      Cairo trimmed benchmarks:
      
      Speedups
      ========
      t-firefox-asteroids  533.92 -> 497.92 :  1.07x
          t-midori-zoomed  692.98 -> 651.24 :  1.06x
      
      Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
      Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
      fcbb97d4
    • vmx: implement fast path vmx_composite_src_x888_8888 · ad612c42
      Oded Gabbay authored
      
      It was benchmarked against commit id 2be523b2 from pixman/master.
      
      POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.
      reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)
      
                      Before           After           Change
                    ---------------------------------------------
      L1              1115.4          5006.49         +348.85%
      L2              1112.26         4338.01         +290.02%
      M               1110.54         2524.15         +127.29%
      HT              745.41          1140.03         +52.94%
      VT              749.03          1287.13         +71.84%
      R               423.91          547.6           +29.18%
      RT              205.79          194.98          -5.25%
      Kops/s          1414            1361            -3.75%
      
      Cairo trimmed benchmarks:
      
      Speedups
      ========
      t-gnome-system-monitor  1402.62  -> 1212.75 :  1.16x
         t-firefox-asteroids   533.92  ->  474.50 :  1.13x
      
      Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
      Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
      ad612c42
    • vmx: implement fast path vmx_composite_over_n_8888_8888_ca · fafc1d40
      Oded Gabbay authored
      
      It was benchmarked against commit id 2be523b2 from pixman/master.
      
      POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.
      
      reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)
      
                      Before           After           Change
                    ---------------------------------------------
      L1              61.92            244.91          +295.53%
      L2              62.74            243.3           +287.79%
      M               63.03            241.94          +283.85%
      HT              59.91            144.22          +140.73%
      VT              59.4             174.39          +193.59%
      R               53.6             111.37          +107.78%
      RT              37.99            46.38           +22.08%
      Kops/s          436              506             +16.06%
      
      Cairo trimmed benchmarks:
      
      Speedups
      ========
      t-xfce4-terminal-a1  1540.37 -> 1226.14 :  1.26x
      t-firefox-talos-gfx  1488.59 -> 1209.19 :  1.23x
      
      Slowdowns
      =========
              t-evolution  553.88  -> 581.63  :  1.05x
                t-poppler  364.99  -> 383.79  :  1.05x
      t-firefox-scrolling  1223.65 -> 1304.34 :  1.07x
      
      The slowdowns occur in cases where the images are small and unaligned
      to a 16-byte boundary. In that case the function first works on the
      unaligned area, even in operations of 1 byte. For small images, the
      overhead of these operations can exceed the savings we get from using
      the VMX instructions on the aligned part of the image.
      
      In the C fast-path implementation there is no special treatment for the
      unaligned part, as it works in 4-byte quantities on the entire image.
      
      Because llbb is a synthetic test, I would assume it has far fewer
      alignment issues than "real-world" scenarios such as the Cairo
      benchmarks, which are essentially recorded traces of real application
      activity.
      
      Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
      Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
      fafc1d40
    • vmx: implement fast path composite_add_8888_8888 · a3e91440
      Oded Gabbay authored
      
      Copied the implementation from the sse2 file and edited it to use VMX
      functions.
      
      It was benchmarked against commit id 2be523b2 from pixman/master.
      
      POWER8, 16 cores, 3.4GHz, ppc64le :
      
      reference memcpy speed = 27036.4MB/s (6759.1MP/s for 32bpp fills)
      
                      Before           After           Change
                    ---------------------------------------------
      L1              248.76          3284.48         +1220.34%
      L2              264.09          2826.47         +970.27%
      M               261.24          2405.06         +820.63%
      HT              217.27          857.3           +294.58%
      VT              213.78          980.09          +358.46%
      R               176.61          442.95          +150.81%
      RT              107.54          150.08          +39.56%
      Kops/s          917             1125            +22.68%
      
      Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
      Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
      a3e91440
    • vmx: implement fast path composite_add_8_8 · d5b5343c
      Oded Gabbay authored
      
      Copied the implementation from the sse2 file and edited it to use VMX
      functions.
      
      It was benchmarked against commit id 2be523b2 from pixman/master.
      
      POWER8, 16 cores, 3.4GHz, ppc64le :
      
      reference memcpy speed = 27036.4MB/s (6759.1MP/s for 32bpp fills)
      
                      Before           After           Change
                    ---------------------------------------------
      L1              687.63          9140.84         +1229.33%
      L2              715             7495.78         +948.36%
      M               717.39          8460.14         +1079.29%
      HT              569.56          1020.12         +79.11%
      VT              520.3           1215.56         +133.63%
      R               514.81          874.35          +69.84%
      RT              341.28          305.42          -10.51%
      Kops/s          1621            1579            -2.59%
      
      Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
      Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
      d5b5343c