  1. Nov 02, 2013
  2. Oct 17, 2013
    • pixman-glyph.c: Add __force_align_arg_pointer to composite functions · 3c2f4b65
      Søren Sandmann Pedersen authored
      The functions pixman_composite_glyphs_no_mask() and
      pixman_composite_glyphs() can call into code compiled with -msse2,
      which requires the stack to be aligned to 16 bytes. Since the ABIs on
      Windows and Linux for x86-32 don't provide this guarantee, we need to
      use this attribute to make GCC generate a prologue that realigns the
      stack.
      
      This fixes the crash introduced in the previous commit and also
      
         https://bugs.freedesktop.org/show_bug.cgi?id=70348
      
      and
      
         https://bugs.freedesktop.org/show_bug.cgi?id=68300
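
      A minimal sketch of how such an attribute might be applied. The macro
      FORCE_ALIGN_STACK and the function sum_bytes() are hypothetical names
      chosen for illustration, not pixman's own; the only real piece is GCC's
      force_align_arg_pointer attribute, enabled here only for 32-bit x86.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical macro: ask the compiler to realign the stack on entry,
 * but only on 32-bit x86 with a GCC-compatible compiler, where callers
 * may guarantee no more than 4-byte stack alignment. Elsewhere it
 * expands to nothing. */
#if defined(__GNUC__) && defined(__i386__)
#define FORCE_ALIGN_STACK __attribute__((force_align_arg_pointer))
#else
#define FORCE_ALIGN_STACK
#endif

/* Hypothetical public entry point. In a real library this is the kind
 * of function that may call into code built with -msse2. */
FORCE_ALIGN_STACK
uint32_t
sum_bytes (const uint8_t *data, size_t n)
{
    uint32_t total = 0;
    size_t i;

    assert (data != NULL || n == 0);

    for (i = 0; i < n; ++i)
        total += data[i];

    return total;
}
```

      The attribute only matters at entry points that outside code can call
      with a misaligned stack; internal helpers inherit the realigned stack.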
    • utils.c: On x86-32 unalign the stack before calling test_function · 3dce2297
      Søren Sandmann Pedersen authored
      When compiling with -msse2 and -mssse3, GCC assumes that the stack
      is aligned to 16 bytes even on x86-32 and accordingly issues movdqa
      instructions for stack-allocated variables.
      
      But despite what GCC thinks, the standard ABI on x86-32 only requires
      a 4-byte aligned stack. This is true at least on Windows, but there
      also was (and maybe still is) Linux code in the wild that assumed
      this. When such code calls into pixman and hits something compiled
      with -msse2, we get a segfault from the unaligned movdqas.
      
      Pixman has worked around this issue in the past with the gcc attribute
      "force_align_arg_pointer" but the problem has resurfaced now in
      
          https://bugs.freedesktop.org/show_bug.cgi?id=68300
      
      because pixman_composite_glyphs() is missing this attribute.
      
      This patch makes fuzzer_test_main() call the test_function through a
      trampoline, which, on x86-32, has a bit of assembly that deliberately
      avoids aligning the stack to 16 bytes as GCC normally expects. The
      result is that glyph-test now crashes.
      
      V2: Mark caller-save registers as clobbered, rather than using
      noinline on the trampoline.
  3. Oct 13, 2013
    • configure.ac: check and use -Wdeclaration-after-statement GCC option · 9e81419e
      Siarhei Siamashka authored
      The accidental use of declaration after statement breaks compilation
      with C89 compilers such as MSVC. Assuming that MSVC is one of the
      supported compilers, it makes sense to ask GCC to at least report
      warnings for such problematic code.
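
      To illustrate what the warning is about, here is a small sketch of a
      function written so that every declaration sits at the start of a
      block, as C89 requires. The function sum_of_squares() is a hypothetical
      example, not code from pixman.

```c
#include <assert.h>

/* Hypothetical helper written in the C89-compatible style. */
int
sum_of_squares (const int *values, int n)
{
    /* C89: all declarations come first in the block. */
    int total = 0;
    int i;

    assert (n >= 0);

    /* Declaring a new variable at this point, after the assert statement,
     * would be a "declaration after statement": valid C99, reported by
     * -Wdeclaration-after-statement, and rejected by a C89 compiler. */
    for (i = 0; i < n; ++i)
    {
        int v = values[i];    /* fine: this is the start of a new block */

        total += v * v;
    }

    return total;
}
```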
    • sse2: bilinear fast path for src_x888_8888 · a863bbcc
      Siarhei Siamashka authored
      Running cairo-perf-trace benchmark on Intel Core2 T7300:
      
      Before:
      [  0]    image    t-firefox-canvas-swscroll    1.989    2.008   0.43%    8/8
      [  1]    image        firefox-canvas-scroll    4.574    4.609   0.50%    8/8
      
      After:
      [  0]    image    t-firefox-canvas-swscroll    1.404    1.418   0.51%    8/8
      [  1]    image        firefox-canvas-scroll    4.228    4.259   0.36%    8/8
  4. Oct 12, 2013
    • configure.ac: Add check for pmulhuw assembly · 8f75f638
      Søren Sandmann Pedersen authored
      Clang 3.0 chokes on the following bit of assembly
      
          asm ("pmulhuw %1, %0\n\t"
              : "+y" (__A)
              : "y" (__B)
          );
      
      from pixman-mmx.c with this error message:
      
          fatal error: error in backend: Unsupported asm: input constraint
              with a matching output constraint of incompatible type!
      
      So add a check in configure to only enable MMX when the compiler can
      deal with it.
    • scale.c: Use int instead of kernel_t for values in named_int_t · 09a62d4d
      Søren Sandmann Pedersen authored
      The 'value' field in the 'named_int_t' struct is used for both
      pixman_repeat_t and pixman_kernel_t values, so the type should be int,
      not pixman_kernel_t.
      
      Fixes some warnings like this
      
      scale.c:124:33: warning: implicit conversion from enumeration
            type 'pixman_repeat_t' to different enumeration type
            'pixman_kernel_t' [-Wconversion]
          { "None",                   PIXMAN_REPEAT_NONE },
          ~                           ^~~~~~~~~~~~~~~~~~
      
      when compiled with clang.
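
      A sketch of the idea, under the assumption that the struct is a simple
      name/value pair. The enum types and tables below are hypothetical
      stand-ins; only the name named_int_t and its 'value' field come from
      the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for two unrelated enumeration types. */
typedef enum { REPEAT_NONE, REPEAT_NORMAL } repeat_t;
typedef enum { KERNEL_BOX, KERNEL_LINEAR } kernel_t;

/* The value is a plain int, so one struct type can hold enumerators of
 * either enum without an implicit enum-to-enum conversion. */
typedef struct
{
    const char *name;
    int         value;
} named_int_t;

static const named_int_t repeats[] =
{
    { "None",   REPEAT_NONE },
    { "Normal", REPEAT_NORMAL },
};

static const named_int_t kernels[] =
{
    { "Box",    KERNEL_BOX },
    { "Linear", KERNEL_LINEAR },
};

/* Return the value stored under 'name', or -1 if it is not in the table. */
int
lookup_named_int (const named_int_t *table, size_t n, const char *name)
{
    size_t i;

    assert (table != NULL && name != NULL);

    for (i = 0; i < n; ++i)
    {
        if (strcmp (table[i].name, name) == 0)
            return table[i].value;
    }

    return -1;
}
```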
    • pixman-combine32.c: Make Color Burn routine follow the math more closely · 93672438
      Søren Sandmann Pedersen authored
      For superluminescent destinations, the old code could underflow in
      
          uint32_t r = (ad - d) * as / s;
      
      when (ad - d) was negative. The new code avoids this problem (and
      therefore causes changes in the checksums of thread-test and
      blitters-test), but it is likely still buggy due to the use of
      unsigned variables and other issues in the blend mode code.
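
      The underflow can be reproduced in isolation. The two functions below
      are hypothetical and only evaluate the single expression quoted above,
      once with unsigned and once with signed 32-bit variables; they are not
      the blend routines themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned arithmetic: when d > ad, (ad - d) wraps around modulo 2^32,
 * so the result is a large positive number instead of a negative one. */
uint32_t
burn_term_unsigned (uint32_t ad, uint32_t d, uint32_t as, uint32_t s)
{
    assert (s != 0);
    return (ad - d) * as / s;
}

/* Signed arithmetic: the same expression can become negative, which is
 * what a superluminescent destination (d > ad) calls for. The inputs
 * are assumed to be small channel values, so nothing overflows. */
int32_t
burn_term_signed (int32_t ad, int32_t d, int32_t as, int32_t s)
{
    assert (s != 0);
    return (ad - d) * as / s;
}
```

      For example, with ad = 100, d = 120, as = 255 and s = 128, the signed
      version yields a small negative value, while the unsigned version
      yields a value far outside the 8-bit channel range.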
    • pixman-combine32: Make Color Dodge routine follow the math more closely · 105fa74f
      Søren Sandmann Pedersen authored
      Change blend_color_dodge() to follow the math in the comment more
      closely.
      
      Note, the new code here is in some sense worse than the old code
      because it can now underflow the unsigned variables when the source is
      superluminescent and (as - s) is therefore negative. The old code was
      careful to clamp to 0.
      
      But for superluminescent variables we really need the ability for the
      blend function to become negative, and so the solution to the underflow
      problem is to just use signed variables. The use of unsigned variables
      is a general problem in all of the blend mode code that will have to
      be solved later.
      
      The CRC32 values in thread-test and blitters-test are updated to
      account for the changes in output.
    • pixman-combine32: Rename a number of variables from sa/sca to as/s · 2527a724
      Søren Sandmann Pedersen authored
      There are no semantic changes, just variable renames. The motivation
      for these renames is so that the names are shorter and better match
      the ones used in the comments.
    • pixman-combine32: Improve documentation for blend mode operators · eaa4778c
      Søren Sandmann Pedersen authored
      This commit overhauls the comments in pixman-combine32.c regarding
      blend modes:
      
      - Add a link to the PDF supplement that clarifies the specification of
        ColorBurn and ColorDodge
      
      - Clarify how the formulas for premultiplied colors are derived from
        the ones in the PDF specifications
      
      - Write out the derivation of the formulas in each blend routine
    • pixman-combine32.c: Formatting fixes · 4bf1502f
      Søren Sandmann Pedersen authored
      Fix a bunch of spacing issues.
      
      V2: More spacing issues, in the _ca combiners
  5. Oct 09, 2013
    • Fix thread-test on non-OpenMP systems · 54be1a52
      Andrea Canciani authored
      The non-reentrant versions of prng_* functions are thread-safe only in
      OpenMP-enabled builds.
      
      Fixes thread-test failing when compiled with Clang (both on Linux and
      on MacOS).
    • Add support for SSSE3 to the MSVC build system · 0af2fcae
      Andrea Canciani authored
      Handle SSSE3 just like MMX and SSE2.
    • Fix build of check-formats on MSVC · e4d9c623
      Andrea Canciani authored
      Fixes
      
      check-formats.obj : error LNK2019: unresolved external symbol
      _strcasecmp referenced in function _format_from_string
      
      check-formats.obj : error LNK2019: unresolved external symbol
      _snprintf referenced in function _list_operators
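
      One common way to resolve link errors like these is to map the missing
      names to their MSVC equivalents. The sketch below shows that general
      technique; it is an assumption about how such a fix can look, not a
      copy of this commit, and the helpers name_matches() and format_entry()
      are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#if defined(_MSC_VER)
/* MSVC has no strcasecmp; _stricmp is its case-insensitive compare. */
#define strcasecmp _stricmp
#if _MSC_VER < 1900
/* Before Visual Studio 2015 only _snprintf exists. Unlike snprintf, it
 * does not NUL-terminate the buffer on truncation. */
#define snprintf _snprintf
#endif
#else
#include <strings.h>    /* strcasecmp on POSIX systems */
#endif

/* Hypothetical helper: case-insensitive comparison of two names. */
int
name_matches (const char *a, const char *b)
{
    assert (a != NULL && b != NULL);
    return strcasecmp (a, b) == 0;
}

/* Hypothetical helper: format "name=value" into buf, always terminated. */
int
format_entry (char *buf, size_t size, const char *name, int value)
{
    int n;

    assert (buf != NULL && size > 0);

    n = snprintf (buf, size, "%s=%d", name, value);
    buf[size - 1] = '\0';    /* covers the _snprintf truncation case */

    return n;
}
```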
    • Fix building of "other" programs on MSVC · 96ad6ebd
      Andrea Canciani authored
      In d1434d11 the benchmarks have been
      extended to include other programs as well and the variable names have
      been updated accordingly in the autotools-based build system, but not
      in the MSVC one.
    • Fix build on MSVC · 31ac784f
      Andrea Canciani authored
      After a4c79d69 the MMX and SSE2 code
      has some declarations after the beginning of a block, which is not
      allowed by MSVC.
      
      Fixes multiple errors like:
      
      pixman-mmx.c(3625) : error C2275: '__m64' : illegal use of this type
      as an expression
      
      pixman-sse2.c(5708) : error C2275: '__m128i' : illegal use of this
      type as an expression
  6. Oct 04, 2013
  7. Oct 01, 2013
    • vmx: there is no need to handle unaligned destination anymore · 7d05a7f4
      Siarhei Siamashka authored
      So the redundant variables, memory reads/writes and reshuffles
      can be safely removed. For example, this makes the inner loop
      of the 'vmx_combine_add_u_no_mask' function much simpler.
      
      Before:
      
          7a20:  7d a8 48 ce   lvx     v13,r8,r9
          7a24:  7d 80 48 ce   lvx     v12,r0,r9
          7a28:  7d 28 50 ce   lvx     v9,r8,r10
          7a2c:  7c 20 50 ce   lvx     v1,r0,r10
          7a30:  39 4a 00 10   addi    r10,r10,16
          7a34:  10 0d 62 eb   vperm   v0,v13,v12,v11
          7a38:  10 21 4a 2b   vperm   v1,v1,v9,v8
          7a3c:  11 2c 6a eb   vperm   v9,v12,v13,v11
          7a40:  10 21 4a 00   vaddubs v1,v1,v9
          7a44:  11 a1 02 ab   vperm   v13,v1,v0,v10
          7a48:  10 00 0a ab   vperm   v0,v0,v1,v10
          7a4c:  7d a8 49 ce   stvx    v13,r8,r9
          7a50:  7c 00 49 ce   stvx    v0,r0,r9
          7a54:  39 29 00 10   addi    r9,r9,16
          7a58:  42 00 ff c8   bdnz+   7a20 <.vmx_combine_add_u_no_mask+0x120>
      
      After:
      
          76c0:  7c 00 48 ce   lvx     v0,r0,r9
          76c4:  7d a8 48 ce   lvx     v13,r8,r9
          76c8:  39 29 00 10   addi    r9,r9,16
          76cc:  7c 20 50 ce   lvx     v1,r0,r10
          76d0:  10 00 6b 2b   vperm   v0,v0,v13,v12
          76d4:  10 00 0a 00   vaddubs v0,v0,v1
          76d8:  7c 00 51 ce   stvx    v0,r0,r10
          76dc:  39 4a 00 10   addi    r10,r10,16
          76e0:  42 00 ff e0   bdnz+   76c0 <.vmx_combine_add_u_no_mask+0x120>
    • vmx: align destination to fix valgrind invalid memory writes · b6c5ba06
      Siarhei Siamashka authored
      The SIMD optimized inner loops in the VMX/Altivec code are trying
      to emulate unaligned accesses to the destination buffer. For each
      4 pixels (which fit into a 128-bit register) the current
      implementation:
        1. first performs two aligned reads, which cover the needed data
        2. reshuffles bytes to get the needed data in a single vector register
        3. does all the necessary calculations
        4. reshuffles bytes back to their original location in two registers
        5. performs two aligned writes back to the destination buffer
      
      Unfortunately, when the destination buffer is unaligned and the
      width is a perfect multiple of 4 pixels, some writes may cross the
      boundaries of the destination buffer. In a multithreaded
      environment this may corrupt data outside of the destination
      buffer if that data is concurrently read and written by another
      thread.
      
      The valgrind report for blitters-test is full of:
      
      ==23085== Invalid write of size 8
      ==23085==    at 0x1004B0B4: vmx_combine_add_u (pixman-vmx.c:1089)
      ==23085==    by 0x100446EF: general_composite_rect (pixman-general.c:214)
      ==23085==    by 0x10002537: test_composite (blitters-test.c:363)
      ==23085==    by 0x1000369B: fuzzer_test_main._omp_fn.0 (utils.c:733)
      ==23085==    by 0x10004943: fuzzer_test_main (utils.c:728)
      ==23085==    by 0x10002C17: main (blitters-test.c:397)
      ==23085==  Address 0x5188218 is 0 bytes after a block of size 88 alloc'd
      ==23085==    at 0x4051DA0: memalign (vg_replace_malloc.c:581)
      ==23085==    by 0x4051E7B: posix_memalign (vg_replace_malloc.c:709)
      ==23085==    by 0x10004CFF: aligned_malloc (utils.c:833)
      ==23085==    by 0x10001DCB: create_random_image (blitters-test.c:47)
      ==23085==    by 0x10002263: test_composite (blitters-test.c:283)
      ==23085==    by 0x1000369B: fuzzer_test_main._omp_fn.0 (utils.c:733)
      ==23085==    by 0x10004943: fuzzer_test_main (utils.c:728)
      ==23085==    by 0x10002C17: main (blitters-test.c:397)
      
      This patch addresses the problem by first aligning the destination
      buffer at a 16 byte boundary in each combiner function. This trick
      is borrowed from the pixman SSE2 code.
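
      The alignment trick can be sketched in plain C. This is a scalar
      illustration under stated assumptions, not the VMX code: the function
      names are hypothetical, and the 4-pixel body loop stands in for what
      would be one aligned vector load and one aligned vector store.

```c
#include <assert.h>
#include <stdint.h>

/* Per-channel saturating add of two 32-bit pixels (four 8-bit channels). */
static uint32_t
add_saturate_pixel (uint32_t s, uint32_t d)
{
    uint32_t result = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8)
    {
        uint32_t sum = ((s >> shift) & 0xff) + ((d >> shift) & 0xff);

        if (sum > 0xff)
            sum = 0xff;
        result |= sum << shift;
    }

    return result;
}

/* Hypothetical combiner: dest[i] = saturate (src[i] + dest[i]). */
void
combine_add_aligned (uint32_t *dest, const uint32_t *src, int width)
{
    int i;

    /* Head: single pixels until dest reaches a 16-byte boundary. */
    while (width > 0 && ((uintptr_t) dest & 15) != 0)
    {
        *dest = add_saturate_pixel (*src, *dest);
        dest++;
        src++;
        width--;
    }

    /* Body: dest is 16-byte aligned here, so a SIMD version needs only
     * one aligned load and one aligned store per 4 pixels and never
     * touches memory outside the destination buffer. */
    while (width >= 4)
    {
        assert (((uintptr_t) dest & 15) == 0);

        for (i = 0; i < 4; ++i)
            dest[i] = add_saturate_pixel (src[i], dest[i]);
        dest += 4;
        src += 4;
        width -= 4;
    }

    /* Tail: the remaining 0 to 3 pixels, again one at a time. */
    while (width > 0)
    {
        *dest = add_saturate_pixel (*src, *dest);
        dest++;
        src++;
        width--;
    }
}
```

      Only the destination is aligned this way; the source may still be
      unaligned, which is harmless because it is only read.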
      
      This allows the new thread-test to pass on PowerPC VMX/Altivec systems
      and also resolves the "make check" failure reported for POWER7 hardware:
          http://lists.freedesktop.org/archives/pixman/2013-August/002871.html
    • test: Add new thread-test program · 0438435b
      Søren Sandmann Pedersen authored and Siarhei Siamashka committed
      This test program allocates an array of 16 * 7 uint32_ts and spawns 16
      threads that each use 7 of the allocated uint32_ts as a destination
      image for a large number of composite operations. Each thread then
      computes and returns a checksum for the image. Finally, the main
      thread computes a checksum of the checksums and verifies that it
      matches expectations.
      
      The purpose of this test is to catch errors where memory outside images
      is read and then written back. Such out-of-bounds accesses are broken
      when multiple threads are involved, because the threads will race to
      read and write the shared memory.
      
      V2:
      - Incorporate fixes from Siarhei for endianness and undefined behavior
        regarding argument evaluation
      - Make the images 7 pixels wide since the bug only happens when the
        composite width is greater than 4.
      - Compute a checksum of the checksums so that you don't have to
        update 16 values if something changes.
      
      V3: Remove stray dollar sign
    • Rename HAVE_PTHREAD_SETSPECIFIC to HAVE_PTHREADS · 65829504
      Søren Sandmann Pedersen authored and Siarhei Siamashka committed
      The test for pthread_setspecific() can be used as a general test for
      whether pthreads are available, so rename the variable from
      HAVE_PTHREAD_SETSPECIFIC to HAVE_PTHREADS and run the test even when
      better support for thread local variables is available.
      
      However, the pthread arguments are still only added to CFLAGS and
      LDFLAGS when pthread_setspecific() is used for thread local variables.
      
      V2: AC_SUBST(PTHREAD_CFLAGS)
  8. Sep 29, 2013
  9. Sep 27, 2013
  10. Sep 26, 2013
  11. Sep 20, 2013
  12. Sep 16, 2013
    • pixman-filter.c: Use 65536, not 65535, for fixed point conversion · 75506e63
      Søren Sandmann Pedersen authored
      Converting a double precision number to 16.16 fixed point should be
      done by multiplying with 65536.0, not 65535.0.
      
      The bug could potentially cause certain filters that would otherwise
      leave the image bit-for-bit unchanged under an identity
      transformation, to not do so, but the numbers are close enough that
      there weren't any visual differences.
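
      The difference is easy to see on the value 1.0. In the sketch below the
      type and function names are hypothetical; the only fact taken from the
      commit is that a 16.16 conversion scales by 65536.0 (2^16) rather than
      65535.0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name for a signed 16.16 fixed point value. */
typedef int32_t fixed_16_16_t;

/* Correct conversion: 16 fractional bits means scaling by 2^16. */
fixed_16_16_t
double_to_fixed (double d)
{
    assert (d >= -32768.0 && d < 32768.0);
    return (fixed_16_16_t) (d * 65536.0);
}

/* The off-by-one variant, kept only for comparison. */
fixed_16_16_t
double_to_fixed_buggy (double d)
{
    assert (d >= -32768.0 && d < 32768.0);
    return (fixed_16_16_t) (d * 65535.0);
}
```

      With the correct scale, 1.0 maps to exactly 0x10000, so an identity
      transformation stays an identity. With 65535.0 it maps to 0xffff, just
      below one.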
    • demos/scale.ui: Allow subsample_bits to be 0 · 9899a7ba
      Søren Sandmann Pedersen authored
      The separable convolution filter supports a subsample_bits of 0 which
      corresponds to no subsampling at all, so allow this value to be used
      in the scale demo.
    • ssse3: Add iterator for separable bilinear scaling · 58a79dfe
      Søren Sandmann Pedersen authored
      This new iterator uses the SSSE3 instructions pmaddubsw and pabsw to
      implement a fast iterator for bilinear scaling.
      
      There is a graph here recording the per-pixel time for various
      bilinear scaling algorithms as reported by scaling-bench:
      
          http://people.freedesktop.org/~sandmann/ssse3.v2/ssse3.v2.png
      
      As the graph shows, this new iterator is clearly faster than the
      existing C iterator, and when used with an SSE2 combiner, it is also
      faster than the existing SSE2 fast paths for upscaling, though not for
      downscaling.
      
      Another graph:
      
          http://people.freedesktop.org/~sandmann/ssse3.v2/movdqu.png
      
      shows the difference between writing to iter->buffer with movdqa,
      movdqu on an aligned buffer, and movdqu on a deliberately unaligned
      buffer. Since the differences are very small, the patch here avoids
      using movdqa because imposing alignment restrictions on iter->buffer
      may interfere with other optimizations, such as writing directly to
      the destination image.
      
      The data was measured with scaling-bench on a Sandy Bridge Core
      i3-2350M @ 2.3GHz and is available in this directory:
      
          http://people.freedesktop.org/~sandmann/ssse3.v2/
      
      where there is also a Gnumeric spreadsheet ssse3.v2.gnumeric
      containing the per-pixel values and the graph.
      
      V2:
      - Use uintptr_t instead of unsigned long in the ALIGN macro
      - Use _mm_storel_epi64 instead of _mm_cvtsi128_si64 as the latter form
        is not available on x86-32.
      - Use _mm_storeu_si128() instead of _mm_store_si128() to avoid
        imposing alignment requirements on iter->buffer
    • Add empty SSSE3 implementation · f1792b32
      Søren Sandmann Pedersen authored
      This commit adds a new, empty SSSE3 implementation and the associated
      build system support.
      
      configure.ac:   detect whether the compiler understands SSSE3
                      intrinsics and set up the required CFLAGS
      
      Makefile.am:    Add libpixman-ssse3.la
      
      pixman-x86.c:   Add X86_SSSE3 feature flag and detect it in
                      detect_cpu_features().
      
      pixman-ssse3.c: New file with an empty SSSE3 implementation
      
      V2: Remove SSSE3_LDFLAGS since it isn't necessary unless Solaris
      support is added.
    • general: Ensure that iter buffers are aligned to 16 bytes · f10b5449
      Søren Sandmann Pedersen authored
      At the moment iter buffers are only guaranteed to be aligned to a 4
      byte boundary. SIMD implementations benefit from the buffers being
      aligned to 16 bytes, so ensure this is the case.
      
      V2:
      - Use uintptr_t instead of unsigned long
      - allocate 3 * SCANLINE_BUFFER_LENGTH bytes on the stack rather than just
        SCANLINE_BUFFER_LENGTH
      - use sizeof (stack_scanline_buffer) instead of SCANLINE_BUFFER_LENGTH
        to determine overflow
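
      A sketch of the pointer rounding involved, assuming the usual
      add-and-mask idiom. The helper names are hypothetical and the real
      code may differ; the point taken from the notes above is that the
      arithmetic goes through uintptr_t and that the overflow check uses
      the real size of the storage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round a pointer up to the next 16-byte boundary. uintptr_t is used
 * because unsigned long can be narrower than a pointer on some
 * platforms (64-bit Windows, for instance). */
uint8_t *
align_to_16 (uint8_t *p)
{
    uint8_t *aligned = (uint8_t *) (((uintptr_t) p + 15) & ~(uintptr_t) 15);

    assert (((uintptr_t) aligned & 15) == 0);
    assert (aligned >= p && aligned - p < 16);

    return aligned;
}

/* Hypothetical helper: return a 16-byte aligned region of 'needed'
 * bytes inside 'storage', or NULL if it does not fit once the
 * alignment padding is accounted for. A caller would then fall back
 * to a heap allocation. */
uint8_t *
aligned_region (uint8_t *storage, size_t storage_size, size_t needed)
{
    uint8_t *start = align_to_16 (storage);
    size_t padding = (size_t) (start - storage);

    if (padding > storage_size || needed > storage_size - padding)
        return NULL;

    return start;
}
```

      Because alignment can consume up to 15 bytes, the storage has to be
      larger than the payload.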
    • sse2: faster bilinear scaling (pack 4 pixels to write with MOVDQA) · 700db9d8
      Siarhei Siamashka authored
      The loops are already unrolled, so it was just a matter of packing
      4 pixels into a single XMM register and doing aligned 128-bit
      writes to memory via MOVDQA instructions for the SRC compositing
      operator fast path. For the other fast paths, this XMM register
      is also directly routed to further processing instead of doing
      extra reshuffling. This replaces "8 PACKSSDW/PACKUSWB + 4 MOVD"
      instructions with "3 PACKSSDW/PACKUSWB + 1 MOVDQA" per 4 pixels,
      which results in a clear performance improvement.
      
      There are also some other (less important) tweaks:
      
      1. Convert 'pixman_fixed_t' to 'intptr_t' before using it as an
         index for addressing memory. The problem is that 'pixman_fixed_t'
         is a 32-bit data type and it has to be extended to 64-bit
         offsets, which needs extra instructions on 64-bit systems.
      
      2. Recalculate the horizontal interpolation weights only
         once per 4 pixels by treating the XMM register as four pairs
         of 16-bit values. Each of these 16-bit/16-bit pairs can be
         replicated to fill the whole 128-bit register by using PSHUFD
         instructions. So we get "3 PADDW/PSRLW + 4 PSHUFD" instructions
         per 4 pixels instead of "12 PADDW/PSRLW" per 4 pixels
         (or "3 PADDW/PSRLW" per each pixel).
      
         Now a good question is whether replacing "9 PADDW/PSRLW" with
         "4 PSHUFD" is a favourable exchange. As it turns out, PSHUFD
         instructions are very fast on new Intel processors (including
         Atoms), but are rather slow on the first generation of Core2
         (Merom) and on the other processors from that time or older.
         A good instruction latency/throughput table, covering all the
         relevant processors, can be found at:
              http://www.agner.org/optimize/instruction_tables.pdf
      
         Enabling this optimization is controlled by the PSHUFD_IS_FAST
         define in "pixman-sse2.c".
      
      3. One use of the PSHUFD instruction (_mm_shuffle_epi32 intrinsic) in
         the older code has also been replaced by its PUNPCKLQDQ equivalent
         (_mm_unpacklo_epi64 intrinsic) in the PSHUFD_IS_FAST=0 configuration.
         The PUNPCKLQDQ instruction is usually faster on older processors,
         but has some side effects (instead of fully overwriting the
         destination register like PSHUFD does, it retains half of the
         original value, which may inhibit some compiler optimizations).
      
      Benchmarks with "lowlevel-blt-bench -b src_8888_8888" using GCC 4.8.1 on
      an x86-64 system with default optimizations. The results are in MPix/s:
      
      ====== Intel Core2 T7300 (2GHz) ======
      
      old:                     src_8888_8888 =  L1: 128.69  L2: 125.07  M:124.86
                              over_8888_8888 =  L1:  83.19  L2:  81.73  M: 80.63
                            over_8888_n_8888 =  L1:  79.56  L2:  78.61  M: 77.85
                            over_8888_8_8888 =  L1:  77.15  L2:  75.79  M: 74.63
      
      new (PSHUFD_IS_FAST=0):  src_8888_8888 =  L1: 168.67  L2: 163.26  M:162.44
                              over_8888_8888 =  L1: 102.91  L2: 100.43  M: 99.01
                            over_8888_n_8888 =  L1:  97.40  L2:  95.64  M: 94.24
                            over_8888_8_8888 =  L1:  98.04  L2:  95.83  M: 94.33
      
      new (PSHUFD_IS_FAST=1):  src_8888_8888 =  L1: 154.67  L2: 149.16  M:148.48
                              over_8888_8888 =  L1:  95.97  L2:  93.90  M: 91.85
                            over_8888_n_8888 =  L1:  93.18  L2:  91.47  M: 90.15
                            over_8888_8_8888 =  L1:  95.33  L2:  93.32  M: 91.42
      
      ====== Intel Core i7 860 (2.8GHz) ======
      
      old:                     src_8888_8888 =  L1: 323.48  L2: 318.86  M:314.81
                              over_8888_8888 =  L1: 187.38  L2: 186.74  M:182.46
      
      new (PSHUFD_IS_FAST=0):  src_8888_8888 =  L1: 373.06  L2: 370.94  M:368.32
                              over_8888_8888 =  L1: 217.28  L2: 215.57  M:211.32
      
      new (PSHUFD_IS_FAST=1):  src_8888_8888 =  L1: 401.98  L2: 397.65  M:395.61
                              over_8888_8888 =  L1: 218.89  L2: 217.56  M:213.48
      
      The most interesting benchmark is "src_8888_8888" (because this code can
      be reused for a generic non-separable SSE2 bilinear fetch iterator).
      
      The results show that PSHUFD instructions are bad for Intel Core2 T7300
      (Merom core) and good for Intel Core i7 860 (Nehalem core). Both of these
      processors support SSSE3 instructions though, so they are not the primary
      targets for SSE2 code. But without having any other more relevant hardware
      to test, PSHUFD_IS_FAST=0 seems to be a reasonable default for SSE2 code
      and old processors (until the runtime CPU features detection becomes
      clever enough to recognize different microarchitectures).
      
      (Rebased on top of patch that removes support for 8-bit bilinear
       filtering -ssp)