Commits · 99e8605be08d4fa46eabf76355fc4e953f60b44f · Richard Henderson / pixman

Nov 02, 2013

Pre-release version bump to 0.31.2 · 99e8605b
Søren Sandmann Pedersen authored 11 years ago

View commits for tag pixman-0.31.2 pixman-0.31.2

99e8605b

pixman_trapezoid_valid(): Fix underflow when bottom is close to MIN_INT · 5e14da97

Ritesh Khadgaray authored 11 years ago

If t->bottom is close to MIN_INT (probably invalid value), subtracting
top can lead to underflow which causes crashes.  Attached patch will
fix the issue.

This fixes bug 67484.

5e14da97

test/trap-crasher.c: Add trapezoid that demonstrates a crash · 2f876cf8
Søren Sandmann Pedersen authored 11 years ago
```
This trapezoid causes a crash due to an underflow in the
pixman_trapezoid_valid().

Test case from Ritesh Khadgaray.
```
2f876cf8

Fix pixman build with older GCC releases · 8ef7e0d1

Brad Smith authored 11 years ago

The following patch fixes building pixman with older GCC releases
such as GCC 3.3 and older (OpenBSD; some older archs use GCC 3.3.6)
by changing the method of detecting the presence of __builtin_clz
to utilizing an autoconf check to determine its presence. Compilers
that pretend to be GCC, implement __builtin_clz and are already
utilizing the intrinsic include LLVM/Clang, Open64, EKOPath and
PCC.

8ef7e0d1

Oct 17, 2013

pixman-glyph.c: Add __force_align_arg_pointer to composite functions · 3c2f4b65

Søren Sandmann Pedersen authored 11 years ago

The functions pixman_composite_glyphs_no_mask() and
pixman_composite_glyphs() can call into code compiled with -msse2,
which requires the stack to be aligned to 16 bytes. Since the ABIs on
Windows and Linux for x86-32 don't provide this guarantee, we need to
use this attribute to make GCC generate a prologue that realigns the
stack.

This fixes the crash introduced in the previous commit and also

   https://bugs.freedesktop.org/show_bug.cgi?id=70348

and

   https://bugs.freedesktop.org/show_bug.cgi?id=68300

3c2f4b65

utils.c: On x86-32 unalign the stack before calling test_function · 3dce2297

Søren Sandmann Pedersen authored 11 years ago

GCC when compiling with -msse2 and -mssse3 will assume that the stack
is aligned to 16 bytes even on x86-32 and accordingly issue movdqa
instructions for stack allocated variables.

But despite what GCC thinks, the standard ABI on x86-32 only requires
a 4-byte aligned stack. This is true at least on Windows, but there
also was (and maybe still is) Linux code in the wild that assumed
this. When such code calls into pixman and hits something compiled
with -msse2, we get a segfault from the unaligned movdqas.

Pixman has worked around this issue in the past with the gcc attribute
"force_align_arg_pointer" but the problem has resurfaced now in

    https://bugs.freedesktop.org/show_bug.cgi?id=68300

because pixman_composite_glyphs() is missing this attribute.

This patch makes fuzzer_test_main() call the test_function through a
trampoline, which, on x86-32, has a bit of assembly that deliberately
avoids aligning the stack to 16 bytes as GCC normally expects. The
result is that glyph-test now crashes.

V2: Mark caller-save registers as clobbered, rather than using
noinline on the trampoline.

3dce2297

Oct 13, 2013

configure.ac: check and use -Wdeclaration-after-statement GCC option · 9e81419e

Siarhei Siamashka authored 11 years ago

The accidental use of declaration after statement breaks compilation
with C89 compilers such as MSVC. Assuming that MSVC is one of the
supported compilers, it makes sense to ask GCC to at least report
warnings for such problematic code.

9e81419e

sse2: bilinear fast path for src_x888_8888 · a863bbcc

Siarhei Siamashka authored 11 years ago

Running cairo-perf-trace benchmark on Intel Core2 T7300:

Before:
[  0]    image    t-firefox-canvas-swscroll    1.989    2.008   0.43%    8/8
[  1]    image        firefox-canvas-scroll    4.574    4.609   0.50%    8/8

After:
[  0]    image    t-firefox-canvas-swscroll    1.404    1.418   0.51%    8/8
[  1]    image        firefox-canvas-scroll    4.228    4.259   0.36%    8/8

a863bbcc

Oct 12, 2013

configure.ac: Add check for pmulhuw assembly · 8f75f638

Søren Sandmann Pedersen authored 11 years ago

Clang 3.0 chokes on the following bit of assembly

    asm ("pmulhuw %1, %0\n\t"
        : "+y" (__A)
        : "y" (__B)
    );

from pixman-mmx.c with this error message:

    fatal error: error in backend: Unsupported asm: input constraint
        with a matching output constraint of incompatible type!

So add a check in configure to only enable MMX when the compiler can
deal with it.

8f75f638

scale.c: Use int instead of kernel_t for values in named_int_t · 09a62d4d

Søren Sandmann Pedersen authored 11 years ago

The 'value' field in the 'named_int_t' struct is used for both
pixman_repeat_t and pixman_kernel_t values, so the type should be int,
not pixman_kernel_t.

Fixes some warnings like this

scale.c:124:33: warning: implicit conversion from enumeration
      type 'pixman_repeat_t' to different enumeration type
      'pixman_kernel_t' [-Wconversion]
    { "None",                   PIXMAN_REPEAT_NONE },
    ~                           ^~~~~~~~~~~~~~~~~~

when compiled with clang.

09a62d4d

pixman-combine32.c: Make Color Burn routine follow the math more closely · 93672438

Søren Sandmann Pedersen authored 11 years ago

For superluminescent destinations, the old code could underflow in

    uint32_t r = (ad - d) * as / s;

when (ad - d) was negative. The new code avoids this problem (and
therefore causes changes in the checksums of thread-test and
blitters-test), but it is likely still buggy due to the use of
unsigned variables and other issues in the blend mode code.

93672438

pixman-combine32: Make Color Dodge routine follow the math more closely · 105fa74f

Søren Sandmann Pedersen authored 11 years ago

Change blend_color_dodge() to follow the math in the comment more
closely.

Note, the new code here is in some sense worse than the old code
because it can now underflow the unsigned variables when the source is
superluminescent and (as - s) is therefore negative. The old code was
careful to clamp to 0.

But for superluminescent variables we really need the ability for the
blend function to become negative, and so the solution the underflow
problem is to just use signed variables. The use of unsigned variables
is a general problem in all of the blend mode code that will have to
be solved later.

The CRC32 values in thread-test and blitters-test are updated to
account for the changes in output.

105fa74f

pixman-combine32: Rename a number of variable from sa/sca to as/s · 2527a724

Søren Sandmann Pedersen authored 11 years ago

There are no semantic changes, just variables renames. The motivation
for these renames is so that the names are shorter and better match
the one used in the comments.

2527a724

pixman-combine32: Improve documentation for blend mode operators · eaa4778c

Søren Sandmann Pedersen authored 11 years ago

This commit overhauls the comments in pixman-comine32.c regarding
blend modes:

- Add a link to the PDF supplement that clarifies the specification of
  ColorBurn and ColorDodge

- Clarify how the formulas for premultiplied colors are derived form
  the ones in the PDF specifications

- Write out the derivation of the formulas in each blend routine

eaa4778c

pixman-combine32.c: Formatting fixes · 4bf1502f
Søren Sandmann Pedersen authored 11 years ago
```
Fix a bunch of spacing issues.

V2: More spacing issues, in the _ca combiners
```
4bf1502f

Oct 09, 2013

Fix thread-test on non-OpenMP systems · 54be1a52

Andrea Canciani authored 11 years ago

The non-reentrant versions of prng_* functions are thread-safe only in
OpenMP-enabled builds.

Fixes thread-test failing when compiled with Clang (both on Linux and
on MacOS).

54be1a52

Add support for SSSE3 to the MSVC build system · 0af2fcae
Andrea Canciani authored 11 years ago
```
Handle SSSE3 just like MMX and SSE2.
```
0af2fcae

Fix build of check-formats on MSVC · e4d9c623

Andrea Canciani authored 11 years ago

Fixes

check-formats.obj : error LNK2019: unresolved external symbol
_strcasecmp referenced in function _format_from_string

check-formats.obj : error LNK2019: unresolved external symbol
_snprintf referenced in function _list_operators

e4d9c623

Fix building of "other" programs on MSVC · 96ad6ebd

Andrea Canciani authored 11 years ago

In d1434d11 the benchmarks have been
extended to include other programs as well and the variable names have
been updated accordingly in the autotools-based build system, but not
in the MSVC one.

96ad6ebd

Fix build on MSVC · 31ac784f

Andrea Canciani authored 11 years ago

After a4c79d69 the MMX and SSE2 code
has some declarations after the beginning of a block, which is not
allowed by MSVC.

Fixes multiple errors like:

pixman-mmx.c(3625) : error C2275: '__m64' : illegal use of this type
as an expression

pixman-sse2.c(5708) : error C2275: '__m128i' : illegal use of this
type as an expression

31ac784f

Oct 04, 2013

fast: Swap image and iter flags in generated fast paths · c89f4c82

Søren Sandmann Pedersen authored 11 years ago

The generated fast paths that were moved into the 'fast'
implementation in ec0e38cb had their
image and iter flag arguments swapped; as a result, none of the fast
paths were ever called.

c89f4c82

Oct 01, 2013

vmx: there is no need to handle unaligned destination anymore · 7d05a7f4

Siarhei Siamashka authored 11 years ago

So the redundant variables, memory reads/writes and reshuffles
can be safely removed. For example, this makes the inner loop
of 'vmx_combine_add_u_no_mask' function much more simple.

Before:

    7a20:7d a8 48 ce lvx     v13,r8,r9
    7a24:7d 80 48 ce lvx     v12,r0,r9
    7a28:7d 28 50 ce lvx     v9,r8,r10
    7a2c:7c 20 50 ce lvx     v1,r0,r10
    7a30:39 4a 00 10 addi    r10,r10,16
    7a34:10 0d 62 eb vperm   v0,v13,v12,v11
    7a38:10 21 4a 2b vperm   v1,v1,v9,v8
    7a3c:11 2c 6a eb vperm   v9,v12,v13,v11
    7a40:10 21 4a 00 vaddubs v1,v1,v9
    7a44:11 a1 02 ab vperm   v13,v1,v0,v10
    7a48:10 00 0a ab vperm   v0,v0,v1,v10
    7a4c:7d a8 49 ce stvx    v13,r8,r9
    7a50:7c 00 49 ce stvx    v0,r0,r9
    7a54:39 29 00 10 addi    r9,r9,16
    7a58:42 00 ff c8 bdnz+   7a20 <.vmx_combine_add_u_no_mask+0x120>

After:

    76c0:7c 00 48 ce lvx     v0,r0,r9
    76c4:7d a8 48 ce lvx     v13,r8,r9
    76c8:39 29 00 10 addi    r9,r9,16
    76cc:7c 20 50 ce lvx     v1,r0,r10
    76d0:10 00 6b 2b vperm   v0,v0,v13,v12
    76d4:10 00 0a 00 vaddubs v0,v0,v1
    76d8:7c 00 51 ce stvx    v0,r0,r10
    76dc:39 4a 00 10 addi    r10,r10,16
    76e0:42 00 ff e0 bdnz+   76c0 <.vmx_combine_add_u_no_mask+0x120>

7d05a7f4

vmx: align destination to fix valgrind invalid memory writes · b6c5ba06

Siarhei Siamashka authored 11 years ago

The SIMD optimized inner loops in the VMX/Altivec code are trying
to emulate unaligned accesses to the destination buffer. For each
4 pixels (which fit into a 128-bit register) the current
implementation:
  1. first performs two aligned reads, which cover the needed data
  2. reshuffles bytes to get the needed data in a single vector register
  3. does all the necessary calculations
  4. reshuffles bytes back to their original location in two registers
  5. performs two aligned writes back to the destination buffer

Unfortunately in the case if the destination buffer is unaligned and
the width is a perfect multiple of 4 pixels, we may have some writes
crossing the boundaries of the destination buffer. In a multithreaded
environment this may potentially corrupt the data outside of the
destination buffer if it is concurrently read and written by some
other thread.

The valgrind report for blitters-test is full of:

==23085== Invalid write of size 8
==23085==    at 0x1004B0B4: vmx_combine_add_u (pixman-vmx.c:1089)
==23085==    by 0x100446EF: general_composite_rect (pixman-general.c:214)
==23085==    by 0x10002537: test_composite (blitters-test.c:363)
==23085==    by 0x1000369B: fuzzer_test_main._omp_fn.0 (utils.c:733)
==23085==    by 0x10004943: fuzzer_test_main (utils.c:728)
==23085==    by 0x10002C17: main (blitters-test.c:397)
==23085==  Address 0x5188218 is 0 bytes after a block of size 88 alloc'd
==23085==    at 0x4051DA0: memalign (vg_replace_malloc.c:581)
==23085==    by 0x4051E7B: posix_memalign (vg_replace_malloc.c:709)
==23085==    by 0x10004CFF: aligned_malloc (utils.c:833)
==23085==    by 0x10001DCB: create_random_image (blitters-test.c:47)
==23085==    by 0x10002263: test_composite (blitters-test.c:283)
==23085==    by 0x1000369B: fuzzer_test_main._omp_fn.0 (utils.c:733)
==23085==    by 0x10004943: fuzzer_test_main (utils.c:728)
==23085==    by 0x10002C17: main (blitters-test.c:397)

This patch addresses the problem by first aligning the destination
buffer at a 16 byte boundary in each combiner function. This trick
is borrowed from the pixman SSE2 code.

It allows to pass the new thread-test on PowerPC VMX/Altivec systems and
also resolves the "make check" failure reported for POWER7 hardware:
    http://lists.freedesktop.org/archives/pixman/2013-August/002871.html

b6c5ba06

test: Add new thread-test program · 0438435b

Søren Sandmann Pedersen authored 11 years ago and

Siarhei Siamashka committed 11 years ago

This test program allocates an array of 16 * 7 uint32_ts and spawns 16
threads that each use 7 of the allocated uint32_ts as a destination
image for a large number of composite operations. Each thread then
computes and returns a checksum for the image. Finally, the main
thread computes a checksum of the checksums and verifies that it
matches expectations.

The purpose of this test is catch errors where memory outside images
is read and then written back. Such out-of-bounds accesses are broken
when multiple threads are involved, because the threads will race to
read and write the shared memory.

V2:
- Incorporate fixes from Siarhei for endianness and undefined behavior
  regarding argument evaluation
- Make the images 7 pixels wide since the bug only happens when the
  composite width is greater than 4.
- Compute a checksum of the checksums so that you don't have to
  update 16 values if something changes.

V3: Remove stray dollar sign

0438435b

Rename HAVE_PTHREAD_SETSPECIFIC to HAVE_PTHREADS · 65829504

Søren Sandmann Pedersen authored 11 years ago and

Siarhei Siamashka committed 11 years ago

The test for pthread_setspecific() can be used as a general test for
whether pthreads are available, so rename the variable from
HAVE_PTHREAD_SETSPECIFIC to HAVE_PTHREADS and run the test even when
better support for thread local variables are available.

However, the pthread arguments are still only added to CFLAGS and
LDFLAGS when pthread_setspecific() is used for thread local variables.

V2: AC_SUBST(PTHREAD_CFLAGS)

65829504

Sep 29, 2013
- blitters-test: Remove unused variable · b513b3df
  Søren Sandmann Pedersen authored 11 years ago
  
  b513b3df
Sep 27, 2013

utils.c: Make image_endian_swap() deal with negative strides · fa0559eb

Søren Sandmann Pedersen authored 11 years ago

Use a temporary variable s containing the absolute value of the stride
as the upper bound in the inner loops.

V2: Do this for the bpp == 16 case as well

fa0559eb

Sep 26, 2013

utils.c: Make print_image actually cope with negative strides · ff682089

Søren Sandmann Pedersen authored 11 years ago

Commit 4312f077 claimed to have made
print_image() work with negative strides, but it didn't actually
work. When the stride was negative, the image buffer would be accessed
as if the stride were positive.

Fix the bug by not changing the stride variable and instead using a
temporary, s, that contains the absolute value of stride.

ff682089

Move generated affine fetchers into pixman-fast-path.c · ec0e38cb

Søren Sandmann Pedersen authored 11 years ago

The generated fetchers for NEAREST, BILINEAR, and
SEPARABLE_CONVOLUTION filters are fast paths and so they belong in
pixman-fast-path.c

ec0e38cb

Move bits_image_fetch_bilinear_no_repeat_8888 into pixman-fast-path.c · 96e163d2
Søren Sandmann Pedersen authored 11 years ago
```
This iterator is really a fast path, so it belongs in the fast path
implementation.
```
96e163d2

fast, ssse3: Simplify logic to fetch lines in the bilinear iterators · 8d465c2a

Søren Sandmann Pedersen authored 11 years ago

Instead of having logic to swap the lines around when one of them
doesn't match, store the two lines in an array and use the least
significant bit of the y coordinate as the index into that
array. Since the two lines always have different least significant
bits, they will never collide.

The effect is that lines corresponding to even y coordinates are
stored in info->lines[0] and lines corresponding to odd y coordinates
are stored in info->lines[1].

8d465c2a

Sep 20, 2013

test: Test negative strides · aa5c4525

Søren Sandmann Pedersen authored 11 years ago

Pixman supports negative strides, but up until now they haven't been
tested outside of stress-test. This commit adds testing of negative
strides to blitters-test, scaling-test, affine-test, rotate-test, and
composite-traps-test.

aa5c4525

test: Share the image printing code · 4312f077

Søren Sandmann Pedersen authored 11 years ago

The affine-test, blitters-test, and scaling-test all have the ability
to print out the bytes of the destination image. Share this code by
moving it to utils.c.

At the same time make the code work correctly with negative strides.

4312f077

{scaling,affine,composite-traps}-test: Use compute_crc32_for_image() · 51d71354
Søren Sandmann Pedersen authored 11 years ago
```
By using this function instead of compute_crc32() the alpha masking
code and the call to image_endian_swap() are not duplicated.
```
51d71354

Sep 16, 2013

pixman-filter.c: Use 65536, not 65535, for fixed point conversion · 75506e63

Søren Sandmann Pedersen authored 11 years ago

Converting a double precision number to 16.16 fixed point should be
done by multiplying with 65536.0, not 65535.0.

The bug could potentially cause certain filters that would otherwise
leave the image bit-for-bit unchanged under an identity
transformation, to not do so, but the numbers are close enough that
there weren't any visual differences.

75506e63

demos/scale.ui: Allow subsample_bits to be 0 · 9899a7ba

Søren Sandmann Pedersen authored 11 years ago

The separable convolution filter supports a subsample_bits of 0 which
corresponds to no subsampling at all, so allow this value to be used
in the scale demo.

9899a7ba

ssse3: Add iterator for separable bilinear scaling · 58a79dfe

Søren Sandmann Pedersen authored 11 years ago

This new iterator uses the SSSE3 instructions pmaddubsw and pabsw to
implement a fast iterator for bilinear scaling.

There is a graph here recording the per-pixel time for various
bilinear scaling algorithms as reported by scaling-bench:

    http://people.freedesktop.org/~sandmann/ssse3.v2/ssse3.v2.png

As the graph shows, this new iterator is clearly faster than the
existing C iterator, and when used with an SSE2 combiner, it is also
faster than the existing SSE2 fast paths for upscaling, though not for
downscaling.

Another graph:

    http://people.freedesktop.org/~sandmann/ssse3.v2/movdqu.png

shows the difference between writing to iter->buffer with movdqa,
movdqu on an aligned buffer, and movdqu on a deliberately unaligned
buffer. Since the differences are very small, the patch here avoids
using movdqa because imposing alignment restrictions on iter->buffer
may interfere with other optimizations, such as writing directly to
the destination image.

The data was measured with scaling-bench on a Sandy Bridge Core
i3-2350M @ 2.3GHz and is available in this directory:

    http://people.freedesktop.org/~sandmann/ssse3.v2/

where there is also a Gnumeric spreadsheet ssse3.v2.gnumeric
containing the per-pixel values and the graph.

V2:
- Use uintptr_t instead of unsigned long in the ALIGN macro
- Use _mm_storel_epi64 instead of _mm_cvtsi128_si64 as the latter form
  is not available on x86-32.
- Use _mm_storeu_si128() instead of _mm_store_si128() to avoid
  imposing alignment requirements on iter->buffer

58a79dfe

Add empty SSSE3 implementation · f1792b32

Søren Sandmann Pedersen authored 11 years ago

This commit adds a new, empty SSSE3 implementation and the associated
build system support.

configure.ac:   detect whether the compiler understands SSSE3
                intrinsics and set up the required CFLAGS

Makefile.am:    Add libpixman-ssse3.la

pixman-x86.c:   Add X86_SSSE3 feature flag and detect it in
                detect_cpu_features().

pixman-ssse3.c: New file with an empty SSSE3 implementation

V2: Remove SSSE3_LDFLAGS since it isn't necessary unless Solaris
support is added.

f1792b32

general: Ensure that iter buffers are aligned to 16 bytes · f10b5449

Søren Sandmann Pedersen authored 11 years ago

At the moment iter buffers are only guaranteed to be aligned to a 4
byte boundary. SIMD implementations benefit from the buffers being
aligned to 16 bytes, so ensure this is the case.

V2:
- Use uintptr_t instead of unsigned long
- allocate 3 * SCANLINE_BUFFER_LENGTH byte on stack rather than just
  SCANLINE_BUFFER_LENGTH
- use sizeof (stack_scanline_buffer) instead of SCANLINE_BUFFER_LENGTH
  to determine overflow

f10b5449

sse2: faster bilinear scaling (pack 4 pixels to write with MOVDQA) · 700db9d8

Siarhei Siamashka authored 11 years ago

The loops are already unrolled, so it was just a matter of packing
4 pixels into a single XMM register and doing aligned 128-bit
writes to memory via MOVDQA instructions for the SRC compositing
operator fast path. For the other fast paths, this XMM register
is also directly routed to further processing instead of doing
extra reshuffling. This replaces "8 PACKSSDW/PACKUSWB + 4 MOVD"
instructions with "3 PACKSSDW/PACKUSWB + 1 MOVDQA" per 4 pixels,
which results in a clear performance improvement.

There are also some other (less important) tweaks:

1. Convert 'pixman_fixed_t' to 'intptr_t' before using it as an
   index for addressing memory. The problem is that 'pixman_fixed_t'
   is a 32-bit data type and it has to be extended to 64-bit
   offsets, which needs extra instructions on 64-bit systems.

2. Allow to recalculate the horizontal interpolation weights only
   once per 4 pixels by treating the XMM register as four pairs
   of 16-bit values. Each of these 16-bit/16-bit pairs can be
   replicated to fill the whole 128-bit register by using PSHUFD
   instructions. So we get "3 PADDW/PSRLW + 4 PSHUFD" instructions
   per 4 pixels instead of "12 PADDW/PSRLW" per 4 pixels
   (or "3 PADDW/PSRLW" per each pixel).

   Now a good question is whether replacing "9 PADDW/PSRLW" with
   "4 PSHUFD" is a favourable exchange. As it turns out, PSHUFD
   instructions are very fast on new Intel processors (including
   Atoms), but are rather slow on the first generation of Core2
   (Merom) and on the other processors from that time or older.
   A good instructions latency/throughput table, covering all the
   relevant processors, can be found at:
        http://www.agner.org/optimize/instruction_tables.pdf

   Enabling this optimization is controlled by the PSHUFD_IS_FAST
   define in "pixman-sse2.c".

3. One use of PSHUFD instruction (_mm_shuffle_epi32 intrinsic) in
   the older code has been also replaced by PUNPCKLQDQ equivalent
   (_mm_unpacklo_epi64 intrinsic) in PSHUFD_IS_FAST=0 configuration.
   The PUNPCKLQDQ instruction is usually faster on older processors,
   but has some side effects (instead of fully overwriting the
   destination register like PSHUFD does, it retains half of the
   original value, which may inhibit some compiler optimizations).

Benchmarks with "lowlevel-blt-bench -b src_8888_8888" using GCC 4.8.1 on
x86-64 system and default optimizations. The results are in MPix/s:

====== Intel Core2 T7300 (2GHz) ======

old:                     src_8888_8888 =  L1: 128.69  L2: 125.07  M:124.86
                        over_8888_8888 =  L1:  83.19  L2:  81.73  M: 80.63
                      over_8888_n_8888 =  L1:  79.56  L2:  78.61  M: 77.85
                      over_8888_8_8888 =  L1:  77.15  L2:  75.79  M: 74.63

new (PSHUFD_IS_FAST=0):  src_8888_8888 =  L1: 168.67  L2: 163.26  M:162.44
                        over_8888_8888 =  L1: 102.91  L2: 100.43  M: 99.01
                      over_8888_n_8888 =  L1:  97.40  L2:  95.64  M: 94.24
                      over_8888_8_8888 =  L1:  98.04  L2:  95.83  M: 94.33

new (PSHUFD_IS_FAST=1):  src_8888_8888 =  L1: 154.67  L2: 149.16  M:148.48
                        over_8888_8888 =  L1:  95.97  L2:  93.90  M: 91.85
                      over_8888_n_8888 =  L1:  93.18  L2:  91.47  M: 90.15
                      over_8888_8_8888 =  L1:  95.33  L2:  93.32  M: 91.42

====== Intel Core i7 860 (2.8GHz) ======

old:                     src_8888_8888 =  L1: 323.48  L2: 318.86  M:314.81
                        over_8888_8888 =  L1: 187.38  L2: 186.74  M:182.46

new (PSHUFD_IS_FAST=0):  src_8888_8888 =  L1: 373.06  L2: 370.94  M:368.32
                        over_8888_8888 =  L1: 217.28  L2: 215.57  M:211.32

new (PSHUFD_IS_FAST=1):  src_8888_8888 =  L1: 401.98  L2: 397.65  M:395.61
                        over_8888_8888 =  L1: 218.89  L2: 217.56  M:213.48

The most interesting benchmark is "src_8888_8888" (because this code can
be reused for a generic non-separable SSE2 bilinear fetch iterator).

The results shows that PSHUFD instructions are bad for Intel Core2 T7300
(Merom core) and good for Intel Core i7 860 (Nehalem core). Both of these
processors support SSSE3 instructions though, so they are not the primary
targets for SSE2 code. But without having any other more relevant hardware
to test, PSHUFD_IS_FAST=0 seems to be a reasonable default for SSE2 code
and old processors (until the runtime CPU features detection becomes
clever enough to recognize different microarchitectures).

(Rebased on top of patch that removes support for 8-bit bilinear
 filtering -ssp)

700db9d8

Admin message

Admin message