- Nov 02, 2013
-
-
Ritesh Khadgaray authored
If t->bottom is close to MIN_INT (probably an invalid value), subtracting top can underflow, which causes crashes. The attached patch fixes the issue. This fixes bug 67484.
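As a rough illustration (not necessarily pixman's exact macro), the safe way to express such a validity check is to compare the two coordinates directly rather than subtract them:

```c
/* Hypothetical sketch of an overflow-safe validity check.  The exact
 * fields and macro in pixman may differ; the point is that a direct
 * comparison cannot overflow, while (bottom - top) is undefined
 * behaviour when bottom is close to INT32_MIN and top is positive,
 * and in practice can wrap to a "valid-looking" positive value. */
#include <stdint.h>

typedef struct { int32_t top, bottom; } trap_t;

static int
trap_height_is_positive (const trap_t *t)
{
    /* NOT: (t->bottom - t->top) > 0 */
    return t->bottom > t->top;
}
```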
-
Søren Sandmann Pedersen authored
This trapezoid causes a crash due to an underflow in pixman_trapezoid_valid(). Test case from Ritesh Khadgaray.
-
Brad Smith authored
The following patch fixes building pixman with older GCC releases such as GCC 3.3 and older (OpenBSD; some older archs use GCC 3.3.6) by changing the detection of __builtin_clz to an autoconf check. Compilers that claim to be GCC, implement __builtin_clz, and already use the intrinsic include LLVM/Clang, Open64, EKOPath and PCC.
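As a rough illustration of the shape such a check enables in C code (the HAVE_BUILTIN_CLZ macro name and the fallback are assumptions for this sketch, not necessarily what pixman defines):

```c
/* Sketch: use the intrinsic when configure found it, otherwise fall
 * back to a portable loop.  HAVE_BUILTIN_CLZ would be defined by an
 * autoconf link test on a call to __builtin_clz. */
#include <stdint.h>

static inline int
count_leading_zeros (uint32_t x)
{
#ifdef HAVE_BUILTIN_CLZ
    return __builtin_clz (x);   /* like the builtin: x must be non-zero */
#else
    int n = 32;

    while (x)                   /* portable fallback */
    {
        n--;
        x >>= 1;
    }
    return n;
#endif
}
```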
- Oct 17, 2013
-
-
Søren Sandmann Pedersen authored
The functions pixman_composite_glyphs_no_mask() and pixman_composite_glyphs() can call into code compiled with -msse2, which requires the stack to be aligned to 16 bytes. Since the ABIs on Windows and Linux for x86-32 don't provide this guarantee, we need the force_align_arg_pointer attribute to make GCC generate a prologue that realigns the stack.

This fixes the crash introduced in the previous commit and also https://bugs.freedesktop.org/show_bug.cgi?id=70348 and https://bugs.freedesktop.org/show_bug.cgi?id=68300
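A hedged sketch of the attribute usage described here; the guard macro and function name are illustrative, not pixman's exact code:

```c
/* GCC's force_align_arg_pointer attribute makes the function prologue
 * realign the stack to 16 bytes, so code compiled with -msse2 can be
 * reached safely from callers that only keep the stack 4-byte aligned.
 * The FORCE_STACK_ALIGN macro here is an assumption for illustration;
 * pixman keys this off its own configure checks. */
#if defined (__GNUC__) && defined (__i386__)
#define FORCE_STACK_ALIGN __attribute__ ((force_align_arg_pointer))
#else
#define FORCE_STACK_ALIGN
#endif

FORCE_STACK_ALIGN void
public_entry_point (void)
{
    /* ... may call into -msse2 compiled code that uses movdqa on
     * stack-allocated variables ... */
}
```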
-
Søren Sandmann Pedersen authored
When compiling with -msse2 and -mssse3, GCC will assume that the stack is aligned to 16 bytes even on x86-32 and accordingly issue movdqa instructions for stack-allocated variables. But despite what GCC thinks, the standard ABI on x86-32 only requires a 4-byte aligned stack. This is true at least on Windows, but there also was (and maybe still is) Linux code in the wild that assumed this. When such code calls into pixman and hits something compiled with -msse2, we get a segfault from the unaligned movdqas.

Pixman has worked around this issue in the past with the GCC attribute "force_align_arg_pointer", but the problem has resurfaced now in https://bugs.freedesktop.org/show_bug.cgi?id=68300 because pixman_composite_glyphs() is missing this attribute.

This patch makes fuzzer_test_main() call the test_function through a trampoline, which, on x86-32, has a bit of assembly that deliberately avoids aligning the stack to 16 bytes as GCC normally expects. The result is that glyph-test now crashes.

V2: Mark caller-save registers as clobbered, rather than using noinline on the trampoline.
-
- Oct 13, 2013
-
-
Siarhei Siamashka authored
The accidental use of declaration after statement breaks compilation with C89 compilers such as MSVC. Assuming that MSVC is one of the supported compilers, it makes sense to ask GCC to at least report warnings for such problematic code.
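For context, a small generic example of the construct and the GCC warning flag involved (not a specific pixman hunk):

```c
/* C99 allows a declaration after a statement; C89 compilers such as
 * MSVC (when building C) reject it.  GCC can flag these constructs
 * with -Wdeclaration-after-statement. */
void
example (int *data, int n)
{
    int i;

    for (i = 0; i < n; i++)
        data[i] = 0;

    int count = n;      /* declaration after a statement: fine in C99,
                         * an error for C89 compilers */
    (void) count;
}
```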
-
Siarhei Siamashka authored
Running cairo-perf-trace benchmark on Intel Core2 T7300:

Before:
[ 0]  image  t-firefox-canvas-swscroll   1.989   2.008   0.43%   8/8
[ 1]  image      firefox-canvas-scroll   4.574   4.609   0.50%   8/8

After:
[ 0]  image  t-firefox-canvas-swscroll   1.404   1.418   0.51%   8/8
[ 1]  image      firefox-canvas-scroll   4.228   4.259   0.36%   8/8
-
- Oct 12, 2013
-
-
Søren Sandmann Pedersen authored
Clang 3.0 chokes on the following bit of assembly

    asm ("pmulhuw %1, %0\n\t" : "+y" (__A) : "y" (__B) );

from pixman-mmx.c with this error message:

    fatal error: error in backend: Unsupported asm: input constraint with a
    matching output constraint of incompatible type!

So add a check in configure to only enable MMX when the compiler can deal with it.
-
Søren Sandmann Pedersen authored
The 'value' field in the 'named_int_t' struct is used for both pixman_repeat_t and pixman_kernel_t values, so the type should be int, not pixman_kernel_t.

Fixes some warnings like this

    scale.c:124:33: warning: implicit conversion from enumeration type
    'pixman_repeat_t' to different enumeration type 'pixman_kernel_t' [-Wconversion]
            { "None", PIXMAN_REPEAT_NONE },
                    ~ ^~~~~~~~~~~~~~~~~~

when compiled with clang.
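A minimal sketch of the shape of the fix; only the struct name and the 'value' field come from the message, the 'name' field and layout are assumptions:

```c
/* Using a plain int for 'value' lets one table entry type hold both
 * pixman_repeat_t and pixman_kernel_t constants without implicit
 * enum-to-enum conversions that clang's -Wconversion flags. */
typedef struct
{
    const char *name;
    int         value;   /* was pixman_kernel_t */
} named_int_t;
```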
-
Søren Sandmann Pedersen authored
For superluminescent destinations, the old code could underflow in

    uint32_t r = (ad - d) * as / s;

when (ad - d) was negative. The new code avoids this problem (and therefore causes changes in the checksums of thread-test and blitters-test), but it is likely still buggy due to the use of unsigned variables and other issues in the blend mode code.
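A hedged sketch of the failure mode and of the signed-arithmetic remedy the message alludes to; the variable names come from the quoted expression, everything else is assumed for illustration:

```c
#include <stdint.h>

/* Old shape of the computation: with unsigned 32-bit variables,
 * (ad - d) wraps to a huge value when d > ad (a superluminescent
 * destination), so r ends up nonsensically large. */
static uint32_t
blend_old (uint32_t ad, uint32_t d, uint32_t as, uint32_t s)
{
    return (ad - d) * as / s;                 /* wraps when ad < d; assumes s != 0 */
}

/* Doing the arithmetic in a signed, wider type keeps the intermediate
 * value meaningful (it may legitimately be negative) and lets the
 * caller clamp at the end. */
static uint32_t
blend_signed (uint32_t ad, uint32_t d, uint32_t as, uint32_t s)
{
    int64_t r = ((int64_t) ad - d) * as / s;  /* assumes s != 0 */

    if (r < 0)
        r = 0;

    return (uint32_t) r;
}
```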
-
Søren Sandmann Pedersen authored
Change blend_color_dodge() to follow the math in the comment more closely. Note that the new code here is in some sense worse than the old code because it can now underflow the unsigned variables when the source is superluminescent and (as - s) is therefore negative. The old code was careful to clamp to 0. But for superluminescent sources we really need the ability for the blend function to become negative, so the solution to the underflow problem is to just use signed variables. The use of unsigned variables is a general problem in all of the blend mode code that will have to be solved later. The CRC32 values in thread-test and blitters-test are updated to account for the changes in output.
-
Søren Sandmann Pedersen authored
There are no semantic changes, just variable renames. The motivation for these renames is to make the names shorter and to better match the ones used in the comments.
-
Søren Sandmann Pedersen authored
This commit overhauls the comments in pixman-combine32.c regarding blend modes:

 - Add a link to the PDF supplement that clarifies the specification of ColorBurn and ColorDodge
 - Clarify how the formulas for premultiplied colors are derived from the ones in the PDF specifications
 - Write out the derivation of the formulas in each blend routine
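For reference, the clarified ColorDodge definition from that supplement is usually written as follows (unpremultiplied, per channel; quoted from memory, so treat it as approximate):

```latex
B(C_b, C_s) =
\begin{cases}
  0 & \text{if } C_b = 0 \\
  1 & \text{if } C_b \neq 0 \text{ and } C_s = 1 \\
  \min\!\bigl(1,\; C_b / (1 - C_s)\bigr) & \text{otherwise}
\end{cases}
```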
-
Søren Sandmann Pedersen authored
Fix a bunch of spacing issues. V2: More spacing issues, in the _ca combiners
-
- Oct 09, 2013
-
-
Andrea Canciani authored
The non-reentrant versions of prng_* functions are thread-safe only in OpenMP-enabled builds. Fixes thread-test failing when compiled with Clang (both on Linux and on MacOS).
-
Andrea Canciani authored
Handle SSSE3 just like MMX and SSE2.
-
Andrea Canciani authored
Fixes

    check-formats.obj : error LNK2019: unresolved external symbol _strcasecmp referenced in function _format_from_string
    check-formats.obj : error LNK2019: unresolved external symbol _snprintf referenced in function _list_operators
-
Andrea Canciani authored
In d1434d11 the benchmarks have been extended to include other programs as well and the variable names have been updated accordingly in the autotools-based build system, but not in the MSVC one.
-
Andrea Canciani authored
After a4c79d69 the MMX and SSE2 code has some declarations after the beginning of a block, which is not allowed by MSVC.

Fixes multiple errors like:

    pixman-mmx.c(3625) : error C2275: '__m64' : illegal use of this type as an expression
    pixman-sse2.c(5708) : error C2275: '__m128i' : illegal use of this type as an expression
-
- Oct 04, 2013
-
-
Søren Sandmann Pedersen authored
The generated fast paths that were moved into the 'fast' implementation in ec0e38cb had their image and iter flag arguments swapped; as a result, none of the fast paths were ever called.
-
- Oct 01, 2013
-
-
Siarhei Siamashka authored
So the redundant variables, memory reads/writes and reshuffles can be safely removed. For example, this makes the inner loop of the 'vmx_combine_add_u_no_mask' function much simpler.

Before:

    7a20: 7d a8 48 ce   lvx     v13,r8,r9
    7a24: 7d 80 48 ce   lvx     v12,r0,r9
    7a28: 7d 28 50 ce   lvx     v9,r8,r10
    7a2c: 7c 20 50 ce   lvx     v1,r0,r10
    7a30: 39 4a 00 10   addi    r10,r10,16
    7a34: 10 0d 62 eb   vperm   v0,v13,v12,v11
    7a38: 10 21 4a 2b   vperm   v1,v1,v9,v8
    7a3c: 11 2c 6a eb   vperm   v9,v12,v13,v11
    7a40: 10 21 4a 00   vaddubs v1,v1,v9
    7a44: 11 a1 02 ab   vperm   v13,v1,v0,v10
    7a48: 10 00 0a ab   vperm   v0,v0,v1,v10
    7a4c: 7d a8 49 ce   stvx    v13,r8,r9
    7a50: 7c 00 49 ce   stvx    v0,r0,r9
    7a54: 39 29 00 10   addi    r9,r9,16
    7a58: 42 00 ff c8   bdnz+   7a20 <.vmx_combine_add_u_no_mask+0x120>

After:

    76c0: 7c 00 48 ce   lvx     v0,r0,r9
    76c4: 7d a8 48 ce   lvx     v13,r8,r9
    76c8: 39 29 00 10   addi    r9,r9,16
    76cc: 7c 20 50 ce   lvx     v1,r0,r10
    76d0: 10 00 6b 2b   vperm   v0,v0,v13,v12
    76d4: 10 00 0a 00   vaddubs v0,v0,v1
    76d8: 7c 00 51 ce   stvx    v0,r0,r10
    76dc: 39 4a 00 10   addi    r10,r10,16
    76e0: 42 00 ff e0   bdnz+   76c0 <.vmx_combine_add_u_no_mask+0x120>
-
Siarhei Siamashka authored
The SIMD optimized inner loops in the VMX/Altivec code are trying to emulate unaligned accesses to the destination buffer. For each 4 pixels (which fit into a 128-bit register) the current implementation:

 1. first performs two aligned reads, which cover the needed data
 2. reshuffles bytes to get the needed data in a single vector register
 3. does all the necessary calculations
 4. reshuffles bytes back to their original location in two registers
 5. performs two aligned writes back to the destination buffer

Unfortunately, if the destination buffer is unaligned and the width is a perfect multiple of 4 pixels, some of the writes may cross the boundaries of the destination buffer. In a multithreaded environment this may corrupt data outside of the destination buffer if it is concurrently read and written by some other thread. The valgrind report for blitters-test is full of:

    ==23085== Invalid write of size 8
    ==23085==    at 0x1004B0B4: vmx_combine_add_u (pixman-vmx.c:1089)
    ==23085==    by 0x100446EF: general_composite_rect (pixman-general.c:214)
    ==23085==    by 0x10002537: test_composite (blitters-test.c:363)
    ==23085==    by 0x1000369B: fuzzer_test_main._omp_fn.0 (utils.c:733)
    ==23085==    by 0x10004943: fuzzer_test_main (utils.c:728)
    ==23085==    by 0x10002C17: main (blitters-test.c:397)
    ==23085==  Address 0x5188218 is 0 bytes after a block of size 88 alloc'd
    ==23085==    at 0x4051DA0: memalign (vg_replace_malloc.c:581)
    ==23085==    by 0x4051E7B: posix_memalign (vg_replace_malloc.c:709)
    ==23085==    by 0x10004CFF: aligned_malloc (utils.c:833)
    ==23085==    by 0x10001DCB: create_random_image (blitters-test.c:47)
    ==23085==    by 0x10002263: test_composite (blitters-test.c:283)
    ==23085==    by 0x1000369B: fuzzer_test_main._omp_fn.0 (utils.c:733)
    ==23085==    by 0x10004943: fuzzer_test_main (utils.c:728)
    ==23085==    by 0x10002C17: main (blitters-test.c:397)

This patch addresses the problem by first aligning the destination buffer at a 16 byte boundary in each combiner function. This trick is borrowed from the pixman SSE2 code. It allows the new thread-test to pass on PowerPC VMX/Altivec systems and also resolves the "make check" failure reported for POWER7 hardware:

    http://lists.freedesktop.org/archives/pixman/2013-August/002871.html
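A sketch of the borrowed trick, with a placeholder per-pixel operation standing in for the real combiner math:

```c
#include <stdint.h>

/* Sketch of the alignment prologue borrowed from the SSE2 code: handle
 * single pixels until dest is 16-byte aligned, then the main loop can
 * use aligned vector loads/stores (lvx/stvx) that never touch memory
 * outside the destination buffer.  The XOR below is only a placeholder
 * for the real combine operation. */
static void
combine_sketch (uint32_t *dest, const uint32_t *src, int width)
{
    while (width > 0 && ((uintptr_t) dest & 15) != 0)
    {
        *dest++ ^= *src++;                 /* scalar leading pixels */
        width--;
    }

    while (width >= 4)
    {
        /* The real code loads/stores these 4 pixels as one 128-bit
         * aligned vector and combines them in a vector register. */
        dest[0] ^= src[0];
        dest[1] ^= src[1];
        dest[2] ^= src[2];
        dest[3] ^= src[3];
        dest += 4;
        src += 4;
        width -= 4;
    }

    while (width > 0)
    {
        *dest++ ^= *src++;                 /* scalar trailing pixels */
        width--;
    }
}
```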
-
This test program allocates an array of 16 * 7 uint32_ts and spawns 16 threads that each use 7 of the allocated uint32_ts as a destination image for a large number of composite operations. Each thread then computes and returns a checksum for the image. Finally, the main thread computes a checksum of the checksums and verifies that it matches expectations.

The purpose of this test is to catch errors where memory outside images is read and then written back. Such out-of-bounds accesses are broken when multiple threads are involved, because the threads will race to read and write the shared memory.

V2:
 - Incorporate fixes from Siarhei for endianness and undefined behavior regarding argument evaluation
 - Make the images 7 pixels wide since the bug only happens when the composite width is greater than 4
 - Compute a checksum of the checksums so that you don't have to update 16 values if something changes

V3: Remove stray dollar sign
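A minimal, pthread-based sketch of that harness shape. The workload and checksum are placeholders; the real thread-test.c drives pixman composite operations instead, and only the 16-thread / 7-pixel layout comes from the message:

```c
/* Build with: cc -pthread sketch.c */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define N_THREADS 16
#define DST_WIDTH 7

static uint32_t images[N_THREADS * DST_WIDTH];   /* one shared allocation */

static void *
thread_fn (void *arg)
{
    int       t   = (int) (intptr_t) arg;
    uint32_t *dst = images + t * DST_WIDTH;      /* this thread's slice */
    uint32_t  crc = 0;
    int       i, j;

    for (i = 0; i < 10000; i++)                  /* placeholder workload */
        for (j = 0; j < DST_WIDTH; j++)
            dst[j] = dst[j] * 1103515245u + (uint32_t) (t + j);

    for (j = 0; j < DST_WIDTH; j++)              /* per-thread checksum  */
        crc = crc * 31 + dst[j];

    return (void *) (uintptr_t) crc;
}

int
main (void)
{
    pthread_t threads[N_THREADS];
    uint32_t  total = 0;
    int       t;

    for (t = 0; t < N_THREADS; t++)
        pthread_create (&threads[t], NULL, thread_fn, (void *) (intptr_t) t);

    for (t = 0; t < N_THREADS; t++)
    {
        void *ret;

        pthread_join (threads[t], &ret);
        total = total * 31 + (uint32_t) (uintptr_t) ret;   /* checksum of checksums */
    }

    printf ("checksum: 0x%08x\n", (unsigned) total);
    return 0;
}
```

If a worker writes even one byte outside its own 7-pixel slice, a neighbouring thread's checksum changes nondeterministically, which is exactly the class of bug the real test is meant to expose.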
-
The test for pthread_setspecific() can be used as a general test for whether pthreads are available, so rename the variable from HAVE_PTHREAD_SETSPECIFIC to HAVE_PTHREADS and run the test even when better support for thread local variables is available. However, the pthread arguments are still only added to CFLAGS and LDFLAGS when pthread_setspecific() is used for thread local variables.

V2: AC_SUBST(PTHREAD_CFLAGS)
-
- Sep 29, 2013
-
-
Søren Sandmann Pedersen authored
-
- Sep 27, 2013
-
-
Søren Sandmann Pedersen authored
Use a temporary variable s containing the absolute value of the stride as the upper bound in the inner loops. V2: Do this for the bpp == 16 case as well
-
- Sep 26, 2013
-
-
Søren Sandmann Pedersen authored
Commit 4312f077 claimed to have made print_image() work with negative strides, but it didn't actually work. When the stride was negative, the image buffer would be accessed as if the stride were positive. Fix the bug by not changing the stride variable and instead using a temporary, s, that contains the absolute value of stride.
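A small sketch of the fixed pattern (hypothetical function; utils.c's print_image() differs in details such as bytes-per-pixel handling):

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <stddef.h>

/* 'stride' is in bytes and may be negative (rows laid out in reverse
 * order in memory).  Row addressing must use the signed stride, while
 * the loop bound over the bytes of a row uses its absolute value. */
static void
print_bytes (const uint8_t *bits, int stride, int height)
{
    int s = abs (stride);
    int i, j;

    for (i = 0; i < height; i++)
    {
        const uint8_t *row = bits + (ptrdiff_t) i * stride;

        for (j = 0; j < s; j++)
            printf ("%02x ", row[j]);
        printf ("\n");
    }
}
```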
-
Søren Sandmann Pedersen authored
The generated fetchers for NEAREST, BILINEAR, and SEPARABLE_CONVOLUTION filters are fast paths and so they belong in pixman-fast-path.c
-
Søren Sandmann Pedersen authored
This iterator is really a fast path, so it belongs in the fast path implementation.
-
Søren Sandmann Pedersen authored
Instead of having logic to swap the lines around when one of them doesn't match, store the two lines in an array and use the least significant bit of the y coordinate as the index into that array. Since the two lines always have different least significant bits, they will never collide. The effect is that lines corresponding to even y coordinates are stored in info->lines[0] and lines corresponding to odd y coordinates are stored in info->lines[1].
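A hedged sketch of the indexing scheme; the struct and field names are illustrative, not pixman's actual bilinear info layout:

```c
#include <stdint.h>

/* Cache of the two source lines needed for bilinear filtering.  The two
 * lines used for any output row have consecutive y coordinates, so their
 * least significant bits always differ: (y & 1) gives each line a fixed
 * slot and the two cached lines can never evict each other. */
typedef struct
{
    int       y;        /* which source line is cached, -1 if none */
    uint32_t *buffer;   /* expanded pixels for that line */
} line_t;

typedef struct
{
    line_t lines[2];    /* lines[0]: even y, lines[1]: odd y */
} line_cache_t;

static line_t *
get_line (line_cache_t *cache, int y)
{
    line_t *line = &cache->lines[y & 1];

    if (line->y != y)
    {
        /* fetch_and_expand_line (line->buffer, y);  -- hypothetical refill */
        line->y = y;
    }

    return line;
}
```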
-
- Sep 20, 2013
-
-
Søren Sandmann Pedersen authored
Pixman supports negative strides, but up until now they haven't been tested outside of stress-test. This commit adds testing of negative strides to blitters-test, scaling-test, affine-test, rotate-test, and composite-traps-test.
-
Søren Sandmann Pedersen authored
The affine-test, blitters-test, and scaling-test all have the ability to print out the bytes of the destination image. Share this code by moving it to utils.c. At the same time make the code work correctly with negative strides.
-
Søren Sandmann Pedersen authored
By using this function instead of compute_crc32() the alpha masking code and the call to image_endian_swap() are not duplicated.
-
- Sep 16, 2013
-
-
Søren Sandmann Pedersen authored
Converting a double precision number to 16.16 fixed point should be done by multiplying with 65536.0, not 65535.0. The bug could potentially cause certain filters that would otherwise leave the image bit-for-bit unchanged under an identity transformation, to not do so, but the numbers are close enough that there weren't any visual differences.
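In code, the conversion is a multiply by 2^16; a minimal sketch (pixman's own helper may differ in rounding details):

```c
#include <stdint.h>

typedef int32_t fixed_16_16_t;

static fixed_16_16_t
double_to_fixed (double d)
{
    /* One unit in 16.16 fixed point is 1/65536, so the scale factor is
     * 65536.0.  Using 65535.0 makes e.g. 1.0 map to 0xFFFF instead of
     * 0x10000, which is the subtle off-by-one this commit fixes. */
    return (fixed_16_16_t) (d * 65536.0);
}
```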
-
Søren Sandmann Pedersen authored
The separable convolution filter supports a subsample_bits of 0 which corresponds to no subsampling at all, so allow this value to be used in the scale demo.
-
Søren Sandmann Pedersen authored
This new iterator uses the SSSE3 instructions pmaddubsw and pabsw to implement a fast iterator for bilinear scaling.

There is a graph here recording the per-pixel time for various bilinear scaling algorithms as reported by scaling-bench:

    http://people.freedesktop.org/~sandmann/ssse3.v2/ssse3.v2.png

As the graph shows, this new iterator is clearly faster than the existing C iterator, and when used with an SSE2 combiner, it is also faster than the existing SSE2 fast paths for upscaling, though not for downscaling.

Another graph:

    http://people.freedesktop.org/~sandmann/ssse3.v2/movdqu.png

shows the difference between writing to iter->buffer with movdqa, movdqu on an aligned buffer, and movdqu on a deliberately unaligned buffer. Since the differences are very small, the patch here avoids using movdqa because imposing alignment restrictions on iter->buffer may interfere with other optimizations, such as writing directly to the destination image.

The data was measured with scaling-bench on a Sandy Bridge Core i3-2350M @ 2.3GHz and is available in this directory:

    http://people.freedesktop.org/~sandmann/ssse3.v2/

where there is also a Gnumeric spreadsheet ssse3.v2.gnumeric containing the per-pixel values and the graph.

V2:
 - Use uintptr_t instead of unsigned long in the ALIGN macro
 - Use _mm_storel_epi64 instead of _mm_cvtsi128_si64 as the latter form is not available on x86-32
 - Use _mm_storeu_si128() instead of _mm_store_si128() to avoid imposing alignment requirements on iter->buffer
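As a rough illustration of the core pmaddubsw idea, here is a standalone sketch of horizontal interpolation between two pixels. It deliberately uses 6-bit weights so both weights fit in a signed byte; pixman's iterator uses a different fixed-point layout and also handles the vertical pass, so treat this only as an illustration of the instruction:

```c
/* Build with -mssse3. */
#include <tmmintrin.h>
#include <stdint.h>

/* Interpolate two 32-bit pixels per channel: p0 * (64 - wr) + p1 * wr,
 * divided by 64, using one _mm_maddubs_epi16 (pmaddubsw). */
static uint32_t
interp_horizontal (uint32_t p0, uint32_t p1, int wr /* 0..64 */)
{
    int wl = 64 - wr;

    __m128i a = _mm_cvtsi32_si128 ((int) p0);
    __m128i b = _mm_cvtsi32_si128 ((int) p1);

    /* Interleave channel bytes: [p0.c0, p1.c0, p0.c1, p1.c1, ...] */
    __m128i pix = _mm_unpacklo_epi8 (a, b);

    /* Weight pairs (wl, wr) per channel, low byte = wl, high byte = wr. */
    __m128i w = _mm_set1_epi16 ((short) ((wr << 8) | wl));

    /* u8 * s8 products, adjacent pairs summed into 16-bit lanes:
     * each lane now holds p0.c * wl + p1.c * wr. */
    __m128i sum = _mm_maddubs_epi16 (pix, w);

    /* Divide by 64 and repack to 8-bit channels. */
    sum = _mm_srli_epi16 (sum, 6);
    sum = _mm_packus_epi16 (sum, sum);

    return (uint32_t) _mm_cvtsi128_si32 (sum);
}
```

With wr = 0 the function returns p0 unchanged and with wr = 64 it returns p1, which is an easy sanity check on the weight layout.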
-
Søren Sandmann Pedersen authored
This commit adds a new, empty SSSE3 implementation and the associated build system support.

configure.ac:   detect whether the compiler understands SSSE3 intrinsics and set up the required CFLAGS
Makefile.am:    Add libpixman-ssse3.la
pixman-x86.c:   Add X86_SSSE3 feature flag and detect it in detect_cpu_features().
pixman-ssse3.c: New file with an empty SSSE3 implementation

V2: Remove SSSE3_LDFLAGS since it isn't necessary unless Solaris support is added.
-
Søren Sandmann Pedersen authored
At the moment iter buffers are only guaranteed to be aligned to a 4 byte boundary. SIMD implementations benefit from the buffers being aligned to 16 bytes, so ensure this is the case.

V2:
 - Use uintptr_t instead of unsigned long
 - Allocate 3 * SCANLINE_BUFFER_LENGTH bytes on the stack rather than just SCANLINE_BUFFER_LENGTH
 - Use sizeof (stack_scanline_buffer) instead of SCANLINE_BUFFER_LENGTH to determine overflow
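The usual pattern, sketched with the uintptr_t-based alignment the message mentions; the buffer size value and the ALIGN16 macro name are assumptions for this sketch:

```c
#include <stdint.h>

#define SCANLINE_BUFFER_LENGTH 8192   /* illustrative value */

/* Round a pointer up to the next 16-byte boundary. */
#define ALIGN16(p) \
    ((uint8_t *) (((uintptr_t) (p) + 15) & ~(uintptr_t) 15))

static void
example (void)
{
    /* Over-allocate so the usable region can be moved up to a 16-byte
     * boundary and still hold the required number of bytes. */
    uint8_t  stack_scanline_buffer[3 * SCANLINE_BUFFER_LENGTH];
    uint8_t *scanline_buffer = ALIGN16 (stack_scanline_buffer);

    /* scanline_buffer now points at a 16-byte aligned region inside
     * stack_scanline_buffer, suitable for aligned SIMD accesses. */
    (void) scanline_buffer;
}
```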
-
Siarhei Siamashka authored
The loops are already unrolled, so it was just a matter of packing 4 pixels into a single XMM register and doing aligned 128-bit writes to memory via MOVDQA instructions for the SRC compositing operator fast path. For the other fast paths, this XMM register is also directly routed to further processing instead of doing extra reshuffling. This replaces "8 PACKSSDW/PACKUSWB + 4 MOVD" instructions with "3 PACKSSDW/PACKUSWB + 1 MOVDQA" per 4 pixels, which results in a clear performance improvement.

There are also some other (less important) tweaks:

1. Convert 'pixman_fixed_t' to 'intptr_t' before using it as an index for addressing memory. The problem is that 'pixman_fixed_t' is a 32-bit data type and it has to be extended to 64-bit offsets, which needs extra instructions on 64-bit systems.

2. Allow the horizontal interpolation weights to be recalculated only once per 4 pixels by treating the XMM register as four pairs of 16-bit values. Each of these 16-bit/16-bit pairs can be replicated to fill the whole 128-bit register by using PSHUFD instructions. So we get "3 PADDW/PSRLW + 4 PSHUFD" instructions per 4 pixels instead of "12 PADDW/PSRLW" per 4 pixels (or "3 PADDW/PSRLW" per each pixel).

   Now a good question is whether replacing "9 PADDW/PSRLW" with "4 PSHUFD" is a favourable exchange. As it turns out, PSHUFD instructions are very fast on new Intel processors (including Atoms), but are rather slow on the first generation of Core2 (Merom) and on the other processors from that time or older. A good instruction latency/throughput table, covering all the relevant processors, can be found at:

       http://www.agner.org/optimize/instruction_tables.pdf

   Enabling this optimization is controlled by the PSHUFD_IS_FAST define in "pixman-sse2.c".

3. One use of the PSHUFD instruction (_mm_shuffle_epi32 intrinsic) in the older code has also been replaced by a PUNPCKLQDQ equivalent (_mm_unpacklo_epi64 intrinsic) in the PSHUFD_IS_FAST=0 configuration. The PUNPCKLQDQ instruction is usually faster on older processors, but has some side effects (instead of fully overwriting the destination register like PSHUFD does, it retains half of the original value, which may inhibit some compiler optimizations).

Benchmarks with "lowlevel-blt-bench -b src_8888_8888" using GCC 4.8.1 on an x86-64 system with default optimizations. The results are in MPix/s:

====== Intel Core2 T7300 (2GHz) ======

old:
  src_8888_8888    = L1: 128.69  L2: 125.07  M: 124.86
  over_8888_8888   = L1:  83.19  L2:  81.73  M:  80.63
  over_8888_n_8888 = L1:  79.56  L2:  78.61  M:  77.85
  over_8888_8_8888 = L1:  77.15  L2:  75.79  M:  74.63

new (PSHUFD_IS_FAST=0):
  src_8888_8888    = L1: 168.67  L2: 163.26  M: 162.44
  over_8888_8888   = L1: 102.91  L2: 100.43  M:  99.01
  over_8888_n_8888 = L1:  97.40  L2:  95.64  M:  94.24
  over_8888_8_8888 = L1:  98.04  L2:  95.83  M:  94.33

new (PSHUFD_IS_FAST=1):
  src_8888_8888    = L1: 154.67  L2: 149.16  M: 148.48
  over_8888_8888   = L1:  95.97  L2:  93.90  M:  91.85
  over_8888_n_8888 = L1:  93.18  L2:  91.47  M:  90.15
  over_8888_8_8888 = L1:  95.33  L2:  93.32  M:  91.42

====== Intel Core i7 860 (2.8GHz) ======

old:
  src_8888_8888    = L1: 323.48  L2: 318.86  M: 314.81
  over_8888_8888   = L1: 187.38  L2: 186.74  M: 182.46

new (PSHUFD_IS_FAST=0):
  src_8888_8888    = L1: 373.06  L2: 370.94  M: 368.32
  over_8888_8888   = L1: 217.28  L2: 215.57  M: 211.32

new (PSHUFD_IS_FAST=1):
  src_8888_8888    = L1: 401.98  L2: 397.65  M: 395.61
  over_8888_8888   = L1: 218.89  L2: 217.56  M: 213.48

The most interesting benchmark is "src_8888_8888" (because this code can be reused for a generic non-separable SSE2 bilinear fetch iterator).
The results show that PSHUFD instructions are bad for the Intel Core2 T7300 (Merom core) and good for the Intel Core i7 860 (Nehalem core). Both of these processors support SSSE3 instructions though, so they are not the primary targets for SSE2 code. But without any other more relevant hardware to test on, PSHUFD_IS_FAST=0 seems to be a reasonable default for SSE2 code and old processors (until the runtime CPU features detection becomes clever enough to recognize different microarchitectures).

(Rebased on top of the patch that removes support for 8-bit bilinear filtering -ssp)
-