Commits · 4d8d2fa47e457e3c8a5ab956b52cff4785aa45c3 · Richard Henderson / pixman

Dec 17, 2010
- COPYING: added Nokia to the list of copyright holders · 4d8d2fa4
  Siarhei Siamashka authored 14 years ago
  
  4d8d2fa4
Dec 07, 2010

Fix for potential unaligned memory accesses · 3d094997

The temporary scanline buffer allocated on stack was declared
as uint8_t array. As a result, the compiler was free to select
any arbitrary alignment for it (even though there is typically
no reason to use really weird alignments here and the stack is
normally at least 4 bytes aligned on most platforms). Having
improper alignment is non-portable and can impact performance
or even make the code misbehave depending on the target platform.

Using uint64_t type for this array should ensure that any possible
memory accesses done by pixman code are going to be handled correctly
(pixman-combine64.c can access this buffer via uint64_t * pointer).

An alignment-related problem was reported in:
http://lists.freedesktop.org/archives/pixman/2010-November/000747.html
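
The following is a minimal sketch of the idea behind this fix, not the actual pixman code. The names `SCANLINE_BUFFER_LENGTH`, `is_aligned_for_u64` and `scratch_buffer_is_aligned` are assumptions made for illustration. Giving the stack buffer a uint64_t element type means the compiler has to align it for uint64_t access, while the code can still view it as bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical buffer size, chosen only for this illustration. */
#define SCANLINE_BUFFER_LENGTH 8192

/* Returns nonzero if p is suitably aligned for access via uint64_t *. */
static int is_aligned_for_u64(const void *p)
{
    return ((uintptr_t)p % _Alignof(uint64_t)) == 0; /* _Alignof needs C11 */
}

/* Hypothetical helper: declares the scratch buffer and checks its alignment. */
static int scratch_buffer_is_aligned(void)
{
    /* A declaration such as
     *     uint8_t buf[SCANLINE_BUFFER_LENGTH * 3];
     * only guarantees byte alignment. Declaring the same amount of
     * storage with a uint64_t element type (rounded up to whole
     * elements) guarantees alignment for uint64_t. */
    uint64_t buf[(SCANLINE_BUFFER_LENGTH * 3 + 7) / 8];
    uint8_t *bytes = (uint8_t *)buf; /* byte view of the same storage */

    bytes[0] = 0;
    return is_aligned_for_u64(bytes);
}
```

The byte view is what 8-bit code paths would use; wider code paths can cast the same storage to `uint32_t *` or `uint64_t *` without alignment concerns.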

3d094997

ARM: added 'neon_src_rpixbuf_8888' fast path · 985e59a8

Siarhei Siamashka authored 14 years ago

With this optimization added, pixman assisted conversion from
non-premultiplied to premultiplied alpha format is now fully
NEON optimized (both with and without R/B color components
swapping in the process).
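
For reference, the conversion that this fast path accelerates can be sketched in scalar C. This is an illustrative sketch only: the 0xAARRGGBB pixel layout and the names `mul_un8` and `premultiply_pixel` are assumptions, and the real implementation is NEON assembly.

```c
#include <stdint.h>

/* Rounded (c * a) / 255 for 8-bit values, using the usual shift trick. */
static uint8_t mul_un8(uint8_t c, uint8_t a)
{
    uint32_t t = (uint32_t)c * a + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Converts one non-premultiplied pixel (assumed layout 0xAARRGGBB) to
 * premultiplied alpha. If swap_rb is nonzero, the red and blue
 * components are exchanged in the process. */
static uint32_t premultiply_pixel(uint32_t p, int swap_rb)
{
    uint8_t a = (uint8_t)(p >> 24);
    uint8_t r = mul_un8((uint8_t)(p >> 16), a);
    uint8_t g = mul_un8((uint8_t)(p >> 8), a);
    uint8_t b = mul_un8((uint8_t)p, a);

    if (swap_rb) {
        uint8_t tmp = r;
        r = b;
        b = tmp;
    }
    return ((uint32_t)a << 24) | ((uint32_t)r << 16) |
           ((uint32_t)g << 8) | (uint32_t)b;
}
```

A fully opaque pixel passes through unchanged apart from the optional swap, and a fully transparent pixel becomes 0.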

985e59a8

Dec 03, 2010

ARM: added 'neon_composite_in_n_8' fast path · 733f6891
Siarhei Siamashka authored 14 years ago

733f6891

ARM: added flags parameter to some asm fast path wrapper macros · af7a69d9

Siarhei Siamashka authored 14 years ago

Not all types of operations can be skipped when having transparent
solid source or transparent solid mask. Add an extra flags parameter
for providing this information to the wrappers.
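
A rough sketch of why such a flag is needed, written in plain C rather than with the actual wrapper macros. The flag names, `composite_solid` and the two example operations are assumptions for illustration: an ADD with a transparent solid source leaves the destination untouched and can be skipped, while a SRC with the same source must still clear the destination.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values for this sketch. */
#define SKIP_ZERO_SRC  1u
#define SKIP_ZERO_MASK 2u

typedef void (*solid_op_fn)(uint32_t *dst, size_t n, uint32_t src);

/* SRC operator: the destination is replaced by the source. */
static void op_src(uint32_t *dst, size_t n, uint32_t src)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src;
}

/* ADD operator: per-channel saturating addition. */
static void op_add(uint32_t *dst, size_t n, uint32_t src)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t out = 0;
        for (int shift = 0; shift < 32; shift += 8) {
            uint32_t c = ((dst[i] >> shift) & 0xff) + ((src >> shift) & 0xff);
            if (c > 0xff)
                c = 0xff;
            out |= c << shift;
        }
        dst[i] = out;
    }
}

/* Wrapper: returns 0 if the operation was skipped, 1 if it ran.
 * The early exit is taken only when the caller says it is safe. */
static int composite_solid(solid_op_fn fn, unsigned flags,
                           uint32_t *dst, size_t n, uint32_t src)
{
    if ((flags & SKIP_ZERO_SRC) && src == 0)
        return 0;
    fn(dst, n, src);
    return 1;
}
```

Here ADD would be registered with `SKIP_ZERO_SRC` and SRC with no flags, so only the operation for which a zero source is a no-op takes the shortcut.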

af7a69d9

ARM: added 'neon_composite_add_8888_n_8888' fast path · f6843e37
Siarhei Siamashka authored 14 years ago

f6843e37
ARM: added 'neon_composite_add_n_8_8888' fast path · b066b520
Siarhei Siamashka authored 14 years ago

b066b520

ARM: better NEON instructions scheduling for add_8888_8888_8888 · 1fba7790

Siarhei Siamashka authored 14 years ago

Provides a minor performance improvement by using pipelining and hiding

instruction latencies. Also do not clobber d0-d3 registers (source
image pixels) while doing calculations in order to allow the use of
the same macro for add_n_8_8888 fast path later.

Benchmark from ARM Cortex-A8 @500MHz:

== before ==

  add_8888_8888_8888 = L1:  95.94  L2:  42.27  M: 25.60 (121.09%)
                       HT:  14.54  VT:  13.13  R: 12.77  RT:  4.49 (48Kops/s)
     add_8888_8_8888 = L1: 104.51  L2:  57.81  M: 36.06 (106.62%)
                       HT:  19.24  VT:  16.45  R: 14.71  RT:  4.80 (51Kops/s)

== after ==

  add_8888_8888_8888 = L1: 106.66  L2:  47.82  M: 27.32 (129.30%)
                       HT:  15.44  VT:  13.96  R: 12.86  RT:  4.48 (48Kops/s)
     add_8888_8_8888 = L1: 107.72  L2:  61.02  M: 38.26 (113.16%)
                       HT:  19.48  VT:  16.72  R: 14.82  RT:  4.80 (51Kops/s)

1fba7790

ARM: added 'neon_composite_add_8888_8_8888' fast path · c3f48b6a
Siarhei Siamashka authored 14 years ago

c3f48b6a
ARM: added 'neon_composite_over_0565_n_0565' fast path · 6d2f7f98
Siarhei Siamashka authored 14 years ago

6d2f7f98

ARM: reuse common NEON code for over_{n_8|8888_n|8888_8}_0565 · 3990931b

Siarhei Siamashka authored 14 years ago

Renamed supplementary macros from 'over_n_8_0565' to 'over_8888_8_0565',
because they can actually support all variants of this operation:
over_8888_8_0565/over_n_8_0565/over_8888_n_0565.

Also 'over_8888_8_0565' now uses more optimized common code instead of its
own variant, improving performance a bit. Even though this operation is
still memory bandwidth limited, scaled variants of these fast paths may
put more stress on CPU later.

Benchmarked on ARM Cortex-A8 @500MHz:

== before ==

    over_8888_8_0565 =  L1:  67.10  L2:  53.82  M: 44.70 (105.17%)
                        HT:  18.73  VT:  16.91  R: 14.25  RT:  4.80 (52Kops/s)

== after ==

    over_8888_8_0565 =  L1:  77.83  L2:  58.14  M: 44.82 (105.52%)
                        HT:  20.58  VT:  17.44  R: 15.05  RT:  4.88 (52Kops/s)

3990931b

ARM: added 'neon_composite_over_8888_n_0565' fast path · a7c36681
Siarhei Siamashka authored 14 years ago

a7c36681

ARM: better NEON instructions scheduling for over_n_8_0565 · e6814837

Siarhei Siamashka authored 14 years ago

Code rearranged to get better instruction scheduling for ARM Cortex-A8/A9.
Now it is ~30% faster for the pixel data in L1 cache and makes better use
of memory bandwidth when running at lower clock frequencies (e.g. 500MHz).
Also register d24 (pixels from the mask image) is now not clobbered by
supplementary macros, which allows reusing them for the other variants
of compositing operations later.

Benchmark from ARM Cortex-A8 @500MHz:

== before ==

    over_n_8_0565 =  L1:  63.90  L2:  63.15  M: 60.97 ( 73.53%)
                     HT:  28.89  VT:  24.14  R: 21.33  RT:  6.78 (  67Kops/s)

== after ==

    over_n_8_0565 =  L1:  82.64  L2:  75.19  M: 71.52 ( 84.14%)
                     HT:  30.49  VT:  25.56  R: 22.36  RT:  6.89 (  68Kops/s)

e6814837

ARM: introduced 'fetch_mask_pixblock' macro to simplify code · 3be86a92

Siarhei Siamashka authored 14 years ago

This macro hides the implementation details of pixel fetching
for the mask image just like 'fetch_src_pixblock' does for the
source image. This provides more possibilities for reusing the
same code blocks in different compositing functions.

This patch does not introduce any functional changes and the
resulting code in the compiled object file is exactly the same.

3be86a92

ARM: added 'neon_composite_over_n_8_8' fast path · 98d08b37
Siarhei Siamashka authored 14 years ago

98d08b37

Nov 22, 2010

C fast path for a1 fill operation · 4b5b5a2a

Siarhei Siamashka authored 14 years ago

Can be used as one of the solutions to fix bug
https://bugs.freedesktop.org/show_bug.cgi?id=31604

4b5b5a2a

Nov 21, 2010
- Sun's copyrights belong to Oracle now · 654961ef
  Alan Coopersmith authored 14 years ago
  
  Signed-off-by: Alan Coopersmith <alan.coopersmith@oracle.com>
  654961ef
Nov 19, 2010

Fix argument quoting for AC_INIT. · e7ee43c3

Cyril Brulebois authored 14 years ago


One gets rid of this accordingly:
| autoreconf -vfi
| autoreconf: Entering directory `.'
| autoreconf: configure.ac: not using Gettext
| autoreconf: running: aclocal --force
| configure.ac:61: warning: AC_INIT: not a literal: "pixman@lists.freedesktop.org"
| autoreconf: configure.ac: tracing
| configure.ac:61: warning: AC_INIT: not a literal: "pixman@lists.freedesktop.org"

Signed-off-by: Cyril Brulebois <kibi@debian.org>

e7ee43c3

Nov 16, 2010
- Post-release version bump to 0.21.3 · c59db8af
  Søren Sandmann Pedersen authored 14 years ago
  
  c59db8af
- Pre-release version bump · 4646c238
  Søren Sandmann Pedersen authored 14 years ago
  
  Tag: pixman-0.21.2
  
  4646c238
- Generate {a,x}8r8g8b8, a8, 565 fetchers for nearest/affine images · 536cf4dd
  Søren Sandmann Pedersen authored 14 years ago
  
  There are versions for all combinations of x8r8g8b8/a8r8g8b8 and pad/repeat/none/normal repeat modes. The bulk of each function is an inline function that takes a format and a repeat mode as parameters.
  536cf4dd
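
The pattern described in this commit (one inline function parameterized by format and repeat mode, instantiated once per combination) might look roughly like the sketch below. The enum, the `MAKE_FETCHER` macro and all function names are assumptions for illustration, and only a one-dimensional nearest lookup on 32-bit pixels is shown.

```c
#include <stdint.h>

/* Hypothetical repeat modes for this sketch. */
typedef enum { REPEAT_NONE, REPEAT_NORMAL, REPEAT_PAD } repeat_mode_t;

/* Generic body. When called with constant 'repeat' and 'has_alpha'
 * arguments, the compiler can inline it and drop the unused branches. */
static inline uint32_t fetch_nearest(const uint32_t *row, int width, int x,
                                     repeat_mode_t repeat, int has_alpha)
{
    uint32_t p;

    if (repeat == REPEAT_NORMAL) {
        x %= width;
        if (x < 0)
            x += width;
    } else if (repeat == REPEAT_PAD) {
        if (x < 0)
            x = 0;
        else if (x >= width)
            x = width - 1;
    } else if (x < 0 || x >= width) {
        return 0; /* REPEAT_NONE: transparent outside the image */
    }

    p = row[x];
    /* x8r8g8b8 has no alpha channel, so treat it as fully opaque. */
    return has_alpha ? p : (p | 0xff000000u);
}

/* Instantiates one specialized fetcher per format/repeat combination. */
#define MAKE_FETCHER(name, repeat, has_alpha)                           \
    static uint32_t name(const uint32_t *row, int width, int x)         \
    {                                                                   \
        return fetch_nearest(row, width, x, repeat, has_alpha);         \
    }

MAKE_FETCHER(fetch_pad_a8r8g8b8, REPEAT_PAD, 1)
MAKE_FETCHER(fetch_none_x8r8g8b8, REPEAT_NONE, 0)
MAKE_FETCHER(fetch_normal_x8r8g8b8, REPEAT_NORMAL, 0)
```

The design keeps a single readable implementation while still producing branch-free specialized functions for each case.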
Nov 12, 2010

Improve conical gradients opacity check · da0176e8

Andrea Canciani authored 14 years ago

Conical gradients are completely opaque if all of their stops are
opaque and the repeat mode is not 'none'.
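
The rule stated above can be written as a small predicate. This is a sketch under assumptions: `stop_t`, `repeat_mode_t` and `conical_is_opaque` are hypothetical names, alpha is taken to be a 16-bit value where 0xffff means opaque, and a gradient without stops is treated as not opaque.

```c
#include <stdint.h>

/* Hypothetical types for this sketch. */
typedef enum { REPEAT_NONE, REPEAT_NORMAL, REPEAT_PAD, REPEAT_REFLECT } repeat_mode_t;

typedef struct {
    double   offset; /* position of the stop, 0.0 to 1.0 */
    uint16_t alpha;  /* 0xffff means fully opaque */
} stop_t;

/* Returns 1 if a conical gradient with these stops is completely opaque. */
static int conical_is_opaque(const stop_t *stops, int n_stops,
                             repeat_mode_t repeat)
{
    if (repeat == REPEAT_NONE || n_stops <= 0)
        return 0;

    for (int i = 0; i < n_stops; i++) {
        if (stops[i].alpha != 0xffff)
            return 0;
    }
    return 1;
}
```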

da0176e8

Fix opacity check · 151f2554

Andrea Canciani authored 14 years ago

Radial gradients are "conical", thus they can have some non-opaque
parts even if all of their stops are completely opaque.

To guarantee that a radial gradient is actually opaque, it needs to
also have one of the two circles containing the other one. In this
case when extrapolating, the whole plane is completely covered (as
explained in the comment in pixman-radial-gradient.c).
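
The containment condition can be checked without square roots. In the sketch below, `circle_t`, `circle_contains` and `radial_is_opaque` are hypothetical names, and `all_stops_opaque` stands for the result of the stop check that is assumed to be done elsewhere. Circle A contains circle B when the distance between the centers plus B's radius does not exceed A's radius.

```c
/* Hypothetical circle type for this sketch. */
typedef struct {
    double x, y; /* center */
    double r;    /* radius */
} circle_t;

/* Returns 1 if circle a contains circle b, that is
 * dist(a, b) + b.r <= a.r, tested in squared form. */
static int circle_contains(circle_t a, circle_t b)
{
    double dx = b.x - a.x;
    double dy = b.y - a.y;
    double dr = a.r - b.r;

    return dr >= 0.0 && dx * dx + dy * dy <= dr * dr;
}

/* A radial gradient is only known to be opaque if its stops are opaque
 * and one of the two circles contains the other one. */
static int radial_is_opaque(int all_stops_opaque, circle_t c1, circle_t c2)
{
    return all_stops_opaque &&
           (circle_contains(c1, c2) || circle_contains(c2, c1));
}
```

Two circles of equal radius with different centers fail the test, which matches the "conical" case where part of the plane stays uncovered.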

151f2554

Remove unused stop_range field · 19ed415b
Andrea Canciani authored 14 years ago

19ed415b

Nov 10, 2010

ARM: optimization for scaled src_0565_0565 with nearest filter · d8fe87a6

Siarhei Siamashka authored 14 years ago

The performance improvement is only in the ballpark of 5% when
compared against C code built with a reasonably good compiler
(gcc 4.5.1). But gcc 4.4 produces approximately 30% slower code
here, so assembly optimization makes sense to avoid dependency
on the compiler quality and/or optimization options.

Benchmark from ARM11:
    == before ==
    op=1, src_fmt=10020565, dst_fmt=10020565, speed=34.86 MPix/s

    == after ==
    op=1, src_fmt=10020565, dst_fmt=10020565, speed=36.62 MPix/s

Benchmark from ARM Cortex-A8:
    == before ==
    op=1, src_fmt=10020565, dst_fmt=10020565, speed=89.55 MPix/s

    == after ==
    op=1, src_fmt=10020565, dst_fmt=10020565, speed=94.91 MPix/s

d8fe87a6

ARM: NEON optimization for scaled src_0565_8888 with nearest filter · b8007d04

Siarhei Siamashka authored 14 years ago

Benchmark from ARM Cortex-A8 @720MHz:
    == before ==
    op=1, src_fmt=10020565, dst_fmt=20028888, speed=8.99 MPix/s

    == after ==
    op=1, src_fmt=10020565, dst_fmt=20028888, speed=76.98 MPix/s

    == unscaled ==
    op=1, src_fmt=10020565, dst_fmt=20028888, speed=137.78 MPix/s

b8007d04

ARM: NEON optimization for scaled src_8888_0565 with nearest filter · 2e855a2b

Siarhei Siamashka authored 14 years ago

Benchmark from ARM Cortex-A8 @720MHz:
    == before ==
    op=1, src_fmt=20028888, dst_fmt=10020565, speed=42.51 MPix/s

    == after ==
    op=1, src_fmt=20028888, dst_fmt=10020565, speed=55.61 MPix/s

    == unscaled ==
    op=1, src_fmt=20028888, dst_fmt=10020565, speed=117.99 MPix/s

2e855a2b

ARM: NEON optimization for scaled over_8888_0565 with nearest filter · 4a09e472

Siarhei Siamashka authored 14 years ago

Benchmark from ARM Cortex-A8 @720MHz:
    == before ==
    op=3, src_fmt=20028888, dst_fmt=10020565, speed=10.29 MPix/s

    == after ==
    op=3, src_fmt=20028888, dst_fmt=10020565, speed=36.36 MPix/s

    == unscaled ==
    op=3, src_fmt=20028888, dst_fmt=10020565, speed=79.40 MPix/s

4a09e472

ARM: NEON optimization for scaled over_8888_8888 with nearest filter · 67a4991f

Siarhei Siamashka authored 14 years ago

Benchmark from ARM Cortex-A8 @720MHz:
    == before ==
    op=3, src_fmt=20028888, dst_fmt=20028888, speed=12.73 MPix/s

    == after ==
    op=3, src_fmt=20028888, dst_fmt=20028888, speed=28.75 MPix/s

    == unscaled ==
    op=3, src_fmt=20028888, dst_fmt=20028888, speed=53.03 MPix/s

67a4991f

ARM: performance tuning of NEON nearest scaled pixel fetcher · 0b56244a

Siarhei Siamashka authored 14 years ago

Interleaving the use of NEON registers helps to avoid some stalls
in the NEON pipeline and provides a small performance improvement.

0b56244a

ARM: macro template in C code to simplify using scaled fast paths · 6e76af0d

Siarhei Siamashka authored 14 years ago

This template can be used to instantiate scaled fast path functions
by providing main loop code and calling NEON assembly optimized
scanline processing functions from it. Another macro can be used
to simplify adding entries to fast path tables.
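
A simplified sketch of the template idea follows. Everything here is an assumption made for illustration: the macro name `DEFINE_NEAREST_SCALED`, the 16.16 fixed-point type and the function signatures are hypothetical, and a plain C scanline function stands in for the NEON assembly one. The macro supplies the main loop over destination rows and delegates each row to the scanline function.

```c
#include <stdint.h>

typedef int32_t fixed_16_16_t; /* 16.16 fixed point, hypothetical name */

/* Stand-in for an assembly-optimized scanline function: copies w pixels,
 * stepping through the source row by unit_x per destination pixel. */
static void scanline_src_8888_8888(uint32_t *dst, const uint32_t *src,
                                   int w, fixed_16_16_t vx,
                                   fixed_16_16_t unit_x)
{
    while (w-- > 0) {
        *dst++ = src[vx >> 16];
        vx += unit_x;
    }
}

/* Template: generates a nearest-scaled fast path function whose main
 * loop picks the source row for every destination row and then calls
 * the given scanline function. Strides are in pixels. */
#define DEFINE_NEAREST_SCALED(name, scanline_fn, pix_t)                  \
    static void name(pix_t *dst, int dst_stride, int dst_w, int dst_h,   \
                     const pix_t *src, int src_stride,                   \
                     fixed_16_16_t vx, fixed_16_16_t vy,                 \
                     fixed_16_16_t unit_x, fixed_16_16_t unit_y)         \
    {                                                                    \
        for (int y = 0; y < dst_h; y++) {                                \
            const pix_t *src_row = src + (vy >> 16) * src_stride;        \
            scanline_fn(dst + y * dst_stride, src_row, dst_w,            \
                        vx, unit_x);                                     \
            vy += unit_y;                                                \
        }                                                                \
    }

DEFINE_NEAREST_SCALED(fast_scaled_src_8888_8888,
                      scanline_src_8888_8888, uint32_t)
```

With a step of 0x8000 (0.5 in 16.16 fixed point) in both directions, a 2x2 source is stretched to a 4x4 destination in which every source pixel appears as a 2x2 block.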

6e76af0d

ARM: nearest scaling support for NEON scanline compositing functions · 88014a0e

Siarhei Siamashka authored 14 years ago

Now it is possible to generate scanline processing functions
for the case when the source image is scaled with NEAREST filter.

Only 16bpp and 32bpp pixel formats are supported for now, but
others can be added later when needed. All the existing NEON
fast path functions should be quite easy to reuse for implementing
fast paths which can work with scaled source images.

88014a0e

ARM: NEON: source image pixel fetcher can be overridden now · 324712e4

Siarhei Siamashka authored 14 years ago

Added a special macro 'pixld_src' which is now responsible for fetching
pixels from the source image. Right now it just passes all its arguments
directly to 'pixld' macro, but it can be used in the future to provide
a special pixel fetcher for implementing nearest scaling.

The 'pixld_src' macro has a lot of arguments which define its behavior. But
for each particular fast path implementation, we already know the NEON
register allocation and how many pixels are processed in a single block.
That's why a higher level macro 'fetch_src_pixblock' is also introduced
(it's easier to use because it has no arguments) and used everywhere
in 'pixman-arm-neon-asm.S' instead of VLD instructions.

This patch does not introduce any functional changes and the resulting code
in the compiled object file is exactly the same.

324712e4

ARM: fix 'vld1.8'->'vld1.32' typo in add_8888_8888 NEON fast path · cb3f1830

Siarhei Siamashka authored 14 years ago

This was mostly harmless and had no effect on little endian systems.
But wrong vector element size is at least inconsistent and also
can theoretically cause problems on big endian ARM systems.

cb3f1830

Nov 05, 2010

Do CPU features detection from 'constructor' function when compiled with gcc · fed4a2fd

Siarhei Siamashka authored 14 years ago

The 'constructor' attribute, supported since gcc 2.7, allows a library
to have a constructor function for its initialization. This eliminates
an extra branch for each composite operation and also helps to avoid
complaints from race condition detection tools like helgrind.

Other compilers may or may not support this attribute properly.
Ideally, a compiler should fail to compile code with an unknown
attribute, so the configure check should do the right job. In
practice, problems are still possible. Fortunately, such problems
should be quite easy to find because a NULL pointer dereference should
happen almost immediately if the constructor fails to run.

clang 2.7:
  supports __attribute__((constructor)) properly and pretends to be gcc

tcc 0.9.25:
  ignores __attribute__((constructor)), but does not pretend to be gcc
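
The approach can be sketched as follows. This is not the pixman source: `HAVE_ATTRIBUTE_CONSTRUCTOR` stands in for whatever macro the configure check defines, and the type and function names are hypothetical. With the attribute available, detection runs once at load time and the accessor needs no branch; without it, the accessor falls back to lazy initialization.

```c
#include <stddef.h>

/* Stand-in for the result of a configure check (assumption). */
#if defined(__GNUC__)
#define HAVE_ATTRIBUTE_CONSTRUCTOR 1
#endif

/* Hypothetical implementation descriptor. */
typedef struct {
    const char *name;
} implementation_t;

static implementation_t generic_implementation = { "generic" };
static implementation_t *global_implementation; /* NULL until initialized */

/* Placeholder for the real CPU feature detection. */
static implementation_t *choose_implementation(void)
{
    return &generic_implementation;
}

#ifdef HAVE_ATTRIBUTE_CONSTRUCTOR
/* Runs automatically before main() or when the library is loaded. */
static void __attribute__((constructor)) library_constructor(void)
{
    global_implementation = choose_implementation();
}
#endif

static implementation_t *get_implementation(void)
{
#ifndef HAVE_ATTRIBUTE_CONSTRUCTOR
    /* Fallback: the extra branch on every call that the constructor avoids. */
    if (!global_implementation)
        global_implementation = choose_implementation();
#endif
    return global_implementation;
}
```

If a compiler accepts the attribute but never runs the constructor, `get_implementation()` returns NULL and the first use fails immediately, which is the easy-to-spot failure mode mentioned above.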

fed4a2fd

Delete the source_image_t struct. · 99699771
Søren Sandmann Pedersen authored 14 years ago
It serves no purpose anymore now that the source_class_t field is gone.
99699771

[mmx] Mark some of the output variables as early-clobber. · f405b407

Søren Sandmann Pedersen authored 14 years ago


GCC assumes that input variables in inline assembly are fully consumed
before any output variable is written. This means it may allocate the
variables in the same register unless the output variables are marked
as early-clobber.

From Jeremy Huddleston:

    I noticed a problem building pixman with clang and reported it to
    the clang developers.  They responded back with a comment about
    the inline asm in pixman-mmx.c and suggested a fix:

    """
    Incidentally, Jeremy, in the asm that reads
    __asm__ (
    "movq %7, %0\n"
    "movq %7, %1\n"
    "movq %7, %2\n"
    "movq %7, %3\n"
    "movq %7, %4\n"
    "movq %7, %5\n"
    "movq %7, %6\n"
    : "=y" (v1), "=y" (v2), "=y" (v3),
      "=y" (v4), "=y" (v5), "=y" (v6), "=y" (v7)
    : "y" (vfill));

    all the output operands except the last one should be marked as
    earlyclobber ("=&y"). This is working by accident with gcc.
    """

Cc: jeremyhu@apple.com
Reviewed-by: Matt Turner <mattst88@gmail.com>
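
A small illustration of the early-clobber rule, separate from the MMX code in question. The function `inc_and_double` is a hypothetical example using general-purpose registers on x86, with a plain C fallback elsewhere. The first instruction writes one output while the input is still needed by the second instruction, so that output has to be marked "=&r"; the last output is written by the final read of the input and does not need the marker.

```c
/* Computes *a = v + 1 and *b = v + v. */
static void inc_and_double(unsigned long v, unsigned long *a, unsigned long *b)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned long x, y;

    __asm__ ("lea 1(%2), %0\n\t"    /* x = v + 1; v is still needed below */
             "lea (%2,%2), %1\n\t"  /* y = v + v; this is the last read of v */
             : "=&r" (x),  /* early-clobber: must not share a register with v */
               "=r" (y)    /* written last, so it may share a register with v */
             : "r" (v));
    *a = x;
    *b = y;
#else
    /* Portable fallback for other compilers and architectures. */
    *a = v + 1;
    *b = v + v;
#endif
}
```

Without the `&`, the compiler would be allowed to place `x` and `v` in the same register, and the second instruction would then read the already modified value.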

f405b407

Remove workaround for a bug in the 1.6 X server. · 9c19a85b

Søren Sandmann Pedersen authored 14 years ago

There used to be a bug in the X server where it would rely on
out-of-bounds accesses when it was asked to composite with a
window as the source. It would create a pixman image pointing
to some bogus position in memory, but then set a clip region
to the position where the actual bits were.

Due to a bug in old versions of pixman, where it would not clip
against the image bounds when a clip region was set, this would
actually work. So when the pixman bug was fixed, a workaround was
added to allow certain out-of-bound accesses.

However, the 1.6 X server is so old now that we can remove this
workaround. This does mean that if you update pixman to 0.22 or later,
you will need to use a 1.7 X server or later.

9c19a85b

Nov 01, 2010
- Fixed broken configure check for __thread support · 56748ea9
  Siarhei Siamashka authored 14 years ago
  
  Somehow the patch from [1] was not applied correctly, fixing that.
  
  [1] http://lists.cairographics.org/archives/cairo/2010-September/020826.html
  56748ea9
- COPYING: Stop saying that a modification is currently under discussion. · ecc36129
  Søren Sandmann Pedersen authored 14 years ago
  
  Also put the copyright text into a C comment for easier cut and paste.
  ecc36129
