- Jan 30, 2013
-
-
Søren Sandmann Pedersen authored
-
Søren Sandmann Pedersen authored
In c2cb303d, return_if_fail()s were added to prevent the trapezoid rasterizers from being called with non-alpha formats. However, stress-test actually does call the rasterizers with non-alpha formats, but because _pixman_log_error() is disabled in versions with an odd minor number, the errors never materialized. Fix this by changing the argument to random format to an enum of three values DONT_CARE, PREFER_ALPHA, or REQUIRE_ALPHA, and then in the switch that calls the trapezoid rasterizers, pass the appropriate value for the function in question.
-
- Jan 29, 2013
-
-
Søren Sandmann Pedersen authored
The old one belongs to the email address sandmann@daimi.au.dk, which doesn't work anyore. Also use gpg to get the name and address for the "(Signed by ...)" line since that works more reliably for me than using git.
-
Ben Avison authored
In particular this affects single-core ARMs (e.g. ARM11, Cortex-A8), which are usually configured this way. For other CPUs, this should only add a constant time, which will be cancelled out by the EXCLUDE_OVERHEAD runs. The problems were caused by cachelines becoming permanently evicted from the cache, because the code that was intended to pull them back in again on each iteration assumed too long a cache line (for the L1 test) or failed to read memory beyond the first pixel row (for the L2 test). Also, the reloading of the source buffer was unnecessary. These issues were identified by Siarhei in this post: http://lists.freedesktop.org/archives/pixman/2013-January/002543.html
-
Søren Sandmann Pedersen authored
The clip_color() function has some checks to avoid division by zero, but they are done by comparing the value to 4 * FLT_EPSILON, where a better choice is the IS_ZERO() macro that compares to +/- FLT_MIN. In set_sat(), the check is that *max > *min before dividing by *max - *min, but that has the potential problem that interactions between GCC optimizions and 80 bit x87 registers could mean that (*max > *min) is true in 80 bits, but (*max - *min) is 0 in 32 bits, so that the division by zero is not prevented. Using IS_ZERO() here as well prevents this.
-
Improved by adding preloads, combining writes and using the SEL instruction. add_8_8 Before After Mean StdDev Mean StdDev Confidence Change L1 62.1 0.2 543.4 12.4 100.0% +774.9% L2 38.7 0.4 116.8 1.7 100.0% +201.8% M 40.0 0.1 110.1 0.5 100.0% +175.3% HT 30.9 0.2 43.4 0.5 100.0% +40.4% VT 30.6 0.3 39.2 0.5 100.0% +28.0% R 21.3 0.2 35.4 0.4 100.0% +66.6% RT 8.6 0.2 10.2 0.3 100.0% +19.4% over_8888_8888 Before After Mean StdDev Mean StdDev Confidence Change L1 32.3 0.1 38.0 0.2 100.0% +17.7% L2 15.9 0.4 30.6 0.5 100.0% +92.8% M 13.3 0.0 25.6 0.0 100.0% +92.9% HT 10.5 0.1 15.5 0.1 100.0% +47.1% VT 10.4 0.1 14.6 0.1 100.0% +40.8% R 10.3 0.1 15.8 0.1 100.0% +53.3% RT 6.0 0.1 7.6 0.1 100.0% +25.9% over_8888_n_8888 Before After Mean StdDev Mean StdDev Confidence Change L1 17.6 0.1 21.0 0.1 100.0% +19.2% L2 11.2 0.2 19.2 0.1 100.0% +71.2% M 10.2 0.0 19.6 0.0 100.0% +92.6% HT 8.4 0.0 11.9 0.1 100.0% +41.7% VT 8.3 0.0 11.3 0.1 100.0% +36.4% R 8.3 0.0 11.8 0.1 100.0% +43.1% RT 5.1 0.1 6.2 0.1 100.0% +21.3% over_n_8_8888 Before After Mean StdDev Mean StdDev Confidence Change L1 17.5 0.1 22.8 0.8 100.0% +30.1% L2 14.2 0.3 21.7 0.2 100.0% +52.6% M 12.0 0.0 22.3 0.0 100.0% +84.8% HT 10.5 0.1 14.1 0.1 100.0% +34.5% VT 10.0 0.1 13.5 0.1 100.0% +35.3% R 9.4 0.0 12.9 0.2 100.0% +37.7% RT 5.5 0.1 6.5 0.2 100.0% +19.2%
-
There was no previous attempt at accelerating these specifically for ARMv6. src_x888_8888 Before After Mean StdDev Mean StdDev Confidence Change L1 96.7 0.5 270.4 2.6 100.0% +179.5% L2 44.6 2.7 110.6 9.7 100.0% +148.0% M 26.9 0.1 87.6 0.5 100.0% +226.1% HT 19.3 0.2 37.5 0.4 100.0% +93.7% VT 18.6 0.1 33.7 0.4 100.0% +81.6% R 18.4 0.1 32.2 0.3 100.0% +75.2% RT 9.2 0.2 12.1 0.3 100.0% +31.4% src_0565_8888 Before After Mean StdDev Mean StdDev Confidence Change L1 37.0 0.3 66.9 0.2 100.0% +80.8% L2 30.3 0.2 55.9 0.3 100.0% +84.4% M 25.9 0.0 62.3 0.2 100.0% +140.3% HT 15.2 0.1 33.1 0.3 100.0% +116.9% VT 15.1 0.1 30.7 0.3 100.0% +103.6% R 14.2 0.1 27.6 0.3 100.0% +94.0% RT 6.0 0.1 11.2 0.3 100.0% +87.2%
-
These are usable either as various composite operations, or via the top-level function pixman_blt() which now does some blitting for the first time on an ARMv6 platform (previously it just returned FALSE). src_8888_8888 Before After Mean StdDev Mean StdDev Confidence Change L1 414.5 9.4 445.8 3.6 100.0% +7.6% L2 93.3 20.7 114.5 12.9 100.0% +22.7% M 57.0 0.2 89.2 0.5 100.0% +56.4% HT 28.7 0.3 39.6 0.4 100.0% +37.9% VT 25.5 0.2 35.3 0.4 100.0% +38.4% R 20.1 0.1 33.8 0.3 100.0% +67.8% RT 7.8 0.2 12.7 0.4 100.0% +62.7% src_0565_0565 Before After Mean StdDev Mean StdDev Confidence Change L1 397.4 6.1 412.5 5.2 100.0% +3.8% L2 143.2 10.9 141.9 6.5 68.9% -0.9% (insignificant) M 90.7 0.4 133.5 0.7 100.0% +47.1% HT 38.6 0.3 53.7 0.7 100.0% +39.0% VT 33.0 0.3 47.3 0.6 100.0% +43.3% R 25.7 0.2 42.1 0.5 100.0% +64.1% RT 8.0 0.2 13.3 0.3 100.0% +65.6% src_8_8 Before After Mean StdDev Mean StdDev Confidence Change L1 716.5 9.8 768.2 20.4 100.0% +7.2% L2 246.2 12.7 260.5 8.8 100.0% +5.8% M 146.8 0.7 227.9 0.7 100.0% +55.2% HT 44.9 0.6 62.1 1.0 100.0% +38.2% VT 35.6 0.4 53.4 0.7 100.0% +50.0% R 29.7 0.3 48.2 0.6 100.0% +62.2% RT 8.6 0.2 12.9 0.4 100.0% +49.3%
-
Note that this also effectively accelerates src_n_8888, src_n_0565 and src_n_8 composite types, because of the fast paths in pixman-fast-path.c implemented by fast_composite_solid_fill(), which end up dispatching these platform-specific fill routines. src_n_8888 Before After Mean StdDev Mean StdDev Confidence Change L1 157.3 1.1 574.2 8.7 100.0% +265.0% L2 94.2 0.5 364.8 4.2 100.0% +287.3% M 92.7 0.4 358.7 1.1 100.0% +287.1% HT 68.5 0.9 133.6 4.0 100.0% +95.2% VT 61.3 0.8 111.8 2.6 100.0% +82.4% R 61.1 0.9 108.7 2.8 100.0% +78.1% RT 24.6 1.0 28.6 1.6 100.0% +16.0% src_n_0565 Before After Mean StdDev Mean StdDev Confidence Change L1 157.4 1.0 983.1 38.5 100.0% +524.6% L2 93.6 0.5 696.0 14.3 100.0% +643.4% M 92.7 0.4 680.5 1.0 100.0% +634.0% HT 68.3 0.9 160.3 6.6 100.0% +134.6% VT 61.1 0.8 130.1 3.4 100.0% +112.9% R 61.0 0.8 125.4 4.1 100.0% +105.7% RT 24.9 1.3 29.5 1.5 100.0% +18.2% src_n_8 Before After Mean StdDev Mean StdDev Confidence Change L1 154.7 1.0 1324.4 48.5 100.0% +756.3% L2 92.4 0.4 1178.4 10.9 100.0% +1175.6% M 92.9 0.4 1275.7 2.1 100.0% +1273.5% HT 68.2 1.0 169.8 5.5 100.0% +149.0% VT 61.2 1.0 138.5 3.6 100.0% +126.3% R 61.3 0.9 130.1 3.8 100.0% +112.4% RT 25.5 1.3 29.2 1.9 100.0% +14.6%
-
Move the entire contents of pixman-arm-simd-asm.S to a new file; ultimately this will only retain the scaled operations, so it is named pixman-arm-simd-asm-scaled.S. Added new header file pixman-arm-simd-asm.h, containing the macros which are the basis of all the new ARMv6 implementations, although at this point in the series, nothing uses them and the library should be binary-identical.
-
- Jan 28, 2013
-
-
Søren Sandmann Pedersen authored
For large upscalings the level of subsampling for the filter has a quite visible effect, so make it settable in the UI so that people can experiment with various values.
-
- Jan 27, 2013
-
-
Siarhei Siamashka authored
Old functions pixman_transform_point() and pixman_transform_point_3d() now become just wrappers for pixman_transform_point_31_16() and pixman_transform_point_31_16_3d(). Eventually their uses should be completely eliminated in the pixman code and replaced with their extended range counterparts. This is needed in order to be able to correctly handle any matrices and parameters that may come to pixman from the code responsible for XRender implementation.
-
Siarhei Siamashka authored
This test uses __float128 data type when it is available for implementing a "perfect" reference implementation. The output from from pixman_transform_point_31_16() and pixman_transform_point_31_16_affine() is compared with the reference implementation to make sure that the rounding errors may only show up in a single least significant bit. The platforms and compilers, which do not support __float128 data type, can rely on crc32 checksum for the pseudorandom transform results.
-
Siarhei Siamashka authored
GCC supports 128-bit floating point data type on some platforms (including but not limited to x86 and x86-64). This may be useful for tests, which need prefectly accurate reference implementations of certain algorithms.
-
Siarhei Siamashka authored
The following new functions are added: pixman_transform_point_31_16_3d() - Calculates the product of a matrix and a vector multiplication. pixman_transform_point_31_16() - Calculates the product of a matrix and a vector multiplication. Then converts the homogenous resulting vector [x, y, z] to cartesian [x', y', 1] variant, where x' = x / z, and y' = y / z. pixman_transform_point_31_16_affine() - A faster sibling of the other two functions, which assumes affine transformation, where the bottom row of the matrix is [0, 0, 1] and the last element of the input vector is set to 1. These functions transform a point with 31.16 fixed point coordinates from the destination space to a point with 48.16 fixed point coordinates in the source space. The results are accurate and the rounding errors may only show up in the least significant bit. No overflows are possible for the affine transformations as long as the input data is provided in 31.16 format. In the case of projective transformations, some output values may be not representable using 48.16 fixed point format. In this case the results are clamped to return maximum or minimum 48.16 values (so that the caller can at least handle NONE and PAD repeats correctly).
-
Siarhei Siamashka authored
Processing two pixels at once is used to reduce the number of arithmetic operations. The speedup relative to the generic fetch_scanline_r5g6b5() from "pixman-access.c" (pixman was compiled with gcc 4.7.2): MIPS 74K 480MHz : 20.32 MPix/s -> 26.47 MPix/s ARM11 700MHz : 34.95 MPix/s -> 38.22 MPix/s ARM Cortex-A8 1000MHz : 87.44 MPix/s -> 100.92 MPix/s ARM Cortex-A9 1700MHz : 150.95 MPix/s -> 158.13 MPix/s ARM Cortex-A15 1700MHz : 148.91 MPix/s -> 155.42 MPix/s IBM Cell PPU 3200MHz : 75.29 MPix/s -> 98.33 MPix/s Intel Core i7 2800MHz : 257.02 MPix/s -> 376.93 MPix/s That's the performance for C code (SIMD and assembly optimizations are disabled via PIXMAN_DISABLE environment variable).
-
Siarhei Siamashka authored
Unrolling loops improves performance, so just use it here. Also GCC can't properly optimize this code for RISC processors and allocate 0x1F001F constant in a register. Because this constant is too large to be represented as an immediate operand in instructions, GCC inserts some redundant arithmetics. This problem can be workarounded by explicitly using a variable for 0x1F001F constant and also initializing it by a read from another volatile variable. In this case GCC is forced to allocate a register for it, because it is not seen as a constant anymore. The speedup relative to the generic store_scanline_r5g6b5() from "pixman-access.c" (pixman was compiled with gcc 4.7.2): MIPS 74K 480MHz : 33.22 MPix/s -> 43.42 MPix/s ARM11 700MHz : 50.16 MPix/s -> 78.23 MPix/s ARM Cortex-A8 1000MHz : 117.75 MPix/s -> 196.34 MPix/s ARM Cortex-A9 1700MHz : 177.04 MPix/s -> 320.32 MPix/s ARM Cortex-A15 1700MHz : 231.44 MPix/s -> 261.64 MPix/s IBM Cell PPU 3200MHz : 130.25 MPix/s -> 145.61 MPix/s Intel Core i7 2800MHz : 502.21 MPix/s -> 721.73 MPix/s That's the performance for C code (SIMD and assembly optimizations are disabled via PIXMAN_DISABLE environment variable).
-
Siarhei Siamashka authored
Adding specialized iterators for r5g6b5 color format allows us to work on fine tuning performance of r5g6b5 fetch/write-back operations in the pixman general "fetch -> combine -> store" pipeline. These iterators also make "src_x888_0565" fast path redundant, so it can be removed.
-
Chris Wilson authored
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
-
Chris Wilson authored
We should always have at least a C combiner available, so we never expect the search to fail. If it does, emit an error and return a dummy function. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
-
Chris Wilson authored
We never expect to fail to find the appropriate function as the general_composite_rect should always match. So if somehow we fallthrough the search, emit a _pixman_log_error() and return a dummy function. Note that we remove some conditionals and a level of indentation hence a large amount of code movement. This also reveals that in a few places we are duplicating stack variables that can be eliminated later. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
-
Chris Wilson authored
Based on the existing sse2_8888_n_8888 nearest scaling routines. fishbowl on an i5-2500: 60.9s -> 56.9s Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
-
Chris Wilson authored
This path is being exercised by compositing of trapezoids for clipmasks, for instance as used in the firefox-asteroids cairo-trace. IVB i7-3720qm ./tests/lowlevel-blt-bench add_n_8_8888: reference memcpy speed = 14846.7MB/s (3711.7MP/s for 32bpp fills) before: L1: 681.10 L2: 735.14 M:701.44 ( 28.35%) HT:283.32 VT:213.23 R:208.93 RT: 77.89 ( 793Kops/s) after: L1: 992.91 L2:1017.33 M:982.58 ( 39.88%) HT:458.93 VT:332.32 R:326.13 RT:136.66 (1287Kops/s) Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
-
Chris Wilson authored
This path is being exercised by inplace compositing of trapezoids, for instance as used in the firefox-asteroids cairo-trace. IVB i3-3720qm ./tests/lowlevel-blt-bench add_n_888: reference memcpy speed = 14918.3MB/s (3729.6MP/s for 32bpp fills) before: L1:1752.44 L2:2259.48 M:2215.73 ( 58.80%) HT:589.49 VT:404.04 R:424.69 RT:134.68 (1182Kops/s) after: L1:3931.21 L2:6132.78 M:3440.17 ( 92.24%) HT:1337.70 VT:1357.64 R:1270.27 RT:359.78 (2161Kops/s) Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
-
- Jan 25, 2013
-
-
Jeff Muizelaar authored
Having 4 or fewer bits means we can do two components at a time in a single 32 bit register. Here are the results for firefox-fishtank on a Pandaboard with 4.6.3 and PIXMAN_DISABLE="arm-neon" Before: [ # ] backend test min(s) median(s) stddev. count [ 0] image t-firefox-fishtank 7.841 7.910 0.70% 6/6 After: [ # ] backend test min(s) median(s) stddev. count [ 0] image t-firefox-fishtank 6.951 6.995 1.11% 6/6
-
Ben Avison authored
This adds two extra tests, src_n_8 and src_8_8, which I have been using to benchmark my ARMv6 changes. I'd also like to propose that it requires an exact test name as the executable's argument, as achieved by this strstr to strcmp change. Without this, it is impossible to only benchmark (for example) add_8_8, add_n_8 or src_n_8, due to those also being substrings of many other test names.
-
- Jan 23, 2013
-
-
Søren Sandmann Pedersen authored
With the operator_name() and format_name() functions there is no longer any reason for composite.c to have its own table of format and operator names.
-
Søren Sandmann Pedersen authored
This function returns the name of the given format code, which is useful for printing out debug information. The function is written as a switch without a default value so that the compiler will warn if new formats are added in the future. The fake formats used in the fast path tables are also recognized. The function is used in alpha_map.c, where it replaces an existing format_name() function, and in blitters-test.c, affine-test.c, and scaling-test.c.
-
Søren Sandmann Pedersen authored
This function returns the name of the given operator, which is useful for printing out debug information. The function is done as a switch without a default value so that the compiler will warn if new operators are added in the future. The function is used in affine-test.c, scaling-test.c, and blitters-test.c.
-
Søren Sandmann Pedersen authored
Ben Avison pointed out here: http://lists.freedesktop.org/archives/pixman/2013-January/002485.html that there isn't really any documentation about how to submit patches to pixman. This patch adds some information to the README file. v2: Incorporate some comments from Ben Avison v3: Change gitweb URL to cgit
-
Matt Turner authored
INCLUDES has been deprecated starting with automake 1.13. Convert all occurrences with the recommended AM_CPPFLAGS replacement.
-
Matt Turner authored
-
- Jan 22, 2013
-
-
Nemanja Lukic authored
- over_reverse_n_8888 - in_n_8_8 Performance numbers before/after on MIPS-74kc @ 1GHz: lowlevel-blt-bench results Referent (before): over_reverse_n_8888 = L1: 19.42 L2: 19.07 M: 15.38 ( 40.80%) HT: 13.35 VT: 13.10 R: 12.92 RT: 8.27 ( 49Kops/s) in_n_8_8 = L1: 21.20 L2: 22.86 M: 21.42 ( 14.21%) HT: 15.97 VT: 15.69 R: 15.47 RT: 8.00 ( 48Kops/s) Optimized: over_reverse_n_8888 = L1: 60.09 L2: 47.87 M: 28.65 ( 76.02%) HT: 23.58 VT: 22.51 R: 21.99 RT: 12.28 ( 60Kops/s) in_n_8_8 = L1: 89.38 L2: 86.07 M: 65.48 ( 43.44%) HT: 44.64 VT: 41.50 R: 40.77 RT: 16.94 ( 66Kops/s)
-
Nemanja Lukic authored
- out_reverse_8_0565 - out_reverse_8_8888 Performance numbers before/after on MIPS-74kc @ 1GHz: lowlevel-blt-bench results Referent (before): out_reverse_8_0565 = L1: 14.29 L2: 13.58 M: 12.14 ( 24.16%) HT: 9.23 VT: 9.12 R: 8.84 RT: 4.75 ( 36Kops/s) out_reverse_8_8888 = L1: 27.46 L2: 23.24 M: 17.41 ( 57.73%) HT: 12.61 VT: 12.47 R: 11.79 RT: 5.86 ( 41Kops/s) Optimized: out_reverse_8_0565 = L1: 28.24 L2: 25.64 M: 20.63 ( 41.05%) HT: 16.69 VT: 16.14 R: 15.50 RT: 8.69 ( 52Kops/s) out_reverse_8_8888 = L1: 52.78 L2: 41.44 M: 23.50 ( 77.94%) HT: 18.79 VT: 18.16 R: 16.90 RT: 9.11 ( 53Kops/s)
-
- Jan 06, 2013
-
-
Søren Sandmann Pedersen authored
v2: Don't return a pointer to uninitialized memory when the allocation of horz and vert fails, but allocation of params doesn't.
-
Søren Sandmann Pedersen authored
The noop src iterator already has code to handle solid images, but that code never actually runs currently because it is not possible for an image to have both a format code of PIXMAN_solid and a flag of FAST_PATH_BITS_IMAGE. If these two were to be set at the same time, the fast_composite_tiled_repeat() fast path would trigger for solid images (because it triggers for PIXMAN_any formats, which includes PIXMAN_solid), but for solid images we can usually do better than that fast path. So this patch removes _pixman_solid_fill_iter_init() and instead handles such images (along with repeating 1x1 bits images without an alpha map) in pixman-noop.c. When a 1x1R image is involved in the general composite path, before this patch, it would hit this code in repeat() in pixman-inlines.h: while (*c >= size) *c -= size; while (*c < 0) *c += size; and those loops could run for a huge number of iteratons (proportional to the composite width). For such cases, the performance improvement is really big: ./test/lowlevel-blt-bench -n add_n_8888: Before: add_n_8888 = L1: 3.86 L2: 3.78 M: 1.40 ( 0.06%) HT: 1.43 VT: 1.41 R: 1.41 RT: 1.38 ( 19Kops/s) After: add_n_8888 = L1:1236.86 L2:2468.49 M:1097.88 ( 49.04%) HT:476.49 VT:429.05 R:417.04 RT:155.12 ( 817Kops/s)
-
- Jan 03, 2013
-
-
Automake-1.13 has removed long obsolete AM_CONFIG_HEADER macro ( http://lists.gnu.org/archive/html/automake/2012-12/msg00038.html ) and autoreconf errors out upon seeing it. Attached patch replaces obsolete AM_CONFIG_HEADER with now proper AC_CONFIG_HEADERS.
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
C++ compilers do not define SIZE_MAX. It is also not available if the code is compiled by some C compilers: http://lists.freedesktop.org/archives/pixman/2012-August/002196.html
-
- Dec 20, 2012
-
-
Siarhei Siamashka authored
-