- Apr 30, 2013
-
-
Søren Sandmann Pedersen authored
Essentially all of it is obsolete by now.
-
Nemanja Lukic authored
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Referent (before):
    rpixbuf = L1: 14.63 L2: 13.55 M: 9.91 ( 79.53%) HT: 8.47 VT: 8.32 R: 8.17 RT: 4.90 ( 33Kops/s)
Optimized:
    rpixbuf = L1: 45.69 L2: 37.30 M: 17.24 (138.31%) HT: 15.66 VT: 14.88 R: 13.97 RT: 8.38 ( 44Kops/s)
-
Nemanja Lukic authored
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Referent (before):
    pixbuf = L1: 18.18 L2: 16.47 M: 13.36 (107.27%) HT: 10.16 VT: 10.07 R: 9.84 RT: 5.54 ( 35Kops/s)
Optimized:
    pixbuf = L1: 43.54 L2: 36.02 M: 17.08 (137.09%) HT: 15.58 VT: 14.85 R: 13.87 RT: 8.38 ( 44Kops/s)
-
Nemanja Lukic authored
Add the necessary support to the lowlevel-blt benchmark for benchmarking the pixbuf and rpixbuf fast paths. The bench_composite function now checks for the string "pixbuf" in the test name and, if it is found, uses the same bits for the src and mask images.
-
Nemanja Lukic authored
-
Nemanja Lukic authored
The rounding logic was implemented incorrectly: logical shifts were used instead of the rounding version of the 8-bit shift. The code also used unnecessary multiplications, which can be avoided by packing 4 destination (a8) pixels into one 32-bit register, and there were unnecessary spills to the stack. The code has been rewritten to address these issues. The bug was revealed by increasing the number of iterations in blitters-test.
Performance numbers on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Referent (before):
    in_n_8 = L1: 21.20 L2: 22.86 M: 21.42 ( 14.21%) HT: 15.97 VT: 15.69 R: 15.47 RT: 8.00 ( 48Kops/s)
Optimized (first implementation, with bug):
    in_n_8 = L1: 89.38 L2: 86.07 M: 65.48 ( 43.44%) HT: 44.64 VT: 41.50 R: 40.77 RT: 16.94 ( 66Kops/s)
Optimized (with bug fix, and code revisited):
    in_n_8 = L1: 102.33 L2: 95.65 M: 70.54 ( 46.84%) HT: 48.35 VT: 45.06 R: 43.20 RT: 17.60 ( 66Kops/s)
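The effect of dropping the rounding step from an 8-bit shift can be illustrated in general terms. The following is a minimal Python sketch, not the DSPr2 assembly from this commit. The function names are hypothetical, and the multiply idiom (add 0x80, fold, shift right by 8) is assumed here as a common way to compute a rounded `a * b / 255` on 8-bit values.

```python
def mul_un8_round(a, b):
    """Rounded a * b / 255 for 8-bit a and b: add 0x80 before the shifts."""
    t = a * b + 0x80
    return ((t >> 8) + t) >> 8


def mul_un8_trunc(a, b):
    """Same shifts without the rounding constant (illustrative buggy variant)."""
    t = a * b
    return ((t >> 8) + t) >> 8


# Count the 8-bit input pairs where dropping the rounding term changes the result.
mismatches = sum(
    mul_un8_round(a, b) != mul_un8_trunc(a, b)
    for a in range(256)
    for b in range(256)
)
print("input pairs affected by truncation:", mismatches)
print("255 * 255 rounded:", mul_un8_round(255, 255))
print("255 * 255 truncated:", mul_un8_trunc(255, 255))
```

In this model the truncating variant maps fully opaque times fully opaque to a value below 255, which is the kind of off-by-one error that only shows up in a test once enough random inputs have been tried.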
-
Nemanja Lukic authored
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Referent (before):
    src_0565_8888 = L1: 20.70 L2: 19.22 M: 12.50 ( 49.79%) HT: 10.45 VT: 10.18 R: 9.99 RT: 5.31 ( 31Kops/s)
Optimized:
    src_0565_8888 = L1: 62.98 L2: 53.44 M: 23.07 ( 91.87%) HT: 19.85 VT: 19.15 R: 17.70 RT: 9.68 ( 43Kops/s)
-
Nemanja Lukic authored
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Referent (before):
    over_8888_0565 = L1: 13.22 L2: 12.02 M: 9.77 ( 38.92%) HT: 8.58 VT: 8.35 R: 8.38 RT: 5.78 ( 35Kops/s)
Optimized:
    over_8888_0565 = L1: 26.20 L2: 22.97 M: 15.92 ( 63.40%) HT: 13.33 VT: 13.13 R: 12.72 RT: 7.65 ( 39Kops/s)
-
Nemanja Lukic authored
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Referent (before):
    over_8888_8888 = L1: 19.47 L2: 16.30 M: 11.24 ( 59.69%) HT: 9.54 VT: 9.29 R: 9.47 RT: 6.24 ( 37Kops/s)
Optimized:
    over_8888_8888 = L1: 43.67 L2: 33.30 M: 16.32 ( 86.65%) HT: 14.10 VT: 13.78 R: 12.96 RT: 7.85 ( 39Kops/s)
-
Nemanja Lukic authored
After introducing the new PRNG (pseudorandom number generator), a bug in two DSPr2 routines was revealed. The bug manifested as wrong calculations in the composite and glyph tests, which caused make check to fail for MIPS DSPr2 optimizations.
The bug was in the calculation of:
    *dst = over (src, *dst) when ma == 0xffffffff
In this case src was not negated and shifted right by 24 bits; it was only negated. When implementing this routine in the first place, I misplaced those shifts, which allowed me to combine the code for the over operation and:
    UN8x4_MUL_UN8x4 (s, ma);
    UN8x4_MUL_UN8 (ma, srca);
    ma = ~ma;
    UN8x4_MUL_UN8x4_ADD_UN8x4 (d, ma, s);
So I decided to rewrite that piece of code from scratch. I changed the logic so that the assembly code now mimics the code from pixman-fast-path.c but processes two pixels at a time. This code should be easier to debug and maintain.
The bug was revealed in commit b31a6962. Errors were detected by the composite and glyph tests.
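The role of "negate, then shift right by 24 bits" in the over operation can be sketched with a small model. This is a Python illustration written for this note, not the DSPr2 assembly or the pixman source; it assumes premultiplied a8r8g8b8 pixels, a rounded 8-bit multiply, and a saturating per-channel add, and the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def mul_un8(a, b):
    """Rounded a * b / 255 for 8-bit values."""
    t = a * b + 0x80
    return ((t >> 8) + t) >> 8


def over(src, dst):
    """Model of dst = src + dst * (255 - alpha(src)) / 255, per channel."""
    # Negate src, THEN shift right by 24: this leaves 255 - alpha(src).
    # Negating without the shift leaves a full 32-bit word instead of an 8-bit factor.
    inv_alpha = (~src & MASK32) >> 24
    out = 0
    for shift in (0, 8, 16, 24):
        s = (src >> shift) & 0xFF
        d = (dst >> shift) & 0xFF
        out |= min(mul_un8(d, inv_alpha) + s, 255) << shift
    return out


print(hex(over(0xFF112233, 0x80445566)))  # opaque source
print(hex(over(0x00000000, 0x80445566)))  # fully transparent source
```

With an opaque source the inverse alpha is 0, so the destination contributes nothing; with a fully transparent source the inverse alpha is 255 and the destination passes through unchanged.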
- Apr 28, 2013
-
-
Siarhei Siamashka authored
The old code was calculating horizontal weights for right pixels in the following way (for simplicity assume 8-bit interpolation precision):
    Start with "x = vx" and do increment "x += ux" after each pixel. In this case right pixel weight for interpolation can be calculated as "((x >> 8) ^ 0xFF) + 1", which is the same as "256 - (x >> 8)".
The new code instead:
    Starts with "x = -(vx + 1)", performs increment "x += -ux" after each pixel and calculates right weights as just "(x >> 8) + 1", eliminating the need for XOR operation in the inner loop.
So we have one instruction less on the critical path.
Benchmarks with "lowlevel-blt-bench -b src_8888_8888" using GCC 4.7.2 on x86-64 system and default optimizations:
Intel Core i7 860 (2.8GHz):
    before: src_8888_8888 = L1: 291.37 L2: 288.58 M:285.38
    after:  src_8888_8888 = L1: 319.66 L2: 316.47 M:312.06
Intel Core2 T7300 (2GHz):
    before: src_8888_8888 = L1: 121.95 L2: 118.38 M:118.52
    after:  src_8888_8888 = L1: 128.82 L2: 125.12 M:124.88
Intel Atom N450 (1.67GHz):
    before: src_8888_8888 = L1: 64.25 L2: 62.37 M: 61.80
    after:  src_8888_8888 = L1: 64.23 L2: 62.37 M: 61.82
Inspired by the "sse2_bilinear_interpolation" function (single pixel interpolation) from:
    http://lists.freedesktop.org/archives/pixman/2013-January/002575.html
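The equivalence of the two weight formulas can be checked with a short model. This is a Python sketch under one stated assumption that is not in the commit text: x is held in a 16-bit lane, so "x >> 8" yields an 8-bit value. The function names are hypothetical, and the 0xFFFF masks emulate 16-bit wraparound.

```python
LANE = 0xFFFF  # assumption: x lives in a 16-bit lane


def weight_old(x):
    """Old form: ((x >> 8) ^ 0xFF) + 1, with x stepped as x += ux."""
    return (((x & LANE) >> 8) ^ 0xFF) + 1


def weight_new(neg_x):
    """New form: (x >> 8) + 1, with x started at -(vx + 1) and stepped by -ux."""
    return ((neg_x & LANE) >> 8) + 1


def walk(vx, ux, count):
    """Yield (old, new) weights for `count` consecutive pixels."""
    x = vx
    neg_x = -(vx + 1)
    for _ in range(count):
        yield weight_old(x), weight_new(neg_x)
        x += ux
        neg_x += -ux


pairs = list(walk(vx=12345, ux=777, count=1000))
print("weights agree:", all(old == new for old, new in pairs))
```

The identity holds because -(x + 1) is the bitwise complement of x in two's complement, so its top byte is already the top byte of x XORed with 0xFF.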
-
Siarhei Siamashka authored
The current blitters-test program had difficulties detecting a bug in the over_n_8888_8888_ca implementation for MIPS DSPr2:
    http://lists.freedesktop.org/archives/pixman/2013-March/002645.html
In order to hit the buggy code path, two consecutive mask values had to be equal to 0xFFFFFFFF because of loop unrolling. The current blitters-test generates random images in such a way that each byte has a 25% probability of having the value 0xFF. Hence each 32-bit mask value has a ~0.4% probability of being 0xFFFFFFFF.
Because we are testing many compositing operations with many pixels, encountering at least one 0xFFFFFFFF mask value reasonably fast is not a problem. If a bug related to a 0xFFFFFFFF mask value is artificially introduced into the over_n_8888_8888_ca generic C function, it gets detected at iteration 675591 in blitters-test (out of 2000000).
However, two consecutive 0xFFFFFFFF mask values are much less likely to be generated, so the bug was missed by blitters-test.
This patch addresses the problem by also randomly setting the 32-bit values in images to either 0xFFFFFFFF or 0x00000000 (also with 25% probability). This allows for larger clusters of consecutive 0x00 or 0xFF bytes in images, which may have special shortcuts for handling them in unrolled or SIMD optimized code.
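The probabilities quoted above follow from simple arithmetic. The sketch below is a worked calculation in Python; it assumes that the bytes of an image are generated independently, which the commit text implies but does not state outright.

```python
# Assumption: each byte is independently 0xFF with probability 25%,
# as described for the old blitters-test image generator.
p_byte = 0.25
p_word = p_byte ** 4   # one 32-bit mask value equal to 0xFFFFFFFF
p_pair = p_word ** 2   # two consecutive such mask values

print(f"single 0xFFFFFFFF word: {p_word:.4%}")
print(f"two consecutive words:  {p_pair:.6%}")
print(f"roughly one pixel pair in {round(1 / p_pair)}")
```

The single-word figure matches the "~0.4%" in the commit message, while the two-word case is 256 times rarer, which is consistent with the bug escaping detection.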
-
- Apr 27, 2013
-
-
Stefan Weil authored
They were found by codespell.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
-
- Apr 08, 2013
-
-
Peter Breitenlohner authored
Signed-off-by: Peter Breitenlohner <peb@mppmu.mpg.de>
-
- Mar 16, 2013
-
-
Søren Sandmann Pedersen authored
The computations in pixman-gradient-walker.c currently take place at very limited 8 bit precision which results in quite visible artefacts in gradients. An example is the one produced by demos/linear-gradient which currently looks like this:

    http://i.imgur.com/kQbX8nd.png

With the changes in this commit, the gradient looks like this:

    http://i.imgur.com/nUlyuKI.png

The images are also available here:

    http://people.freedesktop.org/~sandmann/gradients/before.png
    http://people.freedesktop.org/~sandmann/gradients/after.png

This patch computes pixels using floating point, but uses a faster algorithm, which makes up for the loss of performance.

== Theory:

In both the new and the old algorithm, the various gradient implementations compute a parameter x that indicates how far along the gradient the current scanline is. The current algorithm has a cache of the two color stops surrounding the last parameter; those are used in a SIMD-within-register fashion in this way:

    t1 = walker->left_rb * idist + walker->right_rb * dist;

where dist and idist are the distances to the left and right color stops respectively, normalized to the distance between the left and right stops. The normalization (which involves a division) is captured in another cached variable "stepper".

The cached values are recomputed whenever the parameter moves in between two different stops (called "reset" in the implementation).

Because idist and dist are computed in 8 bits only, a lot of information is lost, which is quite visible as the image linked above shows.

The new algorithm caches more information in the following way. When interpolating between stops, the formula to be used is this:

    t = ((x - left) / (right - left));
    result = lc * (1 - t) + rc * t;

where

    - x is the parameter as computed by the main gradient code,
    - left is the position of the left color stop,
    - right is the position of the right color stop
    - lc is the color of the left color stop
    - rc is the color of the right color stop

That formula can also be written like this:

    result = lc * (1 - t) + rc * t
           = lc + (rc - lc) * t
           = lc + (rc - lc) * ((x - left) / (right - left))
           = (rc - lc) / (right - left) * x + lc - (left * (rc - lc)) / (right - left)
           = s * x + b

where

    s = (rc - lc) / (right - left)

and

    b = lc - left * (rc - lc) / (right - left)
      = (lc * (right - left) - left * (rc - lc)) / (right - left)
      = (lc * right - rc * left) / (right - left)

To summarize, setting w = (right - left):

    s = (rc - lc) / w
    b = (lc * right - rc * left) / w
    r = s * x + b

Since s and b only depend on the two active stops, both can be cached so that the computation only needs to do one multiplication and one addition per pixel (followed by premultiplication of the alpha channel). That is, seven multiplications in total, which is the same number as the old SIMD-within-register implementation had.

== Implementation notes:

The new formula described above is implemented in single precision floating point, and the eight divisions necessary to compute the cached values are done by multiplication with the reciprocal of the distance between the color stops.

The alpha values used in the cached computation are scaled by 255.0, whereas the RGB values are kept in the [0, 1] interval. This ensures that after premultiplication, all values will be in the [0, 255] interval.

This scaling is done by first dividing all the channels by 257, and then later on dividing the r, g, b channels by 255. It would be more natural to do all this scaling in only one place, but inexplicably, that results in a (substantial) slowdown on Sandy Bridge with GCC v 4.7.

== Performance impact (median of three runs of radial-perf-test):

== Intel Sandy Bridge, Core i3 @ 1.2GHz

    Before: 0.014553
    After:  0.014410
    Change: 1.0% faster

== AMD Barcelona @ 1.2 GHz

    Before: 0.021735
    After:  0.021328
    Change: 1.9% faster

I.e., slightly faster, though conceivably there could be a negative impact on machines with a bigger difference between integer and floating point performance.

V2:
- Use 's' and 'b' in the variable names instead of 'm' and 'd'. This way they match the explanation above
- Move variable declarations to the top of the function
- Remove unused stepper field
- Some formatting fixes
- Don't pointlessly include pixman-combine32.h
- Don't offset x for each pixel; go back to offsetting left_x and right_x at reset time. The offsets cancel out in the formula above, so there is no impact on the calculations.
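The cached slope/intercept form derived in the theory section of this commit can be sketched for a single color channel. This is an illustrative Python model, not the code in pixman-gradient-walker.c; the function names are hypothetical, and it uses Python floats rather than single precision.

```python
def make_segment(left, right, lc, rc):
    """Cache (s, b) for one channel so that color(x) = s * x + b."""
    inv_w = 1.0 / (right - left)          # one reciprocal replaces two divisions
    s = (rc - lc) * inv_w
    b = (lc * right - rc * left) * inv_w
    return s, b


def lerp_direct(x, left, right, lc, rc):
    """Reference formula: interpolate between the two stops directly."""
    t = (x - left) / (right - left)
    return lc * (1 - t) + rc * t


# Stops at 0.25 and 0.75 with channel values 0.2 and 0.8.
s, b = make_segment(0.25, 0.75, 0.2, 0.8)
for x in (0.25, 0.4, 0.5, 0.75):
    print(x, s * x + b, lerp_direct(x, 0.25, 0.75, 0.2, 0.8))
```

Once s and b are cached for the active pair of stops, each pixel costs one multiplication and one addition per channel until the parameter crosses into a different pair of stops.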
-
- Mar 12, 2013
-
-
Søren Sandmann Pedersen authored
Some upcoming changes to pixman-gradient-walker.c will need this macro.
-
Søren Sandmann Pedersen authored
This benchmark renders one of the radial gradients used in the swfdec-youtube cairo trace 500 times and reports the average time it took. V2: Update .gitignore
-
Søren Sandmann Pedersen authored
This program displays a linear gradient from blue to yellow. Due to limited precision in pixman-gradient-walker.c, it currently has some ugly artefacts that give it a 'brushed metal' appearance. V2: Update .gitignore
-
- Mar 08, 2013
-
-
Behdad Esfahbod authored
-
- Feb 27, 2013
-
-
Nemanja Lukic authored
- src_0888_8888_rev
- src_0888_0565_rev
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Referent (before):
    src_0888_8888_rev = L1: 51.88 L2: 42.00 M: 19.04 ( 88.50%) HT: 15.27 VT: 14.62 R: 14.13 RT: 7.12 ( 45Kops/s)
    src_0888_0565_rev = L1: 31.96 L2: 30.90 M: 22.60 ( 75.03%) HT: 15.32 VT: 15.11 R: 14.49 RT: 6.64 ( 43Kops/s)
Optimized:
    src_0888_8888_rev = L1: 222.73 L2: 113.70 M: 20.97 ( 97.35%) HT: 18.31 VT: 17.14 R: 16.71 RT: 9.74 ( 54Kops/s)
    src_0888_0565_rev = L1: 100.37 L2: 74.27 M: 29.43 ( 97.63%) HT: 22.92 VT: 21.59 R: 20.52 RT: 10.56 ( 56Kops/s)
-
Nemanja Lukic authored
- over_8888_0565
- over_n_8_8
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Referent (before):
    over_8888_0565 = L1: 14.30 L2: 13.22 M: 10.43 ( 41.56%) HT: 12.51 VT: 12.95 R: 11.82 RT: 7.34 ( 49Kops/s)
    over_n_8_8 = L1: 12.77 L2: 16.93 M: 15.03 ( 29.94%) HT: 10.78 VT: 10.72 R: 10.29 RT: 4.92 ( 33Kops/s)
Optimized:
    over_8888_0565 = L1: 26.03 L2: 22.92 M: 15.68 ( 62.43%) HT: 16.19 VT: 16.27 R: 14.93 RT: 8.60 ( 52Kops/s)
    over_n_8_8 = L1: 62.00 L2: 55.17 M: 40.29 ( 80.23%) HT: 26.77 VT: 25.64 R: 24.13 RT: 10.01 ( 47Kops/s)
-
- Feb 15, 2013
-
-
Søren Sandmann Pedersen authored
GdkPixbufs are not premultiplied, so when using them to display pixman images, there are some unnecessary conversions going on: First the image is converted to non-premultiplied, and then GdkPixbuf premultiplies before sending the result to the X server. These conversions may cause the displayed image to not be exactly identical to the original. This patch just uses a cairo image surface instead, which avoids these conversions. Also make the comment about sRGB a little more concise.
-
- Feb 13, 2013
-
-
Ben Avison authored
The source, mask and destination buffers are initialised to 0xCC just after they are allocated. Between each benchmark, there are a pair of memcpys, from the destination buffer to the source buffer and back again (there are no explanatory comments, but presumably this is an effort to flush the caches). However, it has an unintended consequence, which is to change the contents of the buffers on entry to subsequent benchmarks. This means it is not a fair test: for example, with over_n_8888 (featured in the following patches) it reports L2 and even M tests as being faster than the L1 test, because after the L1 test, the source buffer is filled with fully opaque pixels, for which over_n_8888 has a shortcut. The fix here is simply to reverse the order of the memcpys, so src and destination are both filled with 0xCC on entry to all tests.
-
Stefan Weil authored
Some recent code added new type casts from pointer to unsigned long. These type casts result in compiler warnings for systems like MinGW-w64 (64 bit Windows) where sizeof(unsigned long) != sizeof(void *).
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
-
Søren Sandmann Pedersen authored
If we fail to find a composite function, don't update the fast path cache with the dummy compositing function. Also make the error message state that the bug is likely caused by issues with thread local storage.
-
Søren Sandmann Pedersen authored
While releasing 0.29.2 the distcheck run produced a number of error messages that had to be fixed in 349015e1. These were not caught before, so nobody had actually run pixman with debugging turned on. It's not the first time this has happened, see 5b0563f3 for example. So this patch makes the return_if_fail() macros use unlikely() around the expressions and then turns on error logging at all times. The performance hit should be negligible since we were already evaluating the expressions. The place where DEBUG actually does cause a performance hit is in the region selfcheck code, and that will still only be enabled in development snapshots.
-
Søren Sandmann Pedersen authored
When compiling with GCC this macro expands to __builtin_expect((expr), 0). On other compilers, it just expands to (expr).
-
Søren Sandmann Pedersen authored
The check-formats program reveals that the 8 bit pipeline cannot meet the current 0.004 acceptable deviation specified in utils.c, so we have to increase it. Some of the failing pixels were captured in pixel-test, which with this commit now passes.

== a4r4g4b4 DISJOINT_XOR a8r8g8b8 ==

The DISJOINT_XOR operator applied to an a4r4g4b4 source pixel of 0xd0c0 and a destination pixel of 0x5300ea00 results in the exact value:

    fa = (1 - da) / sa = (1 - 0x53 / 255.0) / (0xd / 15.0) = 0.7782
    fb = (1 - sa) / da = (1 - 0xd / 15.0) / (0x53 / 255.0) = 0.4096
    r = fa * (0xc / 15.0) + fb * (0xea / 255.0) = 0.99853

But when computing in 8 bits, we get:

    fa8 = ((255 - 0x53) * 255 + 0xdd / 2) / 0xdd = 0xc6
    fb8 = ((255 - 0xdd) * 255 + 0x53 / 2) / 0x53 = 0x68
    r8 = (fa8 * 0xcc + 127) / 255 + (fb8 * 0xea + 127) / 255 = 0xfd

and

    0xfd / 255.0 = 0.9921568627450981

for a deviation of 0.00637118610187, which we then have to consider acceptable given the current implementation.

By switching to computing the result with

    r = (fa * s + fb * d + 127) / 255

rather than

    r = (fa * s + 127) / 255 + (fb * d + 127) / 255

the deviation would be only 0.00244961747442, so at some point it may be worth doing either this, or switching to floating point for operators that involve divisions.

Note that the conversion from 4 bits to 8 bits does not cause any error in this case because both rounding and bit replication produce an exact result when the number of from-bits divides the number of to-bits.

== a8r8g8b8 OVER r5g6b5 ==

When OVER compositing the a8r8g8b8 pixel 0x0f00c300 with the x14r6g6b6 pixel 0x03c0, the true floating point value of the resulting green channel is:

    0xc3 / 255.0 + (1.0 - 0x0f / 255.0) * (0x0f / 63.0) = 0.9887955

but when compositing 8 bit values, where the 6-bit green channel is converted to 8 bit through bit replication, the 8-bit result is:

    0xc3 + ((255 - 0x0f) * 0x3c + 127) / 255 = 251

which corresponds to a real value of 0.984314. The difference from the true value is 0.004482, which is bigger than the acceptable deviation of 0.004.

So, if we were to compute all the CONJOINT/DISJOINT operators in floating point, or otherwise make them more accurate, the acceptable deviation could be set at 0.0045.

If we were doing the 6-bit conversion with rounding:

    (x / 63.0 * 255.0 + 0.5)

instead of bit replication, the deviation in this particular case would be only 0.0005, so we may want to consider this at some point.
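The arithmetic in this commit message can be reproduced step by step. The following Python sketch recomputes the quoted values using the integer formulas exactly as written in the message; the helper names are hypothetical, and the script is a check of the numbers rather than a copy of pixman's combiner code.

```python
def div_un8(num, den):
    """Rounded 8-bit division as written above: (num * 255 + den / 2) / den."""
    return (num * 255 + den // 2) // den


def mul_un8(a, b):
    """Rounded 8-bit multiplication as written above: (a * b + 127) / 255."""
    return (a * b + 127) // 255


# a4r4g4b4 DISJOINT_XOR a8r8g8b8: source 0xd0c0, destination 0x5300ea00.
sa, s, da, d = 0xDD, 0xCC, 0x53, 0xEA      # 4-bit values widened by bit replication
exact = ((1 - da / 255) / (0xD / 15)) * (0xC / 15) \
    + ((1 - 0xD / 15) / (da / 255)) * (d / 255)
fa8 = div_un8(255 - da, sa)
fb8 = div_un8(255 - sa, da)
r8_split = mul_un8(fa8, s) + mul_un8(fb8, d)
r8_joint = (fa8 * s + fb8 * d + 127) // 255
dev_split = exact - r8_split / 255
dev_joint = exact - r8_joint / 255

# a8r8g8b8 OVER r5g6b5, green channel: source green 0xc3, alpha 0x0f, dest green 0x0f.
g6 = 0x0F
g8 = (g6 << 2) | (g6 >> 4)                 # 6-bit to 8-bit by bit replication
over8 = 0xC3 + mul_un8(255 - 0x0F, g8)
over_exact = 0xC3 / 255 + (1 - 0x0F / 255) * (g6 / 63)
dev_over = over_exact - over8 / 255

print(hex(fa8), hex(fb8), hex(r8_split), dev_split, dev_joint)
print(over8, dev_over)
```

Both deviations exceed 0.004 with the split rounding, which is the motivation given for relaxing the tolerance.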
-
Søren Sandmann Pedersen authored
This test program contains a table of individual operator/pixel combinations. For each pixel combination, images of various sizes are filled with the pixels and then composited. The result is then verified against the output of do_composite(). If the result doesn't match, detailed error information is printed. The initial 14 pixel combinations currently all fail.
-
Søren Sandmann Pedersen authored
The check-formats.c test depends on the exact format of the strings returned from these functions, so add a test here. a1-trap-test isn't the ideal place, but it seems like overkill to add a new test just for these trivial checks.
-
Søren Sandmann Pedersen authored
Given an operator and two formats, this program will composite and check all pixels where the red and blue channels are 0. That is, if the two formats are a8r8g8b8 and a4r4g4b4, all source pixels matching the mask 0xff00ff00 are composited with the given operator against all destination pixels matching the mask 0xf0f0 and the result is then verified against the do_composite() function that was moved to utils.c earlier. This program reveals that a number of operators and format combinations are not computed to within the precision currently accepted by pixel_checker_t. For example: check-formats over a8r8g8b8 r5g6b5 | grep failed | wc -l 30 reveals that there are 30 pixel combinations where OVER produces insufficiently precise results for the a8r8g8b8 and r5g6b5 formats.
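One way to enumerate every pixel value whose set bits fall inside a channel mask is the subset-stepping trick shown below. This is a hypothetical Python sketch for illustration only; the commit text does not say how check-formats iterates over pixels, and the function name is an assumption.

```python
def pixels_matching(mask):
    """Yield every value v with (v & ~mask) == 0, in increasing order."""
    v = 0
    while True:
        yield v
        v = (v - mask) & mask   # step to the next subset of the mask bits
        if v == 0:              # wrapped around: all subsets visited
            return


dest_pixels = list(pixels_matching(0xF0F0))          # a4r4g4b4 with r = b = 0
src_count = sum(1 for _ in pixels_matching(0xFF00FF00))
print(len(dest_pixels), "destination pixels,", src_count, "source pixels")
print(len(dest_pixels) * src_count, "pixel pairs per operator")
```

For the a8r8g8b8 and a4r4g4b4 example in the commit message, this gives the size of the search space that a single run of the checker has to cover.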
-
Søren Sandmann Pedersen authored
This function returns the a, r, g, and b masks corresponding to the pixel checker's format.
-
Søren Sandmann Pedersen authored
This function takes a pixel in the format corresponding to the pixel checker, and converts to a color_t.
-
Søren Sandmann Pedersen authored
So that it can be used in other tests.
-
- Jan 30, 2013
-
-
Søren Sandmann Pedersen authored
-
Søren Sandmann Pedersen authored
In c2cb303d, return_if_fail()s were added to prevent the trapezoid rasterizers from being called with non-alpha formats. However, stress-test actually does call the rasterizers with non-alpha formats, but because _pixman_log_error() is disabled in versions with an odd minor number, the errors never materialized. Fix this by changing the argument to random format to an enum of three values DONT_CARE, PREFER_ALPHA, or REQUIRE_ALPHA, and then in the switch that calls the trapezoid rasterizers, pass the appropriate value for the function in question.
-
- Jan 29, 2013
-
-
Søren Sandmann Pedersen authored
The old one belongs to the email address sandmann@daimi.au.dk, which doesn't work anymore. Also use gpg to get the name and address for the "(Signed by ...)" line since that works more reliably for me than using git.
-
Ben Avison authored
In particular this affects single-core ARMs (e.g. ARM11, Cortex-A8), which are usually configured this way. For other CPUs, this should only add a constant time, which will be cancelled out by the EXCLUDE_OVERHEAD runs. The problems were caused by cachelines becoming permanently evicted from the cache, because the code that was intended to pull them back in again on each iteration assumed too long a cache line (for the L1 test) or failed to read memory beyond the first pixel row (for the L2 test). Also, the reloading of the source buffer was unnecessary. These issues were identified by Siarhei in this post: http://lists.freedesktop.org/archives/pixman/2013-January/002543.html
-