[Pixman] [PATCH] sse2: faster bilinear interpolation (get rid of XOR instruction)

Siarhei Siamashka siarhei.siamashka at gmail.com
Mon Jan 28 07:52:12 PST 2013


On Mon, 28 Jan 2013 07:40:05 +0200
Siarhei Siamashka <siarhei.siamashka at gmail.com> wrote:

> The old code was calculating horizontal weights for right pixels
> in the following way (for simplicity assume 8-bit interpolation
> precision):
> 
>   Start with "x = vx" and do increment "x += ux" after each pixel.
>   In this case right pixel weight for interpolation can be calculated
>   as "((x >> 8) ^ 0xFF) + 1", which is the same as "256 - (x >> 8)".
> 
> The new code instead:
> 
>   Starts with "x = -(vx + 1)", performs increment "x += -ux" after
>   each pixel and calculates right weights as just "(x >> 8) + 1",
>   eliminating the need for XOR operation in the inner loop.
> 
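
The identity behind the two schemes above can be checked with a small
scalar sketch. It assumes 16-bit lanes and the simplified 8-bit
precision from the description; the function names are made up for
illustration and this is not the actual SSE2 code from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Old scheme: derive the weight from x with XOR and +1. */
static unsigned weight_old(uint16_t x)
{
    return ((unsigned)(x >> 8) ^ 0xFFu) + 1u;   /* same as 256 - (x >> 8) */
}

/* New scheme: x already holds -(vx + 1) minus the accumulated steps,
 * so a shift and +1 are enough. */
static unsigned weight_new(uint16_t neg_x)
{
    return (unsigned)(neg_x >> 8) + 1u;
}

/* Step both schemes over n pixels; return 1 if the weights agree
 * at every pixel, 0 otherwise. */
static int weights_match(uint16_t vx, uint16_t ux, int n)
{
    uint16_t x = vx;
    uint16_t neg_x = (uint16_t)(0u - ((unsigned)vx + 1u)); /* -(vx + 1) */
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++) {
        if (weight_old(x) != weight_new(neg_x))
            return 0;
        x = (uint16_t)(x + ux);           /* x += ux   */
        neg_x = (uint16_t)(neg_x - ux);   /* x += -ux  */
    }
    return 1;
}
```

This works because -(x + 1) is the bitwise complement of x in two's
complement arithmetic, so the XOR with 0xFF is already folded into
the starting value.
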
> So we have one instruction less on the critical path. Benchmarks
> with "lowlevel-blt-bench -b src_8888_8888" using GCC 4.7.2 on
> x86-64 system and default optimizations:
> 
> Intel Core i7 860 (2.8GHz):
>     before: src_8888_8888 =  L1: 359.00  L2: 354.78  M:348.82
>     after:  src_8888_8888 =  L1: 402.24  L2: 391.12  M:386.51

The MPix/s numbers should actually be as follows when running
at the real 2.8GHz clock speed:

    before: src_8888_8888 =  L1: 291.37  L2: 288.58  M:285.38
    after:  src_8888_8888 =  L1: 319.66  L2: 316.47  M:312.06

Apparently the recent kernel upgrade on my PC enabled Turbo Boost
without me noticing. One always needs to watch out :-/

> Intel Core2 T7300 (2GHz):
>     before: src_8888_8888 =  L1: 121.95  L2: 118.38  M:118.52
>     after:  src_8888_8888 =  L1: 128.82  L2: 125.12  M:124.88
> 
> Intel Atom N450 (1.67GHz):
>     before: src_8888_8888 =  L1:  64.25  L2:  62.37  M: 61.80
>     after:  src_8888_8888 =  L1:  64.23  L2:  62.37  M: 61.82

Still, the performance per MHz differs significantly between
CPU generations (that's all the x86 hardware I have here):

Core i7 (Lynnfield) :  ~9 cycles per pixel (down from ~10 cycles)
Core2 (Merom)       : ~16 cycles per pixel (down from ~17 cycles)
Atom (Pineview)     : ~27 cycles per pixel (has not changed)
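
For reference, these estimates are just the clock frequency divided
by the throughput. A minimal sketch of the arithmetic, assuming the
nominal clock speeds listed with the benchmarks (the helper name is
made up for illustration):

```c
#include <assert.h>

/* Cycles spent per pixel, given the clock in MHz and the
 * benchmark throughput in MPix/s. */
static double cycles_per_pixel(double clock_mhz, double mpix_per_s)
{
    assert(mpix_per_s > 0.0);
    return clock_mhz / mpix_per_s;
}

/*
 * With the L1 numbers:
 *   cycles_per_pixel(2800.0, 291.37)   Core i7 before, roughly 9.6
 *   cycles_per_pixel(2800.0, 319.66)   Core i7 after,  roughly 8.8
 *   cycles_per_pixel(2000.0, 121.95)   Core2 before,   roughly 16.4
 *   cycles_per_pixel(1670.0,  61.80)   Atom (M number), roughly 27
 */
```

The exact figure depends on which of the L1/L2/M columns is used, so
the values in the list above are rounded estimates.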

There is something really wrong with Atom. It should not normally
perform so badly on this code, and there must be a good reason. In any
case, it looks like Atom has *huge* room for improvement on bilinear
scaling. The code in question is ~20 instructions per pixel, as can
be seen in the objdump log here (and SSSE3, supported by nearly all
modern CPUs, can reduce this number further):

    http://lists.freedesktop.org/archives/pixman/2013-January/002549.html

BTW, I'm not sure if it would really help, but the compiler should have
a lot more room for optimization if we add the __restrict keyword to
the pointers used in the bilinear scaling code and also drop the
-fno-strict-aliasing compiler option, which is really supposed to
inhibit this type of optimization. The problem is that the compiler
simply can't reorder reads from the source image relative to the
writes to the destination image unless it is explicitly told that
they do not alias. Again, I'm not sure if it would translate into real
performance improvements with the current GCC versions and the
current pixman code, but at least the compiler would no longer have a
good excuse for generating bad code for SSE2 intrinsics :-)
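
To illustrate what I mean by __restrict, here is a minimal scalar
sketch. The function below is hypothetical and much simpler than the
real pixman scanline code; it only does the vertical part of the
interpolation with 8-bit weights:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not pixman code: blend two source rows into
 * a destination row. The __restrict qualifiers promise the compiler
 * that dst never overlaps top or bottom, so it is free to move loads
 * from the source rows across stores to dst. */
static void
blend_rows (uint32_t * __restrict       dst,
            const uint32_t * __restrict top,
            const uint32_t * __restrict bottom,
            unsigned                    wb,     /* bottom weight, 0..256 */
            size_t                      width)
{
    size_t i;

    for (i = 0; i < width; i++)
    {
        uint32_t t = top[i], b = bottom[i], out = 0;
        unsigned shift;

        /* Interpolate each of the four 8-bit channels separately */
        for (shift = 0; shift < 32; shift += 8)
        {
            uint32_t tc = (t >> shift) & 0xFF;
            uint32_t bc = (b >> shift) & 0xFF;
            uint32_t c  = (tc * (256 - wb) + bc * wb) >> 8;

            out |= c << shift;
        }
        dst[i] = out;
    }
}
```

Without the qualifiers the compiler has to assume that every store to
dst may change what top[i + 1] or bottom[i + 1] contain. Whether this
results in better generated code in practice still needs to be
measured.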

-- 
Best regards,
Siarhei Siamashka

