[Pixman] [PATCH] sse2: Using MMX and SSE 4.1

Sun May 13 10:14:39 PDT 2012

On Wed, May 9, 2012 at 7:57 PM, Søren Sandmann <sandmann at cs.au.dk> wrote:
> Matt Turner <mattst88 at gmail.com> writes:
>
>> I started porting my src_8888_0565 MMX function to SSE2, and in the
>> process started thinking about using SSE3+. The useful instructions
>> added post SSE2 that I see are
>>       SSE3:   lddqu - for unaligned loads across cache lines
>
> I don't really understand that instruction. Isn't it identical to
> movdqu?  Or is the idea that lddqu is faster than movdqu for cache line
> splits, but slower for plain old, non-cache split unaligned loads?
>
>>       SSSE3:  palignr - for unaligned loads (but requires software
>>                         pipelining...)
>>               pmaddubsw - maybe?
>
> pmaddubsw would be very useful for bilinear interpolation if we drop
> coordinate precision to 7 bits instead of the current 8. One example way
> to use it is to put 7-bit values of (1 - x, x, 1 - x, x) in a register,
> and interleave the top/left and top/right pixels in another. pmaddubsw on
> those two registers will then produce a linear interpolation between the
> two top pixels. A similar thing can be done for the bottom pixels, and
> then the intermediate results can be interleaved and combined using
> pmaddwd.

I would say that improving bilinear scaling performance on x86 is
really important for pixman in order to remain competitive. The
following link might be a good source of inspiration:
    http://www.hackermusings.com/2012/05/firefoxs-graphics-performance-on-x11/

The comments with the Azure backend performance numbers are
particularly interesting. For example, one of them mentions 12fps with
XRender disabled (using pixman?) vs. 15fps with Azure canvas enabled
(using Skia?) for FishIETank. Needless to say, it would be nice to
improve pixman performance by 30% or more.

And here are some benchmarks for the firefox-fishtank trace with
pixman-0.25.2, comparing NEON vs. SSE2 on ARM Cortex-A8 and Intel
Atom (both are superscalar dual-issue in-order cores):

=== ARM Cortex-A8 @1GHz ===

CC=gcc-4.5.3 CFLAGS="-O2 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp"
[  0]    image             firefox-fishtank  359.228  359.436   0.43%    3/3

CC=gcc-4.5.3 CFLAGS="-O2 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mthumb"
[  0]    image             firefox-fishtank  347.195  347.773   0.12%    3/3

=== Intel Atom N450 @1.67GHz ===

CC=gcc-4.5.3 CFLAGS="-O2 -g -march=atom -mtune=atom"
[  0]    image             firefox-fishtank  308.439  308.881   0.09%    3/3

CC=gcc-4.5.3 CFLAGS="-O2 -g -march=atom -mtune=atom -m32"
[  0]    image             firefox-fishtank  309.457  309.568   0.07%    3/3

CC=gcc-4.5.3 CFLAGS="-O2"
[  0]    image             firefox-fishtank  345.906  346.156   0.04%    3/3

CC=gcc-4.5.3 CFLAGS="-O2 -mtune=generic"
[  0]    image             firefox-fishtank  345.367  345.900   0.09%    3/3

The results for gcc-4.7.0 were nearly the same. Currently a 1GHz ARM
Cortex-A8 is almost as fast as a 1.67GHz Atom. The ARM NEON bilinear
code uses 8-bit multiplications. Atom could use PMADDUBSW to also
benefit from 8-bit multiplications and improve performance per MHz.
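
To make the quoted pmaddubsw/pmaddwd idea a bit more concrete, here is
a rough, untested-for-speed sketch for a single output pixel. It is not
pixman code: the name bilinear_7bit is made up, the weights are assumed
to be 7-bit (0..127), and the result is truncated rather than rounded.
A left weight of 128 does not fit into the signed byte operand of
pmaddubsw, so x == 0 is special-cased. A scalar fallback with the same
arithmetic is used when the compiler does not enable SSSE3.

```c
#include <stdint.h>
#if defined(__SSSE3__)
#include <tmmintrin.h>
#endif

/* Hypothetical helper (not part of pixman): bilinear blend of four
 * a8r8g8b8 pixels with 7-bit weights, assuming 0 <= x, y <= 127. */
static uint32_t
bilinear_7bit (uint32_t tl, uint32_t tr, uint32_t bl, uint32_t br,
               int x, int y)
{
#if defined(__SSSE3__)
    __m128i top, bot, pix, wx, wy, h, v;

    /* The left weight 128 - x must fit into a signed byte. For x == 0
     * blend each left pixel with itself, which gives the same result. */
    if (x == 0)
    {
        tr = tl;
        br = bl;
        x = 1;
    }

    /* Interleave bytes: tl.b tr.b tl.g tr.g tl.r tr.r tl.a tr.a */
    top = _mm_unpacklo_epi8 (_mm_cvtsi32_si128 ((int) tl),
                             _mm_cvtsi32_si128 ((int) tr));
    bot = _mm_unpacklo_epi8 (_mm_cvtsi32_si128 ((int) bl),
                             _mm_cvtsi32_si128 ((int) br));
    /* Low 64 bits: top pairs, high 64 bits: bottom pairs */
    pix = _mm_unpacklo_epi64 (top, bot);

    /* Byte pairs (128 - x, x); pmaddubsw treats them as signed bytes */
    wx = _mm_set1_epi16 ((short) ((x << 8) | (128 - x)));
    /* Eight 16-bit sums l * (128 - x) + r * x, at most 255 * 128 */
    h = _mm_maddubs_epi16 (pix, wx);

    /* Interleave top and bottom sums, then weight them by (128 - y, y) */
    h = _mm_unpacklo_epi16 (h, _mm_srli_si128 (h, 8));
    wy = _mm_set1_epi32 ((y << 16) | (128 - y));
    v = _mm_srli_epi32 (_mm_madd_epi16 (h, wy), 14);

    /* Pack the four 32-bit channels back into one a8r8g8b8 pixel */
    v = _mm_packs_epi32 (v, v);
    v = _mm_packus_epi16 (v, v);
    return (uint32_t) _mm_cvtsi128_si32 (v);
#else
    /* Scalar reference with the same arithmetic, one channel at a time */
    uint32_t out = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8)
    {
        uint32_t t = ((tl >> shift) & 0xff) * (uint32_t) (128 - x) +
                     ((tr >> shift) & 0xff) * (uint32_t) x;
        uint32_t b = ((bl >> shift) & 0xff) * (uint32_t) (128 - x) +
                     ((br >> shift) & 0xff) * (uint32_t) x;

        out |= ((t * (uint32_t) (128 - y) + b * (uint32_t) y) >> 14) << shift;
    }
    return out;
#endif
}
```

For example, bilinear_7bit (0xff000000, 0xffffffff, 0xff000000,
0xffffffff, 64, 0) blends opaque black and opaque white half-way and
gives 0xff7f7f7f. A real fast path would of course process several
pixels per iteration and keep the weights in registers; this only
shows the arithmetic.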

-- 
Best regards,
Siarhei Siamashka