[Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.

Fri May 11 01:55:00 PDT 2012

On Thu, May 3, 2012 at 1:03 AM, Nemanja Lukic <nlukic at mips.com> wrote:
> From: Nemanja Lukic <nemanja.lukic at rt-rk.com>
>
> Performance numbers before/after on MIPS-74kc @ 1GHz
>
> Referent (before):
>
> cairo-perf-trace:
> [ # ]  backend                         test   min(s) median(s) stddev. count
> [ # ]    image: pixman 0.25.3
> [  0]    image             firefox-fishtank 2289.180 2290.567   0.05%    5/6
>
> Optimized:
>
> cairo-perf-trace:
> [ # ]  backend                         test   min(s) median(s) stddev. count
> [ # ]    image: pixman 0.25.3
> [  0]    image             firefox-fishtank 1700.925 1708.314   0.22%    5/6

This is definitely an improvement. But the firefox-fishtank trace
depends heavily on bilinear scaling performance, and both x86 SSE2 and
ARM NEON demonstrate more than 3x speedup here:
    http://ssvb.github.com/2012/05/04/xorg-drivers-and-software-rendering.html

I understand that MIPS DSPr2 does not stand a chance against 128-bit
SIMD competitors, but some more performance tweaks can probably still
be applied. See more comments below.

> diff --git a/pixman/pixman-mips-dspr2-asm.h b/pixman/pixman-mips-dspr2-asm.h
> index 8383060..7cf3281 100644
> --- a/pixman/pixman-mips-dspr2-asm.h
> +++ b/pixman/pixman-mips-dspr2-asm.h
> @@ -566,4 +566,60 @@ LEAF_MIPS32R2(symbol)                                   \
>     addu_s.qb              \out2_8888, \d2_8888,  \scratch2
>  .endm
>
> +.macro BILINEAR_INTERPOLATE_SINGLE_PIXEL tl, tr, bl, br,         \
> +                                         scratch1, scratch2,     \
> +                                         alpha, red, green, blue \
> +                                         wt1, wt2, wb1, wb2
> +    andi            \scratch1, \tl,  0xff
> +    andi            \scratch2, \tr,  0xff
> +    andi            \alpha,    \bl,  0xff
> +    andi            \red,      \br,  0xff

I suggest having a look at
    http://lists.freedesktop.org/archives/pixman/2011-February/001088.html

The ANDI/EXT instructions in the BILINEAR_INTERPOLATE_SINGLE_PIXEL
macro could be replaced with byte load instructions. MIPS74K can't
dual issue ALU+ALU instructions, but can dual issue LS+ALU. This looks
like a potentially huge performance win on MIPS74K hardware, far
exceeding the speedup observed on x86.
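
To illustrate the idea in C (a minimal sketch, not code from the patch
or from pixman): the function names below are hypothetical, the weight
names follow the quoted macro, and the pixels are assumed to be stored
in memory with the least significant channel first (little-endian).
Reading a channel as a byte from memory gives the same value as
shifting and masking the 32-bit pixel in a register.

```c
#include <stdint.h>

/* Shift-and-mask extraction from a pixel held in a register; this is
 * the work that ANDI/EXT do in the quoted macro. */
static inline uint32_t
channel_from_register (uint32_t pixel, unsigned ch)
{
    return (pixel >> (8 * ch)) & 0xff;
}

/* Byte-load extraction from the pixel in memory (LBU on MIPS).
 * Assumption: byte 0 holds the least significant channel. */
static inline uint32_t
channel_from_memory (const uint8_t *pixel, unsigned ch)
{
    return pixel[ch];
}

/* Weighted sum of one channel over the four neighbours, fed by byte
 * loads only. Hypothetical helper for illustration. */
static uint32_t
weighted_channel_sum (const uint8_t *tl, const uint8_t *tr,
                      const uint8_t *bl, const uint8_t *br,
                      unsigned ch,
                      uint32_t wt1, uint32_t wt2,
                      uint32_t wb1, uint32_t wb2)
{
    return wt1 * tl[ch] + wt2 * tr[ch] + wb1 * bl[ch] + wb2 * br[ch];
}
```

In assembly, each tl[ch] style access would become a load that can be
paired with a multiply-accumulate, instead of an ALU instruction
competing with it for the same issue slot.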

Why is the faster C bilinear code from my old post still not in
pixman? As I mentioned there, "the discussion is still ongoing about
how to improve bilinear scaling performance when SIMD extensions are
not available". Reducing interpolation precision from the current
8-bit to 7-bit allows the use of signed multiplications and can help
x86 MMX/SSE2/SSSE3 code a lot. It may even make sense to reduce
interpolation precision further to 4-bit, as suggested by Taekyun Kim
at that time:
    http://lists.freedesktop.org/archives/pixman/2011-February/001044.html
This makes it possible to halve the number of multiplications for
bilinear interpolation in C code by using SIMD-like tricks.
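
A minimal sketch of such a trick, under my own assumptions (this is
not the code from either linked post, and the function names are
hypothetical): with 4-bit distx/disty in the range 0..15, the four
weights sum to 256, so each weighted channel sum fits in 16 bits. Two
channels can then share one 32-bit multiplication, which gives 8
multiplications per pixel instead of 16. Results are truncated, not
rounded.

```c
#include <stdint.h>

/* Straightforward version: one multiplication per channel per
 * neighbour, 16 in total. Used here only as a reference. */
static uint32_t
bilinear_4bit_reference (uint32_t tl, uint32_t tr,
                         uint32_t bl, uint32_t br,
                         unsigned distx, unsigned disty)
{
    uint32_t p[4] = { tl, tr, bl, br };
    uint32_t w[4] = { (16 - distx) * (16 - disty), distx * (16 - disty),
                      (16 - distx) * disty,        distx * disty };
    uint32_t out = 0;
    unsigned ch, i;

    for (ch = 0; ch < 4; ch++)
    {
        uint32_t sum = 0;
        for (i = 0; i < 4; i++)
            sum += ((p[i] >> (8 * ch)) & 0xff) * w[i];
        out |= (sum >> 8) << (8 * ch);
    }
    return out;
}

/* Packed version: channels 0 and 2 share one multiplication, and
 * channels 1 and 3 share another, 8 multiplications in total.
 * Each 16-bit lane holds at most 255 * 256, so lanes never overlap. */
static uint32_t
bilinear_4bit_packed (uint32_t tl, uint32_t tr,
                      uint32_t bl, uint32_t br,
                      unsigned distx, unsigned disty)
{
    uint32_t wtl = (16 - distx) * (16 - disty);
    uint32_t wtr = distx * (16 - disty);
    uint32_t wbl = (16 - distx) * disty;
    uint32_t wbr = distx * disty;

    uint32_t lo = (tl & 0x00ff00ff) * wtl + (tr & 0x00ff00ff) * wtr +
                  (bl & 0x00ff00ff) * wbl + (br & 0x00ff00ff) * wbr;
    uint32_t hi = ((tl >> 8) & 0x00ff00ff) * wtl +
                  ((tr >> 8) & 0x00ff00ff) * wtr +
                  ((bl >> 8) & 0x00ff00ff) * wbl +
                  ((br >> 8) & 0x00ff00ff) * wbr;

    /* Keep the upper byte of every 16-bit lane and repack. */
    return ((lo >> 8) & 0x00ff00ff) | (hi & 0xff00ff00);
}
```

The same packing does not work with 8-bit weights, because the
weights would then sum to 65536 and a weighted channel sum would need
24 bits.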

But both Taekyun Kim and I were mostly interested in ARM NEON
performance, and NEON happens not to suffer much from 8-bit
interpolation. Nobody else has tried pushing interpolation precision
reduction for faster bilinear interpolation into pixman, so it did not
happen. But hope is not totally lost, see the recent discussion:
    http://lists.freedesktop.org/archives/pixman/2012-May/001930.html

As for how this affects you: if bilinear interpolation precision gets
changed after all, your optimized code in the bilinear over_8888_8_8888
fast path will need to be updated (if we still care about getting
identical results everywhere and passing the test suite). You may also
want to take part in this activity and evaluate the effects of 8-bit
vs. 7-bit vs. 4-bit interpolation for MIPS.

> +    multu           $ac0,      \wt1, \scratch1
> +    maddu           $ac0,      \wt2, \scratch2
> +    maddu           $ac0,      \wb1, \alpha
> +    maddu           $ac0,      \wb2, \red
> +
> +    ext             \scratch1, \tl,  8, 8
> +    ext             \scratch2, \tr,  8, 8
> +    ext             \alpha,    \bl,  8, 8
> +    ext             \red,      \br,  8, 8
> +
> +    multu           $ac1,      \wt1, \scratch1
> +    maddu           $ac1,      \wt2, \scratch2
> +    maddu           $ac1,      \wb1, \alpha
> +    maddu           $ac1,      \wb2, \red
> +
> +    ext             \scratch1, \tl,  16, 8
> +    ext             \scratch2, \tr,  16, 8
> +    ext             \alpha,    \bl,  16, 8
> +    ext             \red,      \br,  16, 8
> +
> +    mflo            \blue,     $ac0
> +
> +    multu           $ac2,      \wt1, \scratch1
> +    maddu           $ac2,      \wt2, \scratch2
> +    maddu           $ac2,      \wb1, \alpha
> +    maddu           $ac2,      \wb2, \red
> +
> +    ext             \scratch1, \tl,  24, 8
> +    ext             \scratch2, \tr,  24, 8
> +    ext             \alpha,    \bl,  24, 8
> +    ext             \red,      \br,  24, 8
> +
> +    mflo            \green,    $ac1
> +
> +    multu           $ac3,      \wt1, \scratch1
> +    maddu           $ac3,      \wt2, \scratch2
> +    maddu           $ac3,      \wb1, \alpha
> +    maddu           $ac3,      \wb2, \red
> +
> +    mflo            \red,      $ac2
> +    mflo            \alpha,    $ac3
> +
> +    precr.qb.ph     \alpha,    \alpha, \red
> +    precr.qb.ph     \scratch1, \green, \blue
> +    precrq.qb.ph    \tl,       \alpha, \scratch1

Here you are combining RGBA values and then splitting them again later
in the OVER_8888_8_8888 macro. Could this be exploited somehow?

-- 
Best regards,
Siarhei Siamashka