[Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.

Sun May 20 14:10:50 PDT 2012

On Wed, May 16, 2012 at 5:27 AM, Siarhei Siamashka
<siarhei.siamashka at gmail.com> wrote:
> +/*
> + * Multiply pixel (a8) with single pixel (a8r8g8b8). It requires maskLSR
> + * needed for rounding process. maskLSR must have following value:
> + *   li       maskLSR, 0x00ff00ff
> + */
> +.macro MIPS_UN8x4_MUL_UN8 s_8888,  \
> +                          m_8,     \
> +                          d_8888,  \
> +                          maskLSR, \
> +                          scratch1, scratch2, scratch3
> +    replv.ph          \m_8,      \m_8                 /*   0 | M | 0 | M */
> +    muleu_s.ph.qbl    \scratch1, \s_8888,   \m_8      /*    A*M  |  R*M */
> +    muleu_s.ph.qbr    \scratch2, \s_8888,   \m_8      /*    G*M  |  B*M */
> +    shra_r.ph         \scratch3, \scratch1, 8
> +    shra_r.ph         \d_8888,   \scratch2, 8
> +    and               \scratch3, \scratch3, \maskLSR  /*   0 |A*M| 0 |R*M */
> +    and               \d_8888,   \d_8888,   \maskLSR  /*   0 |G*M| 0 |B*M */
> +    addq.ph           \scratch1, \scratch1, \scratch3 /* A*M+A*M | R*M+R*M */
> +    addq.ph           \scratch2, \scratch2, \d_8888   /* G*M+G*M | B*M+B*M */
> +    shra_r.ph         \scratch1, \scratch1, 8
> +    shra_r.ph         \scratch2, \scratch2, 8
> +    precr.qb.ph       \d_8888,   \scratch1, \scratch2
> +.endm
>
> A possible alternative way is to just use a single MULQ_RS.W
> instruction for each color component. That's total 5 instructions
> because 8-bit alpha value from mask needs to be premultiplied by
> 8421504. A test program is listed below:
>
> /***********************/
>
> #include <stdio.h>
> #include <stdint.h>
>
> int mul_un8(int a, int b)
> {
> #if 1
>    int t = a * b + 0x80;
>    return (t + (t >> 8)) >> 8;
> #else
>    return (a * b + 127) / 255;
> #endif
> }
>
> int mul_un8_mips(int a, int b)
> {
>    int c;
>    b *= 8421504;
> #if 1
>    asm ("mulq_rs.w %0, %1, %2" : "=r" (c) : "r" (a), "r" (b));
> #else
>    c = ((int64_t)a * b + (1 << 30)) >> 31;
> #endif
>    return c;
> }
>
> int main()
> {
>    int a, b;
>    for (a = 0; a < 256; a++)
>    {
>        for (b = 0; b < 256; b++)
>        {
>            if (mul_un8(a, b) != mul_un8_mips(a, b))
>            {
>                printf("test failed! a=%d b=%d\n", a, b);
>                return 1;
>            }
>        }
>    }
>    printf("test passed\n");
>    return 0;
> }
>
> /***********************/
>
> There is only one problem with MULQ_RS.W instruction: it seems to have
> a huge ~14 (!) cycles latency. But the throughput is ok (1 cycle per
> instruction). So in order to hide this latency, some serious loop
> unrolling may be needed (possibly up to handling 4 pixels at once).
> The other MIPS DSP ASE fast path functions may also try to use
> MULQ_RS.W

Maybe a bit more explanation would be useful. When multiplying an 8-bit
color component by an 8-bit alpha for the OVER operator, pixman actually
wants to compute:

    x' = (x * a + (255 / 2)) / 255;

This is a division by 255 with the result rounded to the nearest
integer. Because integer division is slow, the C implementation replaces
it with shifts and additions:
    http://cgit.freedesktop.org/pixman/tree/pixman/pixman-combine.h.template?id=pixman-0.24.4#n27

    t  = x * a + 0x80;
    x' = (t + (t >> 8)) >> 8;
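
The equivalence of the two forms can be checked exhaustively, since
there are only 65536 input pairs. Below is a minimal sketch of such a
check; the function names are made up for illustration and are not
pixman identifiers.

```c
/* Reference: the division the explanation starts from (255 / 2 == 127). */
static int mul_un8_div(int x, int a)
{
    return (x * a + 127) / 255;
}

/* Shift-and-add replacement, as quoted from pixman-combine.h.template. */
static int mul_un8_shift(int x, int a)
{
    int t = x * a + 0x80;
    return (t + (t >> 8)) >> 8;
}

/* Count the (x, a) pairs for which the two versions disagree. */
static int count_mismatches(void)
{
    int x, a, bad = 0;

    for (x = 0; x < 256; x++)
        for (a = 0; a < 256; a++)
            if (mul_un8_div(x, a) != mul_un8_shift(x, a))
                bad++;
    return bad;
}
```

Calling count_mismatches() from a small main() and printing the result
is enough to confirm that the shift version is bit exact.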

This method of calculation is also good because the intermediate
results fit in unsigned 16-bit variables, which also allows a SIMD-like
trick to process two color components at once on 32-bit systems:
    http://cgit.freedesktop.org/pixman/tree/pixman/pixman-combine.h.template?id=pixman-0.24.4#n40

    /* xy = 0x00AB00CD, where AB is one color component and CD is another */
    t   = xy * a + 0x00800080;
    xy' = ((t + ((t >> 8) & 0x00FF00FF)) >> 8) & 0x00FF00FF;
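
The reason this works is that each 16-bit half never exceeds
255 * 255 + 0x80 + 254 = 65407, so nothing carries from the low half
into the high one. A minimal sketch that compares the packed version
against two scalar multiplications follows; the helper names are
hypothetical and chosen only for this example.

```c
#include <stdint.h>

/* Scalar version for a single color component. */
static uint32_t mul_un8(uint32_t x, uint32_t a)
{
    uint32_t t = x * a + 0x80;
    return (t + (t >> 8)) >> 8;
}

/* Two components packed as 0x00AB00CD, both multiplied by a. */
static uint32_t mul_un8x2(uint32_t xy, uint32_t a)
{
    uint32_t t = xy * a + 0x00800080;
    return ((t + ((t >> 8) & 0x00FF00FF)) >> 8) & 0x00FF00FF;
}

/* Returns 1 if the packed result equals two scalar results
 * for every combination of x, y and a, and 0 otherwise. */
static int packed_matches_scalar(void)
{
    uint32_t x, y, a;

    for (a = 0; a < 256; a++)
        for (x = 0; x < 256; x++)
            for (y = 0; y < 256; y++) {
                uint32_t expect = (mul_un8(x, a) << 16) | mul_un8(y, a);
                if (mul_un8x2((x << 16) | y, a) != expect)
                    return 0;
            }
    return 1;
}
```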

But modern processors may support some nice instructions which can do
this job even better. It makes sense to undo the C optimizations and
look at the original ((x * a + (255 / 2)) / 255) formula again.

MIPS DSP ASE has a special instruction MULQ_RS.W for rounded fixed-point
Q31 multiplication (((int64_t)a * b + (1 << 30)) >> 31), and it can be
used quite conveniently here because we get a shift and rounding for
free. We just need to use the Q31 representation of 1/255 for
multiplication by reciprocal, which is where the constant 8421504
(0x808080, that is 2^31 / 255 rounded down) comes from. MIPS DSP ASE
also supports Q15 fixed-point multiplication and it could have been
even nicer, but Q15 precision is apparently insufficient for getting
bit-exact results in this case.
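
To illustrate the precision point without MIPS hardware, here is a
rough portable sketch. The Q31 model is the same expression as in the
test program quoted above. The Q15 variant is only my assumption of how
an analogous 16-bit attempt would look (a rounded Q15 multiply with 128
as the reciprocal); it is not code from the patch, and saturation
behaviour of the real instructions is ignored because it cannot trigger
for these inputs.

```c
#include <stdint.h>

/* Reference shift-and-add version. */
static int mul_un8(int x, int a)
{
    int t = x * a + 0x80;
    return (t + (t >> 8)) >> 8;
}

/* Q31 variant: portable model of MULQ_RS.W with 8421504 == 0x808080,
 * which is 2^31 / 255 rounded down. 255 * 8421504 still fits in int32. */
static int mul_un8_q31(int x, int a)
{
    int32_t b = a * 8421504;
    return (int)(((int64_t)x * b + (1 << 30)) >> 31);
}

/* ASSUMPTION: hypothetical Q15 variant. 128 is 2^15 / 255 rounded down;
 * 129 is not usable because 255 * 129 exceeds the signed 16-bit range. */
static int mul_un8_q15(int x, int a)
{
    int b = a * 128;
    return (x * b + (1 << 14)) >> 15;
}

/* Count the input pairs where f disagrees with the reference. */
static int count_mismatches(int (*f)(int, int))
{
    int x, a, bad = 0;

    for (x = 0; x < 256; x++)
        for (a = 0; a < 256; a++)
            if (f(x, a) != mul_un8(x, a))
                bad++;
    return bad;
}
```

With this model count_mismatches(mul_un8_q31) is zero, while the Q15
variant already goes wrong for opaque white: mul_un8_q15(255, 255)
gives 254 instead of 255.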

Looking at the original intended formula is always useful, because it
allows us to try and benchmark alternative implementations of the same
calculation and select the best one for the target hardware. Maybe the
MIPS r5g6b5 -> x8r8g8b8 pixel format conversion could also borrow some
ideas from the other discussion thread:
    http://lists.freedesktop.org/archives/pixman/2012-May/001958.html

-- 
Best regards,
Siarhei Siamashka