[Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.

Siarhei Siamashka siarhei.siamashka at gmail.com
Tue May 15 19:27:33 PDT 2012


On Wed, May 16, 2012 at 1:34 AM, Matt Turner <mattst88 at gmail.com> wrote:
> On Tue, May 15, 2012 at 5:37 PM, Siarhei Siamashka <siarhei.siamashka at gmail.com> wrote:
>>> I still need to add improvement for that packing/unpacking of the RGBA pixels
>>> after bilinear/before OVER operation, but I don't expect big improvement
>>> there (it is just a couple of instructions).
>>
>> It's not just a couple of instructions. By combining the color
>> channels in a register, you are also forcing the processor to finish
>> the calculations for all the needed data. And this is an extra data
>> dependency, which may inhibit instructions reordering.
>
> Indeed. For instance this is part of the reason why [1] made such a
> large difference.
>
> [1] http://cgit.freedesktop.org/pixman/commit/?id=7d4beedc612a32b73d7673bbf6447de0f3fca298

Actually there is one more interesting trick to try.

Immediately after the color components are combined, the
MIPS_UN8x4_MUL_UN8 macro processes the combined value "s_8888", and it
needs 12 instructions:

+/*
+ * Multiply pixel (a8) with single pixel (a8r8g8b8). It requires maskLSR
+ * needed for rounding process. maskLSR must have following value:
+ *   li       maskLSR, 0x00ff00ff
+ */
+.macro MIPS_UN8x4_MUL_UN8 s_8888,  \
+                          m_8,     \
+                          d_8888,  \
+                          maskLSR, \
+                          scratch1, scratch2, scratch3
+    replv.ph          \m_8,      \m_8                 /*   0 | M | 0 | M */
+    muleu_s.ph.qbl    \scratch1, \s_8888,   \m_8      /*    A*M  |  R*M */
+    muleu_s.ph.qbr    \scratch2, \s_8888,   \m_8      /*    G*M  |  B*M */
+    shra_r.ph         \scratch3, \scratch1, 8
+    shra_r.ph         \d_8888,   \scratch2, 8
+    and               \scratch3, \scratch3, \maskLSR  /*   0 |A*M| 0 |R*M */
+    and               \d_8888,   \d_8888,   \maskLSR  /*   0 |G*M| 0 |B*M */
+    addq.ph           \scratch1, \scratch1, \scratch3 /* A*M+A*M | R*M+R*M */
+    addq.ph           \scratch2, \scratch2, \d_8888   /* G*M+G*M | B*M+B*M */
+    shra_r.ph         \scratch1, \scratch1, 8
+    shra_r.ph         \scratch2, \scratch2, 8
+    precr.qb.ph       \d_8888,   \scratch1, \scratch2
+.endm
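
For reference, the per-channel arithmetic of the macro can be modeled in
plain C. This is only a sketch: the names macro_channel and
macro_model_mismatches are made up for illustration, it works on one
channel in ordinary int arithmetic, and it does not model the signed
halfword behaviour that the 0x00ff00ff mask takes care of.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one channel of MIPS_UN8x4_MUL_UN8 (hypothetical helper).
 * Plain int arithmetic: halfword sign effects are not modeled. */
static int macro_channel(int s, int m)
{
    assert(s >= 0 && s <= 255 && m >= 0 && m <= 255);
    int t = s * m;                     /* muleu_s.ph.qbl / muleu_s.ph.qbr  */
    int r = ((t + 0x80) >> 8) & 0xff;  /* shra_r.ph by 8, then and maskLSR */
    t += r;                            /* addq.ph                          */
    return ((t + 0x80) >> 8) & 0xff;   /* shra_r.ph by 8, low byte packed  */
}

/* Counts the (s, m) pairs where the model differs from s * m / 255
 * rounded to nearest. */
static int macro_model_mismatches(void)
{
    int bad = 0;
    for (int s = 0; s < 256; s++)
        for (int m = 0; m < 256; m++)
            if (macro_channel(s, m) != (s * m + 127) / 255)
                bad++;
    return bad;
}
```

Calling macro_model_mismatches() from a small main() is enough to check
the model over all 65536 input pairs.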

A possible alternative is to use a single MULQ_RS.W instruction for
each color component. That is 5 instructions in total: four MULQ_RS.W
plus one multiplication, because the 8-bit alpha value from the mask
first needs to be premultiplied by 8421504 (0x808080). A test program
is listed below:

/***********************/

#include <stdio.h>
#include <stdint.h>

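/* Reference: a * b / 255 rounded to nearest, for a and b in 0..255. */
int mul_un8(int a, int b)
{
#if 1
    /* Division-free form. */
    int t = a * b + 0x80;
    return (t + (t >> 8)) >> 8;
#else
    return (a * b + 127) / 255;
#endif
}

/* Same result computed with a single MULQ_RS.W. */
int mul_un8_mips(int a, int b)
{
    int c;
    /* 8421504 is 0x808080; for b <= 255 the product still fits in 31 bits. */
    b *= 8421504;
#if 1
    asm ("mulq_rs.w %0, %1, %2" : "=r" (c) : "r" (a), "r" (b));
#else
    /* Portable equivalent for these inputs (usable on a non-MIPS host). */
    c = ((int64_t)a * b + (1 << 30)) >> 31;
#endif
    return c;
}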

int main()
{
    int a, b;
    for (a = 0; a < 256; a++)
    {
        for (b = 0; b < 256; b++)
        {
            if (mul_un8(a, b) != mul_un8_mips(a, b))
            {
                printf("test failed! a=%d b=%d\n", a, b);
                return 1;
            }
        }
    }
    printf("test passed\n");
    return 0;
}

/***********************/
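
The same idea extended to a whole a8r8g8b8 pixel might look roughly like
this. It is only a sketch: it uses the portable formula from the #else
branch of the test program in place of the real instruction (saturation
is not modeled, the operands here are never negative), and the names
mulq_rs_w_model, un8x4_mul_un8_q31 and un8x4_mismatches are made up for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for MULQ_RS.W, taken from the #else branch of the
 * test program. Saturation is not modeled. */
static int32_t mulq_rs_w_model(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b + (1 << 30)) >> 31);
}

/* Hypothetical helper: multiply an a8r8g8b8 pixel by an 8-bit mask. */
static uint32_t un8x4_mul_un8_q31(uint32_t s, uint32_t m)
{
    assert(m <= 255);
    int32_t mq = (int32_t)(m * 8421504u);  /* premultiply once per pixel */
    uint32_t a = (uint32_t)mulq_rs_w_model((int32_t)(s >> 24), mq);
    uint32_t r = (uint32_t)mulq_rs_w_model((int32_t)((s >> 16) & 0xff), mq);
    uint32_t g = (uint32_t)mulq_rs_w_model((int32_t)((s >> 8) & 0xff), mq);
    uint32_t b = (uint32_t)mulq_rs_w_model((int32_t)(s & 0xff), mq);
    return (a << 24) | (r << 16) | (g << 8) | b;
}

/* Per-channel reference: x * m / 255 rounded to nearest. */
static uint32_t ref_channel(uint32_t x, uint32_t m)
{
    return (x * m + 127) / 255;
}

/* Counts mismatches against the reference for one pixel and all masks. */
static int un8x4_mismatches(uint32_t s)
{
    int bad = 0;
    for (uint32_t m = 0; m < 256; m++) {
        uint32_t want = (ref_channel(s >> 24, m) << 24) |
                        (ref_channel((s >> 16) & 0xff, m) << 16) |
                        (ref_channel((s >> 8) & 0xff, m) << 8) |
                        ref_channel(s & 0xff, m);
        if (un8x4_mul_un8_q31(s, m) != want)
            bad++;
    }
    return bad;
}
```

The premultiplication is done once per pixel, so the count matches the
"four multiplies plus one" estimate; packing the four results back into
one register still costs extra instructions that are not shown here.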

There is only one problem with the MULQ_RS.W instruction: it seems to
have a huge latency of about 14 (!) cycles. The throughput is ok,
though (1 cycle per instruction). So in order to hide this latency,
some serious loop unrolling may be needed (possibly up to handling 4
pixels at once). The other MIPS DSP ASE fast path functions may also
try to use MULQ_RS.W.
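
To illustrate the unrolling: 4 pixels with 4 channels each give 16
independent multiplications, which is a bit more than the ~14 cycles of
latency mentioned above, so the first result could be ready by the time
the last multiply is issued. A rough C sketch of that structure follows.
It uses the portable formula from the test program instead of the real
instruction, the name mul_4_pixels is made up for illustration, and real
code would be hand-scheduled assembly (a C compiler is free to reorder
all of this).

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for MULQ_RS.W (saturation not modeled). */
static int32_t mulq_rs_w_model(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b + (1 << 30)) >> 31);
}

/* Hypothetical 4-pixel step: all 16 products are computed before any
 * of them is consumed by the packing code. */
static void mul_4_pixels(const uint32_t src[4], const uint8_t mask[4],
                         uint32_t dst[4])
{
    int32_t mq[4], p[16];
    int i, c;

    assert(src && mask && dst);

    /* Premultiply the four mask values by 0x808080. */
    for (i = 0; i < 4; i++)
        mq[i] = (int32_t)(mask[i] * 8421504u);

    /* 16 independent multiplies, no result is used inside this loop. */
    for (i = 0; i < 4; i++)
        for (c = 0; c < 4; c++)
            p[4 * i + c] = mulq_rs_w_model(
                (int32_t)((src[i] >> (8 * c)) & 0xff), mq[i]);

    /* Only now pack the channels back into a8r8g8b8 pixels. */
    for (i = 0; i < 4; i++)
        dst[i] = (uint32_t)p[4 * i] |
                 ((uint32_t)p[4 * i + 1] << 8) |
                 ((uint32_t)p[4 * i + 2] << 16) |
                 ((uint32_t)p[4 * i + 3] << 24);
}
```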

-- 
Best regards,
Siarhei Siamashka

