[Pixman] [PATCH] ARMv5 optimisations

Sat Sep 29 16:46:57 PDT 2012

On Thu, 27 Sep 2012 15:18:47 +1200
Andre Renaud <andre at bluewatersys.com> wrote:

> Hi,
> I've put together an ARMv5 optimised blitter & fill routine (heavily
> based on some code from the Android pixelflinger system). It provides
> 16 & 32-bit fills, over 8888 -> 8888 blitting and over 8888 -> 0565
> blitting. I realise that ARMv5 is getting pretty old now, but it's
> probably still popular enough that someone else might want this.

Hi. Thanks for sharing this information and patches. The benchmark
numbers for pixelflinger code are especially interesting.

However running "make check" shows that this fails some pixman
correctness expectations:

PASS: a1-trap-test
PASS: pdf-op-test
PASS: region-test
PASS: region-translate-test
PASS: fetch-test
rotate test passed (checksum=03A24D51)
PASS: rotate-test
PASS: oob-test
PASS: infinite-loop
PASS: trap-crasher
PASS: alpha-loop
PASS: scaling-crash-test
PASS: scaling-helpers-test
PASS: gradient-crash-test
region_contains test passed (checksum=D2BF8C73)
PASS: region-contains-test
PASS: alphamap
PASS: stress-test
composite traps test passed (checksum=E3112106)
PASS: composite-traps-test
blitters test failed! (checksum=03E77E23, expected 3E1DD2E8)
FAIL: blitters-test
glyph test passed (checksum=741CB2DB)
PASS: glyph-test
scaling test failed! (checksum=5BC5CFDC, expected 03A23E0C)
FAIL: scaling-test
affine test failed! (checksum=12AA0D73, expected 74050F50)
FAIL: affine-test
---- Test 4555550 failed ----
Operator:      OVER 
Source:        a8r8g8b8, 1x1
Destination:   a8r8g8b8, 1x1

               R     G     B     A         Rounded
Source color:  1.000 1.000 1.000 0.000     1.000 1.000 1.000 0.000
Dest. color:   1.000 1.000 1.000 1.000     1.000 1.000 1.000 1.000
Expected:      1.000 1.000 1.000 1.000
Got:             255   255   254     0  [pixel: 0x00fffffe]
Min accepted:    254   254   254   254
Max accepted:    256   256   256   256
Test 0x0045831E failed.
FAIL: composite
=============================================
4 of 22 tests failed

> I've attached the patch here (please excuse the lack of inline patch -
> I'll blame that on a combination of my lack of git knowledge, and
> gmails inability to do text emails without line wrapping).

There is a useful manual about git here:
    http://www.kernel.org/pub/software/scm/git/docs/user-manual.html#patch-series

Use "git format-patch --cover-letter HEAD^^^^" to create the patches.
The number of '^' characters above specifies the number of patches in
your series you want to submit. Edit the cover letter and then use
"git send-email --to=pixman at lists.freedesktop.org *.patch" to send the
patches to the list.

Github also has some introductory documentation. Nowadays it's
difficult to do software development without being familiar with git.
It is used by way too many open source projects :)

> As the code comes from the Android code base, it is Apache licensed -
> is this a problem with Pixman?

I am not a lawyer, but Apache license does not seem to be
compatible with GPL2 according to:
   http://www.apache.org/licenses/GPL-compatibility.html
This is an extra restriction, which does not exist in the current
pixman MIT/X11 license. I personally would not touch Apache licensed
code without a really good reason.

> For performance results, here are the relevant numbers (tested on an
> AT91SAM9G45 - ARM926EJS). The first set of the pair is before the
> change, the second is after.
> 
> reference memcpy speed = 230.1MB/s (57.5MP/s for 32bpp fills)
> 
>               src_n_0565 =  L1:  19.33  L2:  19.75  M: 19.74 ( 17.16%)
>  HT: 15.77  VT: 15.27  R: 15.25  RT:  7.60 (  46Kops/s)
>               src_n_0565 =  L1: 195.58  L2: 126.99  M:128.46 (110.53%)
>  HT: 46.37  VT: 45.73  R: 42.87  RT: 11.26 (  50Kops/s)
> 
>               src_n_x888 =  L1:  19.32  L2:  19.75  M: 19.78 ( 34.39%)
>  HT: 15.79  VT: 15.38  R: 15.31  RT:  7.64 (  48Kops/s)
>               src_n_x888 =  L1: 111.67  L2:  66.34  M: 65.85 (113.33%)
>  HT: 36.00  VT: 35.79  R: 33.80  RT: 11.09 (  50Kops/s)
> 
>               src_n_8888 =  L1:  19.33  L2:  19.75  M: 19.78 ( 34.39%)
>  HT: 15.80  VT: 15.39  R: 15.32  RT:  7.64 (  48Kops/s)
>               src_n_8888 =  L1: 109.40  L2:  66.36  M: 65.89 (113.39%)
>  HT: 36.03  VT: 35.90  R: 33.81  RT: 11.11 (  50Kops/s)
> 
>           over_8888_0565 =  L1:   5.08  L2:   4.49  M:  4.48 ( 11.68%)
>  HT:  2.80  VT:  2.71  R:  2.69  RT:  2.22 (  23Kops/s)
>           over_8888_0565 =  L1:  20.96  L2:  14.48  M: 14.63 ( 37.78%)
>  HT:  8.95  VT:  8.60  R:  8.19  RT:  4.48 (  37Kops/s)
> 
>           over_8888_8888 =  L1:   3.21  L2:   3.17  M:  4.22 ( 14.68%)
>  HT:  6.20  VT:  6.58  R:  6.63  RT:  4.28 (  36Kops/s)
>           over_8888_8888 =  L1:  21.84  L2:  12.40  M: 12.88 ( 44.32%)
>  HT:  8.77  VT:  8.85  R:  8.84  RT:  4.68 (  38Kops/s)
> 
>           over_8888_x888 =  L1:   9.24  L2:   8.25  M:  8.12 ( 28.22%)
>  HT:  6.83  VT:  6.77  R:  6.67  RT:  4.29 (  36Kops/s)
>           over_8888_x888 =  L1:  21.83  L2:  12.39  M: 12.88 ( 44.32%)
>  HT:  8.76  VT:  8.85  R:  8.84  RT:  4.68 (  38Kops/s)

As for the patch, there are some comments:

> +dnl Check for ARMv5
> +
> +AC_ARG_ENABLE(arm-v5,
> +   [AC_HELP_STRING([--disable-arm-v5],
> +                   [disable ARM V5 fast paths])],
> +   [enable_arm_v5=$enableval], [enable_arm_v5=no])
> +
> +if test $enable_arm_v5 = no ; then
> +   have_arm_v5=disabled
> +else
> +    have_arm_v5=yes
> +fi
> +
> +if test $have_arm_v5 = yes ; then
> +   AC_DEFINE(USE_ARM_V5, 1, [use ARM v5 assembly optimizations])
> +fi
> +
> +AM_CONDITIONAL(USE_ARM_V5, test $have_arm_v5 = yes)
> +
> +AC_MSG_CHECKING(whether to use ARM V5 assembler)
> +AC_MSG_RESULT($have_arm_v5)

Looks like arm-v5 support is disabled by default unless overrided by
configure option. This is not consistent with the rest of the cpu
specific optimizations (arm-simd, arm-neon). Normally everything is
compiled in, but only gets used when the needed cpu features are
detected at runtime.

> +pixman_arm_blitrow_32:
> +
> +    push     {r4-r11}
> +/*
> + * r0 - dst
> + * r1 - src
> + * r2 - count
> + */
> +.Lresidual_loop:
> +    mov      r10, #0xFF
> +    orr      r10, r10, r10, lsl #16    //mask = r10 = 0x00FF00FF
> +
> +    subs     r2, r2, #2
> +    blt      .Lblitrow32_single_loop
> +
> +.Lblitrow32_double_loop:
> +    ldm      r0, {r3, r4}
> +    ldm      r1!, {r5, r6}
> +
> +    orrs     r9, r3, r4
> +    beq      .Lblitrow32_loop_cond
> +
> +    // First iteration
> +    lsr      r7, r5, #24               //extract alpha
> +    and      r8, r3, r10               //rb = (dst & mask)
> +    rsb      r7, r7, #256              //r5 = scale = (255-alpha)+1
> +    and      r9, r10, r3, lsr #8       //ag = (dst>>8) & mask
> +
> +    mul      r11, r8, r7               //RB = rb * scale
> +    mul      r3, r9, r7                //AG = ag * scale
> +
> +    // combine RB and AG
> +    and      r11, r10, r11, lsr #8     //r8 = (RB>>8) & mask
> +    and      r3, r3, r10, lsl #8       //r9 = AG & ~mask
> +
> +    lsr      r7, r6, #24               //extract alpha for second
> iteration
> +    orr      r3, r3, r11
> +
> +    // Second iteration
> +    and      r8, r4, r10               //rb = (dst & mask)
> +    rsb      r7, r7, #256              //r5 = scale = (255-alpha)+1
> +    and      r9, r10, r4, lsr #8       //ag = (dst>>8) & mask
> +
> +    mul      r11, r8, r7               //RB = rb * scale
> +    mul      r4, r9, r7                //AG = ag * scale
> +
> +    // combine RB and AG
> +    and      r11, r10, r11, lsr #8     //r8 = (RB>>8) & mask
> +    and      r4, r4, r10, lsl #8       //r9 = AG & ~mask
> +    orr      r4, r4, r11
> +
> +    // add src to combined value
> +    add      r5, r5, r3
> +    add      r6, r6, r4
> +
> +.Lblitrow32_loop_cond:
> +    subs     r2, r2, #2
> +    stm      r0!, {r5, r6}
> +
> +    bge      .Lblitrow32_double_loop
> +
> +.Lblitrow32_single_loop:
> +    adds     r2, #1
> +    blo      .Lexit
> +
> +    ldr      r3, [r0]
> +    ldr      r5, [r1], #4
> +
> +    cmp      r3, #0
> +    beq      .Lblitrow32_single_store
> +
> +    lsr      r7, r5, #24               //extract alpha
> +    and      r8, r3, r10               //rb = (dst & mask)
> +    rsb      r7, r7, #256              //r5 = scale = (255-alpha)+1
> +    and      r9, r10, r3, lsr #8       //ag = (dst>>8) & mask
> +
> +    mul      r8, r8, r7                //RB = rb * scale
> +    mul      r9, r9, r7                //AG = ag * scale
> +
> +    // combine RB and AG
> +    and      r8, r10, r8, lsr #8       //r8 = (RB>>8) & mask
> +    and      r9, r9, r10, lsl #8       //r9 = AG & ~mask
> +    orr      r3, r8, r9

This code does
    (x * (255 - a + 1)) / 256
instead of
    (x * (255 - a) + 127) / 255

It's not necessarily bad, just trades accuracy for performance. At
least the corner opaque/transparent cases are handled normally.

> +    add      r5, r5, r3                //add src to combined value

And here pixman normally does saturated addition. The differences show
up when at least one of the R/G/B color components of the source
pixel is bigger than alpha. This is described in "Additive and blended
alpha in a single operation" part of
    http://home.comcast.net/~tom_forsyth/blog.wiki.html#[[Premultiplied%20alpha]]

The older discussion in the pixman mailing list was here:
    http://comments.gmane.org/gmane.comp.graphics.pixman/454

All modern processors have no problems with saturated addition
(that's handled by UQADD8 instruction starting from ARMv6). But
emulating saturated addition becomes a serious performance problem
for ARMv5, as it is done with the following code:
    http://cgit.freedesktop.org/pixman/tree/pixman/pixman-combine.h.template?id=pixman-0.27.2#n57
    http://cgit.freedesktop.org/pixman/tree/pixman/pixman-combine.h.template?id=pixman-0.27.2#n209
When having both source and destination pixels split into
RB (00 RR 00 BB) and AG (00 AA 00 GG) parts, performing the
saturated addition for these parts and combining the results
needs 11 ARM instructions in the best case if the compiler does its
job correctly (and often it does not). The code from pixelflinger and
pixman ARMv6 code only uses 2 instructions for this (not to mention
that splitting the source pixel into AG and RB parts is not needed).

Saturated addition emulation is very slow. However verifying
whether the saturated addition is needed can be done a bit faster:
    http://permalink.gmane.org/gmane.comp.graphics.pixman/461

Or in ARM assembly (should work, but I haven't tested it):
    orr     TMP, SRC, #0xFF0000
    cmp     SRC, SRC, lsl #8
    cmphs   TMP, SRC, lsl #16
    cmphs   SRC, SRC, lsl #24
    blo     <we_actually_need_saturated_addition>

That's 5 instructions, which is less overhead than actually
performing saturated addition every time. And if we have a
compositing operation with solid source, this check can be
done just once per function outside the loop.

Everything depends on whether we want to keep the current pixman
behaviour also on ARMv5 and just minimize the performance hit.
Or cut the corners in the same way as pixelflinger.

> +.Lblitrow32_single_store:
> +    str      r5, [r0], #4
> +
> +.Lexit:
> +    pop      {r4-r11}
> +    bx       lr

> #if defined(USE_ARM_SIMD) || defined(USE_ARM_NEON) || defined(USE_ARM_IWMMXT)

This also needs "|| defined(USE_ARM_V5)" or the code will fail to
compile if you disable arm-simd, arm-neon and arm-iwmmxt, but enable
just arm-v5.

> @@ -168,6 +169,8 @@ detect_cpu_features (void)
> 		    features |= (ARM_V7 | ARM_V6);
> 		else if (strncmp (plat, "v6l", 3) == 0)
> 		    features |= ARM_V6;
>+		else if (strncmp (plat, "v5l", 3) == 0)
>+		    features |= ARM_V5;
> 	    }
> 	}

The ARMv6 and ARMv7 processors also can execute ARMv5 enhanced dsp
instructions, but you are setting ARM_V5 bit only for "v5l".
And to be more correct, this should be a check not for ARMv5, but
for "edsp" feature.

Also the assembly file should have ".func" / ".endfunc" directives
around functions (otherwise thumb interworking can be broken) and mark
the stack as non-executable (potential security issue).

BTW, the nearest scaling code from pixman-arm-simd-asm.S can be
moved to the arm-v5 file if pixman gets ARMv5 support.

-- 
Best regards,
Siarhei Siamashka