[Pixman] [PATCH] ARMv5 optimisations
Siarhei Siamashka
siarhei.siamashka at gmail.com
Sat Sep 29 16:46:57 PDT 2012
On Thu, 27 Sep 2012 15:18:47 +1200
Andre Renaud <andre at bluewatersys.com> wrote:
> Hi,
> I've put together an ARMv5 optimised blitter & fill routine (heavily
> based on some code from the Android pixelflinger system). It provides
> 16 & 32-bit fills, over 8888 -> 8888 blitting and over 8888 -> 0565
> blitting. I realise that ARMv5 is getting pretty old now, but it's
> probably still popular enough that someone else might want this.
Hi. Thanks for sharing this information and patches. The benchmark
numbers for pixelflinger code are especially interesting.
However running "make check" shows that this fails some pixman
correctness expectations:
PASS: a1-trap-test
PASS: pdf-op-test
PASS: region-test
PASS: region-translate-test
PASS: fetch-test
rotate test passed (checksum=03A24D51)
PASS: rotate-test
PASS: oob-test
PASS: infinite-loop
PASS: trap-crasher
PASS: alpha-loop
PASS: scaling-crash-test
PASS: scaling-helpers-test
PASS: gradient-crash-test
region_contains test passed (checksum=D2BF8C73)
PASS: region-contains-test
PASS: alphamap
PASS: stress-test
composite traps test passed (checksum=E3112106)
PASS: composite-traps-test
blitters test failed! (checksum=03E77E23, expected 3E1DD2E8)
FAIL: blitters-test
glyph test passed (checksum=741CB2DB)
PASS: glyph-test
scaling test failed! (checksum=5BC5CFDC, expected 03A23E0C)
FAIL: scaling-test
affine test failed! (checksum=12AA0D73, expected 74050F50)
FAIL: affine-test
---- Test 4555550 failed ----
Operator: OVER
Source: a8r8g8b8, 1x1
Destination: a8r8g8b8, 1x1
R G B A Rounded
Source color: 1.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000
Dest. color: 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
Expected: 1.000 1.000 1.000 1.000
Got: 255 255 254 0 [pixel: 0x00fffffe]
Min accepted: 254 254 254 254
Max accepted: 256 256 256 256
Test 0x0045831E failed.
FAIL: composite
=============================================
4 of 22 tests failed
> I've attached the patch here (please excuse the lack of inline patch -
> I'll blame that on a combination of my lack of git knowledge, and
> gmails inability to do text emails without line wrapping).
There is a useful manual about git here:
http://www.kernel.org/pub/software/scm/git/docs/user-manual.html#patch-series
Use "git format-patch --cover-letter HEAD^^^^" to create the patches.
The number of '^' characters above specifies the number of patches in
your series you want to submit. Edit the cover letter and then use
"git send-email --to=pixman at lists.freedesktop.org *.patch" to send the
patches to the list.
Github also has some introductory documentation. Nowadays it's
difficult to do software development without being familiar with git.
It is used by way too many open source projects :)
> As the code comes from the Android code base, it is Apache licensed -
> is this a problem with Pixman?
I am not a lawyer, but Apache license does not seem to be
compatible with GPL2 according to:
http://www.apache.org/licenses/GPL-compatibility.html
This is an extra restriction, which does not exist in the current
pixman MIT/X11 license. I personally would not touch Apache licensed
code without a really good reason.
> For performance results, here are the relevant numbers (tested on an
> AT91SAM9G45 - ARM926EJS). The first set of the pair is before the
> change, the second is after.
>
> reference memcpy speed = 230.1MB/s (57.5MP/s for 32bpp fills)
>
> src_n_0565 = L1: 19.33 L2: 19.75 M: 19.74 ( 17.16%)
> HT: 15.77 VT: 15.27 R: 15.25 RT: 7.60 ( 46Kops/s)
> src_n_0565 = L1: 195.58 L2: 126.99 M:128.46 (110.53%)
> HT: 46.37 VT: 45.73 R: 42.87 RT: 11.26 ( 50Kops/s)
>
> src_n_x888 = L1: 19.32 L2: 19.75 M: 19.78 ( 34.39%)
> HT: 15.79 VT: 15.38 R: 15.31 RT: 7.64 ( 48Kops/s)
> src_n_x888 = L1: 111.67 L2: 66.34 M: 65.85 (113.33%)
> HT: 36.00 VT: 35.79 R: 33.80 RT: 11.09 ( 50Kops/s)
>
> src_n_8888 = L1: 19.33 L2: 19.75 M: 19.78 ( 34.39%)
> HT: 15.80 VT: 15.39 R: 15.32 RT: 7.64 ( 48Kops/s)
> src_n_8888 = L1: 109.40 L2: 66.36 M: 65.89 (113.39%)
> HT: 36.03 VT: 35.90 R: 33.81 RT: 11.11 ( 50Kops/s)
>
> over_8888_0565 = L1: 5.08 L2: 4.49 M: 4.48 ( 11.68%)
> HT: 2.80 VT: 2.71 R: 2.69 RT: 2.22 ( 23Kops/s)
> over_8888_0565 = L1: 20.96 L2: 14.48 M: 14.63 ( 37.78%)
> HT: 8.95 VT: 8.60 R: 8.19 RT: 4.48 ( 37Kops/s)
>
> over_8888_8888 = L1: 3.21 L2: 3.17 M: 4.22 ( 14.68%)
> HT: 6.20 VT: 6.58 R: 6.63 RT: 4.28 ( 36Kops/s)
> over_8888_8888 = L1: 21.84 L2: 12.40 M: 12.88 ( 44.32%)
> HT: 8.77 VT: 8.85 R: 8.84 RT: 4.68 ( 38Kops/s)
>
> over_8888_x888 = L1: 9.24 L2: 8.25 M: 8.12 ( 28.22%)
> HT: 6.83 VT: 6.77 R: 6.67 RT: 4.29 ( 36Kops/s)
> over_8888_x888 = L1: 21.83 L2: 12.39 M: 12.88 ( 44.32%)
> HT: 8.76 VT: 8.85 R: 8.84 RT: 4.68 ( 38Kops/s)
As for the patch, there are some comments:
> +dnl Check for ARMv5
> +
> +AC_ARG_ENABLE(arm-v5,
> + [AC_HELP_STRING([--disable-arm-v5],
> + [disable ARM V5 fast paths])],
> + [enable_arm_v5=$enableval], [enable_arm_v5=no])
> +
> +if test $enable_arm_v5 = no ; then
> + have_arm_v5=disabled
> +else
> + have_arm_v5=yes
> +fi
> +
> +if test $have_arm_v5 = yes ; then
> + AC_DEFINE(USE_ARM_V5, 1, [use ARM v5 assembly optimizations])
> +fi
> +
> +AM_CONDITIONAL(USE_ARM_V5, test $have_arm_v5 = yes)
> +
> +AC_MSG_CHECKING(whether to use ARM V5 assembler)
> +AC_MSG_RESULT($have_arm_v5)
Looks like arm-v5 support is disabled by default unless overrided by
configure option. This is not consistent with the rest of the cpu
specific optimizations (arm-simd, arm-neon). Normally everything is
compiled in, but only gets used when the needed cpu features are
detected at runtime.
> +pixman_arm_blitrow_32:
> +
> + push {r4-r11}
> +/*
> + * r0 - dst
> + * r1 - src
> + * r2 - count
> + */
> +.Lresidual_loop:
> + mov r10, #0xFF
> + orr r10, r10, r10, lsl #16 //mask = r10 = 0x00FF00FF
> +
> + subs r2, r2, #2
> + blt .Lblitrow32_single_loop
> +
> +.Lblitrow32_double_loop:
> + ldm r0, {r3, r4}
> + ldm r1!, {r5, r6}
> +
> + orrs r9, r3, r4
> + beq .Lblitrow32_loop_cond
> +
> + // First iteration
> + lsr r7, r5, #24 //extract alpha
> + and r8, r3, r10 //rb = (dst & mask)
> + rsb r7, r7, #256 //r5 = scale = (255-alpha)+1
> + and r9, r10, r3, lsr #8 //ag = (dst>>8) & mask
> +
> + mul r11, r8, r7 //RB = rb * scale
> + mul r3, r9, r7 //AG = ag * scale
> +
> + // combine RB and AG
> + and r11, r10, r11, lsr #8 //r8 = (RB>>8) & mask
> + and r3, r3, r10, lsl #8 //r9 = AG & ~mask
> +
> + lsr r7, r6, #24 //extract alpha for second
> iteration
> + orr r3, r3, r11
> +
> + // Second iteration
> + and r8, r4, r10 //rb = (dst & mask)
> + rsb r7, r7, #256 //r5 = scale = (255-alpha)+1
> + and r9, r10, r4, lsr #8 //ag = (dst>>8) & mask
> +
> + mul r11, r8, r7 //RB = rb * scale
> + mul r4, r9, r7 //AG = ag * scale
> +
> + // combine RB and AG
> + and r11, r10, r11, lsr #8 //r8 = (RB>>8) & mask
> + and r4, r4, r10, lsl #8 //r9 = AG & ~mask
> + orr r4, r4, r11
> +
> + // add src to combined value
> + add r5, r5, r3
> + add r6, r6, r4
> +
> +.Lblitrow32_loop_cond:
> + subs r2, r2, #2
> + stm r0!, {r5, r6}
> +
> + bge .Lblitrow32_double_loop
> +
> +.Lblitrow32_single_loop:
> + adds r2, #1
> + blo .Lexit
> +
> + ldr r3, [r0]
> + ldr r5, [r1], #4
> +
> + cmp r3, #0
> + beq .Lblitrow32_single_store
> +
> + lsr r7, r5, #24 //extract alpha
> + and r8, r3, r10 //rb = (dst & mask)
> + rsb r7, r7, #256 //r5 = scale = (255-alpha)+1
> + and r9, r10, r3, lsr #8 //ag = (dst>>8) & mask
> +
> + mul r8, r8, r7 //RB = rb * scale
> + mul r9, r9, r7 //AG = ag * scale
> +
> + // combine RB and AG
> + and r8, r10, r8, lsr #8 //r8 = (RB>>8) & mask
> + and r9, r9, r10, lsl #8 //r9 = AG & ~mask
> + orr r3, r8, r9
This code does
(x * (255 - a + 1)) / 256
instead of
(x * (255 - a) + 127) / 255
It's not necessarily bad, just trades accuracy for performance. At
least the corner opaque/transparent cases are handled normally.
> + add r5, r5, r3 //add src to combined value
And here pixman normally does saturated addition. The differences show
up when at least one of the R/G/B color components of the source
pixel is bigger than alpha. This is described in "Additive and blended
alpha in a single operation" part of
http://home.comcast.net/~tom_forsyth/blog.wiki.html#[[Premultiplied%20alpha]]
The older discussion in the pixman mailing list was here:
http://comments.gmane.org/gmane.comp.graphics.pixman/454
All modern processors have no problems with saturated addition
(that's handled by UQADD8 instruction starting from ARMv6). But
emulating saturated addition becomes a serious performance problem
for ARMv5, as it is done with the following code:
http://cgit.freedesktop.org/pixman/tree/pixman/pixman-combine.h.template?id=pixman-0.27.2#n57
http://cgit.freedesktop.org/pixman/tree/pixman/pixman-combine.h.template?id=pixman-0.27.2#n209
When having both source and destination pixels split into
RB (00 RR 00 BB) and AG (00 AA 00 GG) parts, performing the
saturated addition for these parts and combining the results
needs 11 ARM instructions in the best case if the compiler does its
job correctly (and often it does not). The code from pixelflinger and
pixman ARMv6 code only uses 2 instructions for this (not to mention
that splitting the source pixel into AG and RB parts is not needed).
Saturated addition emulation is very slow. However verifying
whether the saturated addition is needed can be done a bit faster:
http://permalink.gmane.org/gmane.comp.graphics.pixman/461
Or in ARM assembly (should work, but I haven't tested it):
orr TMP, SRC, #0xFF0000
cmp SRC, SRC, lsl #8
cmphs TMP, SRC, lsl #16
cmphs SRC, SRC, lsl #24
blo <we_actually_need_saturated_addition>
That's 5 instructions, which is less overhead than actually
performing saturated addition every time. And if we have a
compositing operation with solid source, this check can be
done just once per function outside the loop.
Everything depends on whether we want to keep the current pixman
behaviour also on ARMv5 and just minimize the performance hit.
Or cut the corners in the same way as pixelflinger.
> +.Lblitrow32_single_store:
> + str r5, [r0], #4
> +
> +.Lexit:
> + pop {r4-r11}
> + bx lr
> #if defined(USE_ARM_SIMD) || defined(USE_ARM_NEON) || defined(USE_ARM_IWMMXT)
This also needs "|| defined(USE_ARM_V5)" or the code will fail to
compile if you disable arm-simd, arm-neon and arm-iwmmxt, but enable
just arm-v5.
> @@ -168,6 +169,8 @@ detect_cpu_features (void)
> features |= (ARM_V7 | ARM_V6);
> else if (strncmp (plat, "v6l", 3) == 0)
> features |= ARM_V6;
>+ else if (strncmp (plat, "v5l", 3) == 0)
>+ features |= ARM_V5;
> }
> }
The ARMv6 and ARMv7 processors also can execute ARMv5 enhanced dsp
instructions, but you are setting ARM_V5 bit only for "v5l".
And to be more correct, this should be a check not for ARMv5, but
for "edsp" feature.
Also the assembly file should have ".func" / ".endfunc" directives
around functions (otherwise thumb interworking can be broken) and mark
the stack as non-executable (potential security issue).
BTW, the nearest scaling code from pixman-arm-simd-asm.S can be
moved to the arm-v5 file if pixman gets ARMv5 support.
--
Best regards,
Siarhei Siamashka
More information about the Pixman
mailing list