[Pixman] Pixman library on wince arm
Siarhei Siamashka
siarhei.siamashka at gmail.com
Wed Sep 29 08:02:25 PDT 2010
On Wednesday 29 September 2010 07:22:34 PS wrote:
> I used arm neon code from pixman as a reference to see how I could optimize
> my code.
Thanks, any additional review of the NEON code is always welcome, at least to
make sure that I'm not doing something really stupid there without noticing.
> The most useful function in my case would be
> pixman_composite_scanline_over_n_8888 (could be generated with
> generate_composite_function_single_scanline). I reviewed asm code
> generated with it and made some changes to adapt it to my needs. I think
> similar changes could be useful in pixman. The first change I made is to
> accept only pointer and count of pixels, and I made a separate function to
> set the color (q0 and q1 registers). The neon specific change I made is:
> q0 contains 0x11aaaaaa (where aa is the 0xff-alpha),
> q1 contains 16-bit rgb values of the src color premultiplied by alpha as
> follows (color * alpha + 0x80). q1 is set to contain 4 16-bit values: 0
> for alpha, and the other for rgb values.
>
> Here's what pixman produces now:
>
> vuzp.8 d4, d5
> vuzp.8 d6, d7
> vuzp.8 d5, d7
> vuzp.8 d4, d6
>
> vmvn d24, d3
> vmull.u8 q8, d24, d4
> vmull.u8 q9, d24, d5
> vmull.u8 q10, d24, d6
> vmull.u8 q11, d24, d7
>
> vrshr.u16 q14, q8, #8
> vrshr.u16 q15, q9, #8
> vrshr.u16 q12, q10, #8
> vrshr.u16 q13, q11, #8
>
> vraddhn.i16 d28, q14, q8
> vraddhn.i16 d29, q15, q9
> vraddhn.i16 d30, q12, q10
> vraddhn.i16 d31, q13, q11
>
> vqadd.u8 q14, q0, q14
> vqadd.u8 q15, q1, q15
>
> vzip.8 d28, d30
> vzip.8 d29, d31
> vzip.8 d30, d31
> vzip.8 d28, d29
Actually the inner loop for this operation looks like this:
23d8: f3d8c270 vrshr.u16 q14, q8, #8
23dc: f3d8e272 vrshr.u16 q15, q9, #8
23e0: f3d88274 vrshr.u16 q12, q10, #8
23e4: f3d8a276 vrshr.u16 q13, q11, #8
23e8: f3ccc4a0 vraddhn.i16 d28, q14, q8
23ec: f3ced4a2 vraddhn.i16 d29, q15, q9
23f0: f3c8e4a4 vraddhn.i16 d30, q12, q10
23f4: f3caf4a6 vraddhn.i16 d31, q13, q11
23f8: f340c07c vqadd.u8 q14, q0, q14
23fc: f342e07e vqadd.u8 q15, q1, q15
2400: f426402d vld4.8 {d4-d7}, [r6, :128]!
2404: f442c02d vst4.8 {d28-d31}, [r2, :128]!
2408: f3f08583 vmvn d24, d3
240c: f3c80c84 vmull.u8 q8, d24, d4
2410: f3c82c85 vmull.u8 q9, d24, d5
2414: f3c84c86 vmull.u8 q10, d24, d6
2418: f3c86c87 vmull.u8 q11, d24, d7
241c: e2500008 subs r0, r0, #8
2420: aaffffec bge 23d8
The VUZP/VZIP instructions are only used by the automatically expanded macros
for handling leading/trailing pixels, because directly using vld4/vst4
instructions is problematic there (we don't want full 64-bit/128-bit
loads/stores but need to be able to work down to single-pixel granularity).
This is also described in the comments in the pixman NEON source files.
You are right about the redundant VMVN instruction (its result could have been
precalculated outside of the main loop) and also about the deinterleaving,
which is really unnecessary here. The explanation is that this particular code
reuses macros from the over_8888_8888 operation, and that's why it has this
redundancy. Also, the instruction scheduling is not optimal for the Cortex-A8
pipeline and can definitely be improved. Some more details about how to get the
best performance out of it can be found here:
http://lists.cairographics.org/archives/cairo/2009-October/018434.html
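As a rough scalar illustration of the VMVN point above, here is a minimal C sketch of a solid-color OVER scanline in which the inverted source alpha is computed once, before the loop. The function and helper names are hypothetical and are not pixman API, and the sketch assumes a premultiplied a8r8g8b8 source color. It only models the arithmetic, not the NEON register layout or scheduling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rounded x * a / 255 for 8-bit inputs, without a division. */
static uint8_t mul_un8(uint8_t x, uint8_t a)
{
    uint32_t t = (uint32_t)x * a + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Saturating 8-bit add, the scalar counterpart of VQADD.U8. */
static uint8_t add_sat_un8(uint8_t x, uint8_t y)
{
    uint32_t s = (uint32_t)x + y;
    return (uint8_t)(s > 255 ? 255 : s);
}

/* Hypothetical reference: OVER with a solid premultiplied a8r8g8b8 source. */
void over_n_8888_scanline_ref(uint32_t *dst, size_t count, uint32_t src)
{
    assert(dst != NULL || count == 0);

    /* Loop invariant: inverted source alpha, computed once instead of
     * once per iteration (the scalar analogue of hoisting VMVN). */
    const uint8_t inv_alpha = (uint8_t)(255 - (src >> 24));

    for (size_t i = 0; i < count; i++) {
        uint32_t d = dst[i];
        uint32_t out = 0;
        /* Same formula for all four channels: src + dst * (255 - alpha) / 255 */
        for (int shift = 0; shift < 32; shift += 8) {
            uint8_t dc = (uint8_t)(d >> shift);
            uint8_t sc = (uint8_t)(src >> shift);
            out |= (uint32_t)add_sat_un8(sc, mul_un8(dc, inv_alpha)) << shift;
        }
        dst[i] = out;
    }
}
```

Calling over_n_8888_scanline_ref(buf, n, 0xFFFFFFFF) fills the buffer with 0xFFFFFFFF, and calling it with 0x00000000 leaves the buffer as it was.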
The whole pixman neon code was designed so that it should:
a) provide an easy way to implement any non-transformed fast paths for 2D
compositing operations in a very short time (lazy implementations)
b) provide the possibility to fine-tune the inner loop for perfect instruction
scheduling if really needed (optimized implementations)
c) provide efficient software prefetch for getting the best out of the
available memory bandwidth (ARM Cortex-A8 does not have a hardware prefetcher)
The code also has comments regarding the differences between "lazy" and
"optimized" implementations; you can refer to them to see some examples.
While implementing all of this, it was discovered that the ARM Cortex-A8
processor from the OMAP3 chip used in the N900 phone has relatively slow memory
performance, which makes the advantages of "optimized" over "lazy"
implementations practically insignificant (they are mostly only beneficial when
working with data in L1 or L2 cache). This effectively killed all the
motivation for investing extra effort into providing good instruction
scheduling for the less commonly used operations. I would really like to get
some other NEON capable hardware with an ARM Cortex-A5 or ARM Cortex-A9
processor eventually, just out of curiosity, to see if it could benefit more
from "optimized" implementations.
This particular over_n_8888 NEON function happens to be a "lazy" one (and is
marked as such by a comment in the code). I also forgot to put the prefetch
macro in the main loop, and this is something that really should be fixed to
get better performance (a patch will follow).
If you are really interested in seeing how much this function can be improved,
I can provide a fully optimized implementation a bit later.
I also wonder if the partially optimized functions would benefit from a special
suffix in the function name, so that they could be easily distinguished in
profiling logs. If they show up high in some use case, it would make sense to
check whether they can be improved further.
> I do this kind of operations:
> vld1.32 {q2, q3}, [r0, :128]
> vmull.u8 q4, d4, d0
> vmull.u8 q5, d5, d0
> vmull.u8 q6, d6, d0
> vmull.u8 q7, d7, d0
>
> vaddhn.i16 d16, q4, q1 //<--- should not be vraddhn
> vaddhn.i16 d17, q5, q1
> vaddhn.i16 d18, q6, q1
> vaddhn.i16 d19, q7, q1
>
> vst1.32 {q8, q9}, [r0, :128]!
>
>
> This way it uses fewer neon registers (q12-q15 in vrshr step) and doesn't
> need to interleave/deinterleave color components and has way fewer
> instructions. It also produces results closer to what floating point alpha
> blending would produce. The only possible drawback could be if the src
> color wasn't properly premultiplied then color values would overflow.
> Since I added separate function to set fill color then I do all
> premultiplication in that function and I know that colors will never
> overflow. I copied simplified code, without all interleaving for better
> readability. Since neon code in pixman is heavily templated (macro in
> asm), I wasn't able to add my change as is to that code. Hopefully my
> change could be used in pixman as well.
One simple way to validate your code is to check that using 0xFFFFFFFF as the
source writes 0xFFFFFFFF to the destination buffer, and that using 0x00000000
as the source keeps the destination unchanged.
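To make these two checks concrete, here is a small per-channel C sketch. The function names are hypothetical. over_channel_exact follows the pixman-style rounding, while over_channel_shift8 is only an assumed scalar model of the quoted VMULL.U8 + VADDHN.I16 sequence, based on reading q0 as 0xff - alpha and q1 as color * alpha + 0x80. The real register contents may differ from this reading.

```c
#include <assert.h>
#include <stdint.h>

/* Pixman-style channel blend: src + round(dst * (255 - alpha) / 255),
 * saturated. 'src_premul' is the premultiplied source channel. */
uint8_t over_channel_exact(uint8_t src_premul, uint8_t alpha, uint8_t dst)
{
    uint32_t t = (uint32_t)dst * (255u - alpha) + 0x80;
    uint32_t r = src_premul + ((t + (t >> 8)) >> 8);
    return (uint8_t)(r > 255 ? 255 : r);
}

/* ASSUMED model of the quoted sequence:
 *   (dst * (255 - alpha) + (color * alpha + 0x80)) >> 8
 * 'color' is the non-premultiplied source channel here. */
uint8_t over_channel_shift8(uint8_t color, uint8_t alpha, uint8_t dst)
{
    uint32_t q1 = (uint32_t)color * alpha + 0x80;   /* precomputed once */
    uint32_t q4 = (uint32_t)dst * (255u - alpha);   /* VMULL.U8 */
    assert(q4 + q1 <= 0xFFFF);                      /* fits a 16-bit lane */
    return (uint8_t)((q4 + q1) >> 8);               /* VADDHN.I16 */
}

/* Returns 1 if 'blend' passes both checks for every destination value.
 * For opaque white (0xFF, 0xFF) and transparent black (0x00, 0x00) the
 * premultiplied and non-premultiplied channel values coincide, so the
 * same arguments can be passed to either function. */
int passes_checks(uint8_t (*blend)(uint8_t, uint8_t, uint8_t))
{
    for (int d = 0; d < 256; d++) {
        if (blend(0xFF, 0xFF, (uint8_t)d) != 0xFF)
            return 0;   /* opaque white must produce 0xFF */
        if (blend(0x00, 0x00, (uint8_t)d) != d)
            return 0;   /* transparent source must keep dst */
    }
    return 1;
}
```

Under this model, passes_checks(over_channel_exact) returns 1 and passes_checks(over_channel_shift8) returns 0: a plain shift by 8 turns opaque white into 0xFE, and a fully transparent source changes a destination value of 0xFF to 0xFE.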
The relationship between integer and floating point values as color components
is such that 0x00 is mapped to 0.0f, and 0xFF is mapped to 1.0f. It means
that we actually have to deal with multiplications and divisions by 255, not
256. And this adds a bit of extra complexity.
Pixman currently uses the following math as the core of the alpha blending
operation:
x = ((x * a) + (255 / 2)) / 255;
This is equivalent to the following C code, which avoids the really slow
division by taking into account the range of possible input values (and this is
what is directly implemented in the NEON assembly too):
t = x * a + 0x80;
x = (t + (t >> 8)) >> 8;
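The equivalence of the two forms above can be checked exhaustively for 8-bit inputs. The following is a minimal sketch with hypothetical function names. It assumes x and a are both in the range 0..255, which is the only range the shift-based form is meant for.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: rounded division by 255 using an actual integer division. */
static uint32_t div255_reference(uint32_t x, uint32_t a)
{
    return (x * a + (255 / 2)) / 255;
}

/* Division-free form: one multiply, two adds and two shifts. */
static uint32_t div255_shifts(uint32_t x, uint32_t a)
{
    uint32_t t = x * a + 0x80;
    return (t + (t >> 8)) >> 8;
}

/* Counts the 8-bit input pairs for which the two forms disagree. */
int count_div255_mismatches(void)
{
    int mismatches = 0;
    for (uint32_t x = 0; x <= 255; x++) {
        for (uint32_t a = 0; a <= 255; a++) {
            /* The intermediate always fits in 16 bits, which is what
             * lets the NEON code work on 16-bit lanes. */
            assert(x * a + 0x80 <= 0xFFFF);
            if (div255_reference(x, a) != div255_shifts(x, a))
                mismatches++;
        }
    }
    return mismatches;
}
```

count_div255_mismatches() returns 0, so the two forms agree on all 65536 input pairs.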
The following link is also an interesting read regarding division by constants
which are not powers of two:
http://research.swtch.com/2008/01/division-via-multiplication.html
--
Best regards,
Siarhei Siamashka