[Pixman] Pixman library on wince arm

PS block111 at mail.ru
Tue Sep 28 21:22:34 PDT 2010

> Thu, 23 Sep 2010 01:42:43 +0400 message from PS <block111 at mail.ru>:
> > > > The question I have is simple: the function
> > pixman_composite_over_n_8888_asm_##cpu takes height as the second (AFAIK)
> > parameter, it has some logic to handle height and line stride. Is there a
> > function that does the same but operates on single lines? That function
> > wouldn't need to know dst_stride and height, all I'd need to pass
> is
> > the dst_line, width, and color.
> > > 
> > > Just pass in a height of 1. The stride will then be irrelevant.
> > > 
> > > In any case, I think you may be overcomplicating things. All the
> > > things that Pixman is designed to do are accessible from the
> > > definitions in <pixman.h>.
> > > 
> > >  - Jonathan Morton
> > > 
> > Of course, I pass 1 for height :) but then there is almost no
> performance
> > improvement from that neon-asm code. Regular c function (inlined
> > while(bits<bits_end){ ...} loop) is only around 10-15% slower.
> > It took me like 5 minutes to create
> > pixman_composite_line_over_n_8888_asm_sse2(int32_t width, uint32_t *
> dst_line,
> > uint32_t src) from sse2_composite_over_n_8888, but it's kind of
> unclear
> > how I could write similar function for arm-neon
> I think I didn't search enough (too lazy to scroll a few more lines down
> from .macro generate_composite_function).
> The next macro in pixman-arm-neon-asm.h is what I was asking about, seems that
> it does what I really need (probably these pixman combine functions deal with
> cases like mine):
> /*
> * A simplified variant of function generation template for a single
> * scanline processing (for implementing pixman combine functions)
> */
> .macro generate_composite_function_single_scanline
> ...
> .

I used the ARM NEON code from pixman as a reference to see how I could optimize my code. The most useful function in my case would be pixman_composite_scanline_over_n_8888 (it could be generated with generate_composite_function_single_scanline).
I reviewed the asm code generated with it and made some changes to adapt it to my needs. I think similar changes could be useful in pixman.
The first change I made was to accept only a pointer and a pixel count, and I made a separate function to set the color (the q0 and q1 registers).
The NEON-specific change I made is:
   q0 contains 0x11aaaaaa (where aa is 0xff - alpha),
   q1 contains 16-bit rgb values of the src color premultiplied by alpha, computed as (color * alpha + 0x80). q1 is set to contain 4 16-bit values: 0 for alpha, and the others for the rgb values.

Here's what pixman produces now:

vuzp.8      d4, d5 
vuzp.8      d6, d7 
vuzp.8      d5, d7 
vuzp.8      d4, d6 

vmvn        d24, d3 
vmull.u8    q8, d24, d4 
vmull.u8    q9, d24, d5 
vmull.u8    q10, d24, d6 
vmull.u8    q11, d24, d7 

vrshr.u16   q14, q8, #8 
vrshr.u16   q15, q9, #8 
vrshr.u16   q12, q10, #8 
vrshr.u16   q13, q11, #8 

vraddhn.i16 d28, q14, q8 
vraddhn.i16 d29, q15, q9 
vraddhn.i16 d30, q12, q10 
vraddhn.i16 d31, q13, q11 

vqadd.u8    q14, q0, q14 
vqadd.u8    q15, q1, q15 

vzip.8      d28, d30 
vzip.8      d29, d31 
vzip.8      d30, d31 
vzip.8      d28, d29 
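
For readers less familiar with NEON, the per-channel arithmetic of the sequence above can be modeled in scalar C. This is only a sketch: it assumes d3 holds the source alpha and q0/q1 hold the premultiplied source color, and the function names (mul_div_255, sat_add_u8, over_channel_pixman) are hypothetical, not pixman API.

```c
#include <stdint.h>

/* Rounded (a * b) / 255, modeled on the vmull.u8 / vrshr.u16 #8 /
 * vraddhn.i16 triple in the listing. Hypothetical helper name. */
static uint8_t mul_div_255(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b;          /* vmull.u8: 8x8 -> 16-bit product   */
    uint32_t r = (t + 128) >> 8;           /* vrshr.u16 #8: rounding shift      */
    return (uint8_t)((t + r + 128) >> 8);  /* vraddhn.i16: add, round, high half */
}

/* Saturating byte add, which is what vqadd.u8 does in each lane. */
static uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return (uint8_t)(sum > 255 ? 255 : sum);
}

/* One channel of OVER with a solid, premultiplied source color.
 * Assumption: src_alpha corresponds to the value held in d3. */
static uint8_t over_channel_pixman(uint8_t src_premul, uint8_t src_alpha,
                                   uint8_t dst)
{
    uint8_t inv_alpha = (uint8_t)(255 - src_alpha);  /* vmvn d24, d3 */
    return sat_add_u8(src_premul, mul_div_255(dst, inv_alpha));
}
```

The vuzp/vzip instructions have no counterpart here because the scalar model already works on one channel at a time.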

I do this kind of operation instead:
	vld1.32     {q2, q3}, [r0, :128]
	vmull.u8    q4, d4, d0
	vmull.u8    q5, d5, d0
	vmull.u8    q6, d6, d0
	vmull.u8    q7, d7, d0

	vaddhn.i16	d16, q4, q1   //<--- should not be vraddhn
	vaddhn.i16	d17, q5, q1
	vaddhn.i16	d18, q6, q1
	vaddhn.i16	d19, q7, q1

	vst1.32     {q8, q9}, [r0, :128]!

This way it uses fewer NEON registers (q12-q15 from the vrshr step are not needed), doesn't need to interleave/deinterleave the color components, and has far fewer instructions.
It also produces results closer to what floating point alpha blending would produce. The only possible drawback is that if the src color wasn't properly premultiplied, the color values would overflow. Since I added a separate function to set the fill color, I do all the premultiplication in that function and I know that the colors will never overflow.
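
A scalar C model of one color channel of this variant may make the overflow argument easier to follow. It is a sketch under the assumptions stated in the post: the per-channel value in q1 is color * alpha + 0x80 and the multiplier in q0 is 0xff - alpha. The names fill_bias and blend_channel are hypothetical, and the alpha channel is left out.

```c
#include <stdint.h>

/* Per-channel value stored in q1 by the separate "set color" function:
 * color * alpha + 0x80, for a non-premultiplied 8-bit color. */
static uint16_t fill_bias(uint8_t color, uint8_t alpha)
{
    return (uint16_t)((unsigned)color * alpha + 0x80);
}

/* One channel of the blend: vmull.u8 followed by vaddhn.i16.
 * vaddhn adds modulo 2^16 and keeps the high byte, with no saturation,
 * so a bias that is too large wraps around instead of clamping. */
static uint8_t blend_channel(uint8_t dst, uint8_t inv_alpha, uint16_t bias)
{
    uint16_t product = (uint16_t)((unsigned)dst * inv_alpha); /* vmull.u8   */
    uint16_t sum = (uint16_t)(product + bias);                /* wraps at 16 bits */
    return (uint8_t)(sum >> 8);                               /* high half  */
}
```

With inv_alpha = 255 - alpha and a bias built by fill_bias, the largest possible sum is 255 * 255 + 128 = 65153, which still fits in 16 bits, so the add cannot wrap. Note that the high-half narrow divides by 256 rather than 255, so in this model the result can differ by one from an exact division (an opaque fill of 200 comes out as 199, for example).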
I copied the simplified code, without all the interleaving, for better readability. Since the NEON code in pixman is heavily templated (asm macros), I wasn't able to add my change as is to that code. Hopefully my change could be used in pixman as well.
