[Pixman] Pixman library on wince arm
PS
block111 at mail.ru
Wed Sep 29 09:57:05 PDT 2010
On Wed, 29 Sep 2010 18:02:25 +0300, Siarhei Siamashka <siarhei.siamashka at gmail.com> wrote:
> On Wednesday 29 September 2010 07:22:34 PS wrote:
> > I used arm neon code from pixman as a reference to see how I could optimize
> > my code.
> Thanks, any additional review of NEON code is always welcome. At least to be
> sure that I'm not doing something really stupid there without noticing.
> Actually the inner loop for this operation looks like this:
> 23d8: f3d8c270 vrshr.u16 q14, q8, #8
> 23dc: f3d8e272 vrshr.u16 q15, q9, #8
> 23e0: f3d88274 vrshr.u16 q12, q10, #8
> 23e4: f3d8a276 vrshr.u16 q13, q11, #8
> 23e8: f3ccc4a0 vraddhn.i16 d28, q14, q8
> 23ec: f3ced4a2 vraddhn.i16 d29, q15, q9
> 23f0: f3c8e4a4 vraddhn.i16 d30, q12, q10
> 23f4: f3caf4a6 vraddhn.i16 d31, q13, q11
> 23f8: f340c07c vqadd.u8 q14, q0, q14
> 23fc: f342e07e vqadd.u8 q15, q1, q15
> 2400: f426402d vld4.8 {d4-d7}, [r6, :128]!
> 2404: f442c02d vst4.8 {d28-d31}, [r2, :128]!
> 2408: f3f08583 vmvn d24, d3
> 240c: f3c80c84 vmull.u8 q8, d24, d4
> 2410: f3c82c85 vmull.u8 q9, d24, d5
> 2414: f3c84c86 vmull.u8 q10, d24, d6
> 2418: f3c86c87 vmull.u8 q11, d24, d7
> 241c: e2500008 subs r0, r0, #8
> 2420: aaffffec bge 23d8
This code is probably from pixman_composite_over_n_8888_asm_neon. The code I posted was from pixman_composite_scanline_over_n_8888_asm_neon (which doesn't exist in pixman; I had generated that function using the generate_composite_function_single_scanline macro), and then I removed some unrelated stuff from it (branching and vld/vst) for readability. I just rechecked and you are right, I copied the wrong part of the code.
In short, the NEON code in pixman does this:
vrshr.u16 q14, q8, #8
vraddhn.i16 d28, q14, q8
vqadd.u8 q14, q0, q14
...
It's one extra instruction and takes a temporary register. Otherwise the rest is equivalent (and that scanline function does not need to preserve any registers on the stack either).
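Written out per 8-bit channel in scalar C, the two variants boil down to roughly the following (a sketch for illustration only; I'm assuming ialpha holds 255 minus the source alpha, src8 the 8-bit premultiplied source that pixman keeps in q0, and src16 the 16-bit premultiplied source that my code keeps in q1):

#include <stdint.h>

/* pixman's variant (vmull.u8 + vrshr + vraddhn + vqadd.u8):
   dst*ialpha is divided by 255 with rounding, then the 8-bit
   premultiplied source is added with saturation. */
static uint8_t over_channel_pixman(uint8_t dst, uint8_t ialpha, uint8_t src8)
{
    uint16_t t = (uint16_t)dst * ialpha;                /* vmull.u8        */
    uint16_t d = (t + ((t + 0x80) >> 8) + 0x80) >> 8;   /* vrshr + vraddhn */
    uint16_t s = d + src8;                              /* vqadd.u8        */
    return s > 0xFF ? 0xFF : (uint8_t)s;
}

/* my variant (vmull.u8 + vaddhn): a single truncating narrow of the sum
   with the 16-bit premultiplied source kept at full precision. */
static uint8_t over_channel_mine(uint8_t dst, uint8_t ialpha, uint16_t src16)
{
    uint16_t t = (uint16_t)dst * ialpha;                /* vmull.u8 */
    return (uint8_t)(((uint32_t)t + src16) >> 8);       /* vaddhn   */
}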
> The instructions VUZP/VZIP are only used for handling leading/trailing
> pixels by the automatically expanded macros because directly using vld4/vst4
> instructions is problematic there (we don't want full 64-bit/128-bit
> load/stores but need to be able to do this up to a single pixel granularity).
> This is also described in the comments in the pixman neon source files.
> You are right about having redundant VMVN instruction (this could have been
> precalculated outside of the main loop) and also about deinterleaving which is
> really unnecessary here. The explanation is that this particular code reuses
> macros from the other over_8888_8888 operation, that's why it has this
> redundancy. Also instructions scheduling is not optimal for Cortex-A8 pipeline
> and can be definitely improved. Some more details about how to get the best
> performance out of it can be found here:
> http://lists.cairographics.org/archives/cairo/2009-October/018434.html
> The whole pixman neon code was designed so that it should:
> a) provide an easy way to implement any non-transformed fast paths for 2D
> compositing operations in a very short time (lazy implementations)
> b) provide the possibility to fine tune the inner loop for perfect
> instructions scheduling if really needed (optimized implementations)
> c) provide efficient software prefetch for getting the best out of the
> available memory bandwidth (ARM Cortex-A8 does not have a hardware prefetcher)
> The code also has comments regarding the differences between "lazy" and
> "optimized" implementations, you can refer to it to see some examples.
I read through these macros and I understand what they are there for. Without them, every single case would require manual coding.
> While implementing all of this, it was discovered that ARM Cortex-A8 processor
> from OMAP3 chip used in N900 phone has relatively slow memory performance
> which makes advantages of "optimized" over "lazy" implementations practically
> insignificant (they are mostly only beneficial for working with the data in L1
> or L2 cache). This effectively killed all the motivation of investing extra
> efforts into providing good instructions scheduling for the less commonly used
> operations. I really would like to get some other NEON capable hardware with
> ARM Cortex-A5 or ARM Cortex-A9 processor eventually, out of curiosity just to
> see if it could benefit from "optimized" implementations more.
I have an HTC HD2; it has a Snapdragon CPU that supports NEON. If you want, I can run the two different functions in my test and tell you what the difference is. I wrote a test that uses a real polygon that I draw in my app and then runs different functions to see what the performance difference is. So far, my function, without any scheduling improvements and without optimization for leading/trailing pixels, performs best (I don't try to zip/unzip pixels; I simply process 1, 2, and 4 pixels at a time to get 8-pixel alignment). My test has many short scanlines, so my function would benefit from that kind of optimization for sure.
What I also noticed is that more or less optimized C code compiled with the MS compiler for ARM performs extremely well. After I changed my C function, it is only about 15-20% slower than the NEON-optimized code. I have no idea how that's possible, but I guess the CPU is really well designed and runs plain ARM code extremely well. The Microsoft compiler has no NEON capabilities; it generates plain ARMv4/v5 code.
> This particular over_n_8888 neon function happens to be a "lazy" one (and
> marked as such by a comment in the code). I also forgot to put prefetch macro
> in the main loop and it is something that really should be fixed to get better
> performance (a patch will follow).
> If you are really interested to see how much this function can be improved, I
> can provide a fully optimized implementation a bit later.
> I also wonder if the partially optimized functions would benefit from a special
> suffix in the function name so that it could be possible to easily distinguish
> them in the profiling log? If they show up high on some use case, it would make
> sense to look if they can be improved more.
> > I do this kind of operations:
> > vld1.32 {q2, q3}, [r0, :128]
> > vmull.u8 q4, d4, d0
> > vmull.u8 q5, d5, d0
> > vmull.u8 q6, d6, d0
> > vmull.u8 q7, d7, d0
> >
> > vaddhn.i16 d16, q4, q1 //<--- should not be vraddhn
> > vaddhn.i16 d17, q5, q1
> > vaddhn.i16 d18, q6, q1
> > vaddhn.i16 d19, q7, q1
> >
> > vst1.32 {q8, q9}, [r0, :128]!
> >
> >
> > This way it uses less neon register (q12-q15 in vrshr step) and doesn't
> > need to interleave/deinterleave color components and has way fewer
> > instructions. It also produces results closer to what floating point alpha
> > blending would produce. The only possible drawback could be if the src
> > color wasn't properly premultiplied then color values would overflow.
> > Since I added separate function to set fill color then I do all
> > premultiplication in that function and I know that colors will never
> > overflow. I copied simplified code, without all interleaving for better
> > readability. Since neon code in pixman is heavily templated (macro in
> > asm), I wasn't able to add my change as is to that code. Hopefully my
> > change could be used in pixman as well.
> One simple validation whether your code is correct is to check that using
> 0xFFFFFFFF as the source should write 0xFFFFFFFF to the destination buffer.
> And using 0x00000000 as the source should keep the destination unchanged.
> The relationship between integer and floating point values as color components
> is such that 0x00 is mapped to 0.0f, and 0xFF is mapped to 1.0f. It means
> that we actually have to deal with multiplications and divisions by 255, not
> 256. And this adds a bit of extra complexity.
> Pixman currently uses the following math as the core of alpha blending
> operation:
> x = ((x * a) + (255 / 2)) / 255;
> Which is equivalent to the following C code when avoiding really slow division
> and taking into account the range of possible input values (and that is what is
> directly implemented in NEON assembly too):
> t = x * a + 0x80;
> x = (t + (t >> 8)) >> 8;
> The following link is also an interesting read regarding division by constants
> which are not powers of two:
> http://research.swtch.com/2008/01/division-via-multiplication.html
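As a side note, the quoted shift trick is easy to verify exhaustively; a quick throwaway check (my own code, not from pixman) confirms it matches the rounded division by 255 for every possible x and a:

#include <assert.h>

int main(void)
{
    for (unsigned x = 0; x <= 255; x++)
        for (unsigned a = 0; a <= 255; a++) {
            unsigned exact = (x * a + 255 / 2) / 255;  /* rounded division by 255 */
            unsigned t     = x * a + 0x80;
            unsigned fast  = (t + (t >> 8)) >> 8;      /* the shift trick */
            assert(exact == fast);
        }
    return 0;
}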
The math in my code would not be correct in this case (but if vraddhn were used instead of vaddhn, it probably would be). However, in the case of over_n_8888, a source alpha of 0xFF or 0x00 should never reach this code path (it has to be either a no-op or a plain memcpy).
Since the NEON code in pixman premultiplies and narrows the solid color once at the beginning, it cannot be as accurate as my code. Pixman does:
src = (src * alpha + 0x80) >> 8;
t = dest * (0xff - alpha);
result = ((t + (t >> 8)) >> 8) + src;
My code uses 16-bit premultiplied color values, so there is less rounding error involved. In pixman's case, the rounding error from premultiplying the color components by alpha happens before the pixman function is even called.
I wrote a test: for all possible inputs (256^3) I calculated the resulting color using floating point math and then calculated the same color using different integer math functions. For each function I computed the error against the floating point result and accumulated it over all inputs. At the end of the loop I chose the function that produces the least error on average. Surprisingly, it didn't have that right shift+add. But vraddhn instead of vaddhn would have the same effect as the shr+add (if I understand correctly what the R adds to the instruction's behavior). My function produces values that are off by one on some intervals compared to pixman's math, but on those intervals the floating point value is something like xx.49843434, so rounding this value up isn't such a big error anyway.
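For reference, the skeleton of that test looks roughly like this (a sketch only; the two integer candidates below are illustrative placeholders, not the exact formulas from this thread):

#include <math.h>
#include <stdio.h>

/* candidate A: premultiply and blend with rounding (pixman-style) */
static unsigned blend_a(unsigned src, unsigned dst, unsigned a)
{
    unsigned s = src * a + 0x80;
    s = (s + (s >> 8)) >> 8;            /* 8-bit premultiplied source */
    unsigned t = dst * (0xff - a) + 0x80;
    t = (t + (t >> 8)) >> 8;
    return s + t;
}

/* candidate B: keep the premultiplied source at 16 bits, truncate once */
static unsigned blend_b(unsigned src, unsigned dst, unsigned a)
{
    unsigned s16 = src * a;             /* 16-bit premultiplied source */
    unsigned t   = dst * (0xff - a);
    return (s16 + t) >> 8;
}

int main(void)
{
    double err_a = 0.0, err_b = 0.0;

    for (unsigned src = 0; src <= 255; src++)
        for (unsigned dst = 0; dst <= 255; dst++)
            for (unsigned a = 0; a <= 255; a++) {
                /* floating point reference: src over dst, per channel */
                double ref = (src * a + dst * (255.0 - a)) / 255.0;
                err_a += fabs((double)blend_a(src, dst, a) - ref);
                err_b += fabs((double)blend_b(src, dst, a) - ref);
            }

    printf("accumulated error A: %f\n", err_a);
    printf("accumulated error B: %f\n", err_b);
    return 0;
}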