[Pixman] Pixman library on wince arm

Siarhei Siamashka siarhei.siamashka at gmail.com
Wed Sep 29 13:24:36 PDT 2010

On Wednesday 29 September 2010 19:57:05 PS wrote:
> On Wed, 29 Sep 2010 18:02:25 +0300, Siarhei Siamashka
<siarhei.siamashka at gmail.com> wrote:
> > While implementing all of this, it was discovered that the ARM Cortex-A8
> > processor from the OMAP3 chip used in the N900 phone has relatively slow
> > memory performance, which makes the advantages of "optimized" over "lazy"
> > implementations practically insignificant (they are mostly only beneficial
> > for working with the data in L1 or L2 cache). This effectively killed all
> > the motivation for investing extra effort into providing good instruction
> > scheduling for the less commonly used operations. I would really like to
> > get some other NEON capable hardware with an ARM Cortex-A5 or ARM
> > Cortex-A9 processor eventually, just out of curiosity to see if it could
> > benefit from "optimized" implementations more.
> I have an HTC HD2; it has a Snapdragon CPU that supports NEON. If you want,
> I can run two different functions in my test and tell you what the
> difference is. I wrote a test that uses a real polygon that I draw in my
> app and then runs different functions to see what the performance
> difference is. So far, my function without any optimization, scheduling
> improvements or optimization for leading/trailing pixels performs best (I
> don't try to zip/unzip pixels, I simply process 1, 2, 4 pixels at a time
> to get 8 pixel alignment). In my test there are many short scanlines, and
> my function would benefit from that kind of optimization for sure. What I
> also noticed is that more or less optimized C code compiled with the MS
> compiler for ARM performs extremely well. After I changed my C function it
> performs only about 15-20% slower than NEON optimized code. I have no idea
> how that's possible, but I guess the CPU is really well designed and runs
> plain ARM code extremely well. The Microsoft compiler has no NEON
> capabilities; it generates plain ARMv4/v5 code.

There could be many reasons why you see more or less the same performance
with and without NEON. It could be that your system is also heavily limited
by memory performance, resulting in both C and NEON code running equally
slowly. You may be interested to find out what's the matter by running some
benchmarks.

> My code uses 16-bit premultiplied color values and as a result there is
> less rounding error involved. That rounding error in premultiplying color
> components by alpha happens before the pixman function is called.

Yes, pixman only works with premultiplied formats, so some of the precision is
already lost even before calling this "over_n_8888" function.

Actually pixman could get better precision and performance for a different
"over_x888_8_8888" operation by avoiding intermediate rounding and using
something like:

    return (dst_color * (255 - alpha) + color * alpha + 127) / 255;

instead of:

    color = (color * alpha + 127) / 255; /* premultiply color */
    return (dst_color * (255 - alpha) + 127) / 255 + color;

But this will cause some trouble for the correctness tests, because
single-pass fast path functions would provide better precision than the
generic fetch->process->store pipeline. That's one example where a generally
good change makes something else a bit more difficult to do right.

A similar issue is related to r5g6b5 format support. Some precision and
performance is lost because of temporary format conversions to a8r8g8b8
(to be usable in the 'process' stage). But this one is not so difficult to
solve: pixman tests could be modified to require an exact checksum match
only for the a8r8g8b8/x8r8g8b8 formats, then separately verify correctness
of conversion between a8r8g8b8/x8r8g8b8 and the other formats, and finally
check the other types of compositing operations using similar
a8r8g8b8/x8r8g8b8 operations as a reference, allowing some difference in
the least significant bits.

> I wrote a test: for all possible inputs (256^3) I calculated the resulting
> color using floating point math and then calculated the same color using
> different integer math functions. For each function I calculated the
> error, and the total error was accumulated per function. At the end of the
> loop I chose the function that produces the least error on average.

I also made such a test program, and it is actually attached here. When
executed, it provides the following results:

$ ./test-precision-over-n-dst
stddev_pixman = 0.405459
stddev_ps     = 0.620871
test-precision-over-n-dst.c:74: main: Assertion `over_n_dst_ps (0xFF, 0xFF, 0x80) == 0xFF' failed.

It shows that your implementation is less precise on average, and it also
fails the test for an obvious corner case. I think pixman does somewhat more
complex computations for a reason.

> Surprisingly, it didn't have that right shift+add. But vraddhn instead of
> vaddhn would have the same effect as shr+add (if I understand correctly
> what that R adds to the instruction's behavior). My function would produce
> values that are off by one on some intervals compared to pixman's math,
> but on that interval the floating point value is something like
> xx.49843434 and therefore rounding up this value isn't such a big error
> anyway.

Unless I got something wrong (feel free to correct me), your implementation
seems to be a bit broken to me.

Best regards,
Siarhei Siamashka
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test-precision-over-n-dst.c
Type: text/x-csrc
Size: 2692 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/pixman/attachments/20100929/94fd846e/attachment.c>