[cairo] [PATCH/RFC] pixman ARM NEON optimizations (now about performance)

Mon Jul 27 03:26:32 PDT 2009

Hello,

I have just written a simple low-level performance benchmark program for pixman
('lowlevel-blt-bench'). The results of quick runs on different
platforms are the following:

1.86GHz pentium-m laptop (gcc 4.3.2):
  http://pastebin.com/f10a2cc7 

playstation 3 (linux, gcc 4.4.1):
  http://pastebin.com/facadcb0

500MHz ARM Cortex-A8, beagleboard, 1280x1024-16@57 (gcc 4.4.1):
  http://pastebin.com/f822cb2a (note: over_8888_0565 uses my optimizations)

Some quick ARM related notes:
1. Pixman dispatch logic overhead is quite high. I added an ugly hack
to the benchmark program to look up the NEON fastpath tables and call
the blitters directly. For the L1 test (which runs completely in L1 cache
in order to benchmark the inner loops), the overhead of the pixman code varies
between 30% and more than a 2x slowdown. Surely pixman does some extra
necessary work, like clipping boundary checks. But parts like the linear search
in the fastpath tables are probably not the best for performance (and this is
going to get worse when more optimized functions are added to the tables). This
needs better investigation, though, to see where most of the time is spent.

2. Performance can be improved in many areas. For example, src_0565_0565 (NEON
optimized by Jonathan) uses only about 60% of the available memory bandwidth on
OMAP3. A suboptimal choice of prefetch method is most likely at fault. It's
important to note that I'm using a NEON optimized memcpy on my system, so it can
be used as a baseline for comparisons here.
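
To illustrate the linear search mentioned in note 1, here is a minimal sketch of
a table-driven fast path lookup. The types, enum values and function names are
hypothetical simplifications and are not pixman's actual internals; the step
counter exists only to make the cost of the scan visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for pixman's internal types. */
typedef enum { FMT_NONE, FMT_A8R8G8B8, FMT_X8R8G8B8, FMT_R5G6B5 } fmt_t;
typedef enum { OP_NONE, OP_SRC, OP_OVER } op_t;

typedef void (*blit_func_t)(void);

typedef struct {
    op_t        op;
    fmt_t       src;
    fmt_t       dst;
    blit_func_t func;
} fast_path_t;

/* Empty placeholders for the real blitters. */
static void blit_src_0565_0565(void) {}
static void blit_over_8888_0565(void) {}
static void blit_over_8888_8888(void) {}

/* The table is terminated by an OP_NONE entry. */
static const fast_path_t fast_paths[] = {
    { OP_SRC,  FMT_R5G6B5,   FMT_R5G6B5,   blit_src_0565_0565 },
    { OP_OVER, FMT_A8R8G8B8, FMT_R5G6B5,   blit_over_8888_0565 },
    { OP_OVER, FMT_A8R8G8B8, FMT_A8R8G8B8, blit_over_8888_8888 },
    { OP_NONE, FMT_NONE,     FMT_NONE,     NULL }
};

/* Linear scan: the cost grows with the position of the matching entry,
 * and a miss has to walk the whole table. */
static blit_func_t lookup_fast_path(const fast_path_t *table,
                                    op_t op, fmt_t src, fmt_t dst,
                                    int *steps)
{
    const fast_path_t *p;

    assert(table != NULL && steps != NULL);
    *steps = 0;
    for (p = table; p->op != OP_NONE; p++) {
        (*steps)++;
        if (p->op == op && p->src == src && p->dst == dst)
            return p->func;
    }
    return NULL;
}
```

In this sketch an entry near the end of the table, or an operation with no
fast path at all, pays for every comparison before it, which is why the
per-call overhead is most noticeable for small L1-resident blits.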

--------------------------------------------------------------------------

I started to implement NEON pixman optimizations myself and
used the 'over_8888_0565' function as the subject of the initial experiments.
This code is available at:
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=arm-neon-preview
http://cgit.freedesktop.org/~siamashka/pixman/commit/?id=7de602bdb45d682b3c0f17aac59ce03b7ae1fcd2

The code is scheduled according to Cortex-A8 timings and uses pipelining in
the inner loop. The performance is quite good, but maybe one or two more cycles
can still be squeezed out. This code even outperforms the over_8888_0888
function in the L1 cached case, in spite of having to additionally do
32bpp<->16bpp conversion. Prefetch is simply done 128/65 bytes ahead. This
works reasonably well for very large images, for images which have matching
width and stride (overprefetching the first line is conveniently a valid
prefetch for the next line), or when blitting many small images repeatedly,
advancing from left to right. An interesting thing is that the ARM pipeline is
heavily underused while the NEON unit is doing all the work. It is possible to
implement quite complex prefetch logic using ARM instructions, and it would
have zero impact on the performance. I have some interesting stuff already and
will post about the "quest for the ideal prefetcher" in the next e-mail.
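
For readers unfamiliar with the operation, the following is a minimal scalar
sketch of what an over_8888_0565 scanline has to do, including the
32bpp<->16bpp conversion and a fixed-distance prefetch. It is not the NEON
code from the branch above. It assumes premultiplied a8r8g8b8 source pixels, a
GCC or Clang compiler (for __builtin_prefetch), and an illustrative prefetch
distance of 128 bytes; all function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_AHEAD 128 /* bytes ahead; illustrative value only */

/* Expand r5g6b5 to 8 bits per channel by replicating the high bits. */
static uint32_t expand_0565(uint16_t p)
{
    uint32_t r = (p >> 11) & 0x1f, g = (p >> 5) & 0x3f, b = p & 0x1f;
    r = (r << 3) | (r >> 2);
    g = (g << 2) | (g >> 4);
    b = (b << 3) | (b >> 2);
    return (r << 16) | (g << 8) | b;
}

/* Rounded x * a / 255 for x, a in [0, 255]. */
static uint32_t mul_div_255(uint32_t x, uint32_t a)
{
    uint32_t t = x * a + 0x80;
    return (t + (t >> 8)) >> 8;
}

static uint32_t clamp_255(uint32_t v)
{
    return v > 255 ? 255 : v;
}

/* OVER with a premultiplied source: dst = src + dst * (255 - alpha) / 255 */
static uint16_t over_8888_0565_pixel(uint32_t s, uint16_t d)
{
    uint32_t ia  = 255 - (s >> 24);
    uint32_t d32 = expand_0565(d);
    uint32_t r = clamp_255(((s >> 16) & 0xff) + mul_div_255((d32 >> 16) & 0xff, ia));
    uint32_t g = clamp_255(((s >> 8) & 0xff) + mul_div_255((d32 >> 8) & 0xff, ia));
    uint32_t b = clamp_255((s & 0xff) + mul_div_255(d32 & 0xff, ia));
    /* Pack back to r5g6b5 by dropping the low bits. */
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

static void over_8888_0565_scanline(uint16_t *dst, const uint32_t *src, size_t w)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i < w; i++) {
        if ((i & 15) == 0) {
            /* Every 16 pixels (64 bytes of source), hint at data a fixed
             * distance ahead. Prefetch is only a hint, so an address past
             * the end of the buffer does not fault. */
            __builtin_prefetch((const void *)((uintptr_t)(src + i) + PREFETCH_AHEAD), 0);
            __builtin_prefetch((const void *)((uintptr_t)(dst + i) + PREFETCH_AHEAD), 1);
        }
        dst[i] = over_8888_0565_pixel(src[i], dst[i]);
    }
}
```

With a fixed distance like this, running past the end of a scanline simply
prefetches the start of the next one when width and stride match, which is the
convenient case described above.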

I also tried to enable and benchmark Jonathan's NEON optimized 'over_8888_0565'
function even though it has some problems with correctness. Benchmarks on the
OMAP3 SoC with a Cortex-A8 core give results which are, to put it mildly,
not impressive at all:

over_8888_0565 = L1:  55.90 M: 21.04 HT: 17.91 VT: 18.07 R: 14.68
over_8888_0565 = L1:  47.86 M: 21.03 HT: 16.64 VT: 16.88 R: 13.83

Meanwhile, even the default C code gets the following numbers:

over_8888_0565 = L1:  32.06 M: 24.87 HT: 18.64 VT: 18.19 R: 17.24

I assume that Jonathan may just have targeted a different SoC with a different
memory subsystem and tried to optimize for a very specific use case (alpha
blending into a non-cached framebuffer). But I myself am definitely biased
toward having good performance primarily on OMAP3/Cortex-A8 for cairo's 'image'
backend :-)

Surely benchmarks can be run on other SoCs just to make sure that pixman is
fast everywhere. In cases where different systems require substantially
different optimizations, separate variants of the functions could probably be
supported. I would appreciate it if somebody could test the following on
other NEON capable ARM chips and post the results somewhere:

$ git clone git://anongit.freedesktop.org/~siamashka/pixman
$ cd pixman
$ git checkout -b arm-neon-preview origin/arm-neon-preview

Make sure to configure pixman with the --disable-shared option (it helps to
avoid some problems if you want to copy the resulting binary to the device and
run it there).

Then run:
$ test/blitters-test (to make sure that there are no correctness problems)
and
$ test/lowlevel-blt-bench

Once we clarify all the details, I can make NEON optimizations for all the
performance critical functions in pixman.

-- 
Best regards,
Siarhei Siamashka