[Pixman] [PATCH] ARM: NEON: optimization for bilinear scaled 'over 8888 8888'

Mon Apr 4 06:53:11 PDT 2011

On Wed, Mar 16, 2011 at 9:33 AM, Taekyun Kim <podain77 at gmail.com> wrote:
> Performance Benchmark Result on ARM Cortex-A8 (scaling-bench)
>   before : transl: op=3, src=20028888, mask=- dst=20028888, speed=5.58 MPix/s
>   after  : transl: op=3, src=20028888, mask=- dst=20028888, speed=37.84 MPix/s
>
>   performance of nearest scaling over for comparison
>            transl: op=3, src=20028888, mask=- dst=20028888, speed=60.73 MPix/s
>   performance of bilinear scaling src for comparison
>            transl: op=1, src=20028888, mask=- dst=20028888, speed=65.47 MPix/s

Thanks for the numbers. I just want to point out one thing: we really
need to set the bar right when estimating whether we are doing well
with these optimizations or not. For example, 37.84 / 5.58 is a 6.78x
improvement. Looks great, right? On the other hand, one alternative is
to just do bilinear scaling first to a temporary image, followed by an
unscaled over operation. You did not provide the numbers for unscaled
over, but we can even take nearest over as an estimate. So this
two-step operation would run at 1 / (1 / 60.73 + 1 / 65.47) = 31.5
MPix/s, which gives your optimized code only a 37.84 / 31.5 = 1.20x
advantage. Using the real unscaled over operation would bring this
number down even more.
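
To make the arithmetic above explicit, here is a minimal sketch of the
estimate. It assumes the passes run back to back over the same pixels,
so their per-pixel times simply add up. The helper name combined_rate
is made up for illustration and is not part of pixman.

```c
#include <assert.h>

/* Throughput of several passes run one after another over the same
 * pixels: the time per megapixel adds up, so the rates combine
 * harmonically. Rates are in MPix/s. */
static double combined_rate(const double *rates, int n)
{
    double time_per_mpix = 0.0;

    assert(n > 0);
    for (int i = 0; i < n; i++)
        time_per_mpix += 1.0 / rates[i];
    return 1.0 / time_per_mpix;
}

/* How many times faster a single pass fast path is than the
 * multi-pass alternative. */
static double advantage(double single_pass_rate, double multi_pass_rate)
{
    return single_pass_rate / multi_pass_rate;
}
```

Calling combined_rate() with { 60.73, 65.47 } and passing the result to
advantage() together with 37.84 reproduces the two-step estimate and
the advantage figure quoted above.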

So what am I trying to say? If we were to release pixman as it is
today (the code from the current pixman git master), then some of the
bilinear scaling cases would become a lot faster, but it would still
not cover everything. Users of cairo/pixman might also find it
beneficial to explicitly split complex operations into separate steps,
which is a bad idea in the long run. One more part of the picture is
the general "fetch -> combine -> store" pipeline. If we provide a
really well optimized NEON bilinear fetcher for this pipeline, then at
least the performance will be reasonable for a lot of operations which
involve bilinear scaling. And after this, any single pass fast path
will have to compete against the general pipeline + NEON bilinear
fetcher, which might still be a challenge.
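
As a rough illustration of the "fetch -> combine -> store" idea, here
is a simplified scanline sketch in plain C. It is not pixman's actual
code: the names (fetch_fn, fetch_nearest, combine_over,
composite_scanline), the CHUNK size and the 16.16 scale parameter are
all assumptions made for this example. The fetcher shown is a
nearest-neighbour stand-in; a NEON bilinear fetcher would plug into
the same slot. Pixels are assumed to be premultiplied a8r8g8b8, and
the combine stage reads and writes the destination directly instead of
going through a separate destination buffer.

```c
#include <stdint.h>

#define CHUNK 64   /* pixels per temporary buffer, hypothetical size */

/* A fetcher fills out[0..n) with the source pixels that map to
 * destination positions x..x+n-1. */
typedef void (*fetch_fn)(const uint32_t *src, int src_w,
                         int x, int n, uint32_t scale, uint32_t *out);

/* Stand-in fetcher: nearest-neighbour horizontal scaling, with the
 * scale factor given in 16.16 fixed point. */
static void fetch_nearest(const uint32_t *src, int src_w,
                          int x, int n, uint32_t scale, uint32_t *out)
{
    for (int i = 0; i < n; i++) {
        uint32_t sx = (uint32_t)(((uint64_t)(x + i) * scale) >> 16);
        if (sx >= (uint32_t)src_w)
            sx = (uint32_t)src_w - 1;
        out[i] = src[sx];
    }
}

/* a * b / 255 with rounding, for 8-bit a and b. */
static uint32_t mul_un8(uint32_t a, uint32_t b)
{
    uint32_t t = a * b + 0x80;
    return (t + (t >> 8)) >> 8;
}

/* Combine + store: premultiplied OVER,
 * dst = src + dst * (255 - src_alpha) / 255 for each channel. */
static void combine_over(uint32_t *dst, const uint32_t *src, int n)
{
    for (int i = 0; i < n; i++) {
        uint32_t s = src[i], d = dst[i];
        uint32_t ia = 255 - (s >> 24);
        uint32_t r = 0;
        for (int shift = 0; shift < 32; shift += 8) {
            uint32_t c = ((s >> shift) & 0xff) +
                         mul_un8((d >> shift) & 0xff, ia);
            if (c > 255)
                c = 255;   /* saturate */
            r |= c << shift;
        }
        dst[i] = r;
    }
}

/* One scanline: fetch a chunk into the temporary buffer, then
 * combine it with the destination. */
static void composite_scanline(uint32_t *dst, int dst_w,
                               const uint32_t *src, int src_w,
                               uint32_t scale, fetch_fn fetch)
{
    uint32_t tmp[CHUNK];

    for (int x = 0; x < dst_w; x += CHUNK) {
        int n = dst_w - x < CHUNK ? dst_w - x : CHUNK;
        fetch(src, src_w, x, n, scale, tmp);
        combine_over(dst + x, tmp, n);
    }
}
```

A single pass fast path would merge the two inner loops so that the
fetched pixels never leave registers, which is exactly the trade-off
discussed here.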

Regarding the "fetch -> combine -> store" pipeline, I have attached a
simple benchmark program, which tries to estimate the overhead of
using an intermediate temporary buffer. Here are some of the results:

 == Nokia N900 ==

 OMAP3430, ARM Cortex-A8 r1p3 @500MHz, 32-bit LPDDR @166MHz:
   direct copy:                    228.191 MB/s
   direct copy prefetched:         365.788 MB/s
   copy via tmp buffer:            154.853 MB/s
   copy via tmp buffer prefetched: 238.304 MB/s
   tmp buffer use slowdown ratio:  1.53x

 OMAP3430, ARM Cortex-A8 r1p3 @600MHz, 32-bit LPDDR @166MHz:
   direct copy:                    242.512 MB/s
   direct copy prefetched:         398.767 MB/s
   copy via tmp buffer:            174.982 MB/s
   copy via tmp buffer prefetched: 276.585 MB/s
   tmp buffer use slowdown ratio:  1.44x

 == Samsung Galaxy Tab ==

 S5PC110, ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
   direct copy:                    643.620 MB/s
   direct copy prefetched:         627.381 MB/s
   copy via tmp buffer:            671.489 MB/s
   copy via tmp buffer prefetched: 640.335 MB/s
   tmp buffer use slowdown ratio:  0.96x
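
For readers without the attachment, the two access patterns being
compared can be sketched roughly as below. This is a portable
memcpy-based approximation, not the attached NEON benchmark: the
function names and the TMP_BYTES size are assumptions. The
slowdown_ratio() helper is also a guess at how the last line of each
result block is derived; dividing the better of the two direct copy
figures by the better of the two tmp buffer figures does reproduce all
three reported ratios.

```c
#include <stddef.h>
#include <string.h>

#define TMP_BYTES 4096   /* hypothetical size, meant to stay in cache */

/* Direct copy: a single pass from src to dst. */
static void copy_direct(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Copy via temporary buffer: src -> tmp -> dst, one chunk at a time,
 * which mimics a fetch stage followed by a store stage. */
static void copy_via_tmp(void *dst, const void *src, size_t n)
{
    unsigned char tmp[TMP_BYTES];
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n > 0) {
        size_t chunk = n < TMP_BYTES ? n : TMP_BYTES;
        memcpy(tmp, s, chunk);   /* "fetch" */
        memcpy(d, tmp, chunk);   /* "store" */
        s += chunk;
        d += chunk;
        n -= chunk;
    }
}

/* Best direct copy rate divided by best tmp buffer rate (MB/s). */
static double slowdown_ratio(double direct, double direct_pf,
                             double tmp, double tmp_pf)
{
    double best_direct = direct > direct_pf ? direct : direct_pf;
    double best_tmp = tmp > tmp_pf ? tmp : tmp_pf;
    return best_direct / best_tmp;
}
```

When timing something like this, keep in mind that an optimizing
compiler may simplify the double memcpy, so a real benchmark needs
hand-written copy loops or some other way to keep both stages intact.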

As can be seen, an early revision of the ARM Cortex-A8 core from the
OMAP3430 SoC used in the Nokia N900 had a significant penalty for this
type of memory access pattern. So my older recommendation had always
been "use single pass processing at any cost". Investigating it a bit
more, it looks like memory bandwidth is maximized only when both read
and write operations with memory are happening at the same time, a
kind of "full-duplex" behavior. And overall, the performance seems to
be very far from utilizing the memory controller 100% there. Not to
mention the other WTF questions such as why prefetch degraded
performance in some cases of nearest scaling or why NEON makes memcpy
faster. But the latest generation chips seem to have improved a lot.
Both TI OMAP3630/DM3730 and Samsung S5PC110 show that the temporary
buffer is not a problem anymore, with Samsung being just a bit faster
overall (that's why I like to use it for all my latest benchmarks).

So right now, based on the numbers from modern Cortex-A8 devices, I
would not say that single pass processing is a clear winner anymore.
Though I also would not say that "fetch -> combine -> store" is the
way to go now. One example: for this particular bilinear
over_8888_8888 operation, if we split it into fetch and combine
stages, then the bilinear fetch is going to be CPU limited and the
combine part is going to be memory limited. Such an uneven load on
memory/CPU may make single pass processing a bit more favorable.

PS. Looks like history is repeating itself: the early revisions of
Cortex-A9 based systems also seem to have some kind of memory
performance weirdness. One example is OMAP4:
http://groups.google.com/group/pandaboard/msg/bd03264b6b800900 And
another example is here:
http://www.anandtech.com/show/4225/the-ipad-2-review/4 (just look at
the "Stdlib Write" vs. "Stdlib Copy" numbers). Hopefully this is going
to be resolved eventually. And of course, fast memory is not the most
important thing and is not a big deal for many use cases; it just
happens to be very important for pixman performance.

-- 
Best regards,
Siarhei Siamashka
-------------- next part --------------
A non-text attachment was scrubbed...
Name: neon-copy-bench.c
Type: text/x-csrc
Size: 4764 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/pixman/attachments/20110404/89d0c373/attachment.c>