[Pixman] [PATCH] ARM: NEON: optimization for bilinear scaled 'over 8888 8888'

Mon Apr 4 23:28:55 PDT 2011

2011/4/4 Siarhei Siamashka <siarhei.siamashka at gmail.com>

>
> So what am I going to say? If we were to release pixman as it is today
> (the code from the current pixman git master), then some of the
> bilinear scaling cases would become a lot faster, but still not cover
> everything. And also the users of cairo/pixman might find it
> beneficial to explicitly split complex operations into separate steps,
> which is a bad idea in the long run. One more part of the picture is
> the "fetch -> combine -> store" general pipeline. If we provide a
> really well optimized NEON bilinear fetcher for this pipeline, then at
> least the performance will be reasonable for a lot of operations which
> involve bilinear scaling. And after this, any single pass fast path
> will have to compete against general pipeline + NEON bilinear fetcher,
> which might still be a challenge.
>
> Regarding the "fetch -> combine -> store" pipeline, I have attached a
> simple benchmark program, which tries to estimate the overhead of
> using intermediate temporary buffer. Here are some of the results:
>
>  == Nokia N900 ==
>
>  OMAP3430, ARM Cortex-A8 r1p3 @500MHz, 32-bit LPDDR @166MHz:
>   direct copy:                    228.191 MB/s
>   direct copy prefetched:         365.788 MB/s
>   copy via tmp buffer:            154.853 MB/s
>   copy via tmp buffer prefetched: 238.304 MB/s
>   tmp buffer use slowdown ratio:  1.53x
>
>  OMAP3430, ARM Cortex-A8 r1p3 @600MHz, 32-bit LPDDR @166MHz:
>   direct copy:                    242.512 MB/s
>   direct copy prefetched:         398.767 MB/s
>   copy via tmp buffer:            174.982 MB/s
>   copy via tmp buffer prefetched: 276.585 MB/s
>   tmp buffer use slowdown ratio:  1.44x
>
>  == Samsung Galaxy Tab ==
>
>  S5PC110, ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
>   direct copy:                    643.620 MB/s
>   direct copy prefetched:         627.381 MB/s
>   copy via tmp buffer:            671.489 MB/s
>   copy via tmp buffer prefetched: 640.335 MB/s
>   tmp buffer use slowdown ratio:  0.96x
>
> As can be seen, an early revision of ARM Cortex-A8 core from OMAP3430
> SoC used in Nokia N900 had a significant penalty for this type of memory
> access pattern. So my older recommendation had always been "use single
> pass processing at any cost". Investigating it a bit more, looks like
> memory bandwidth is maximized only when both read and write operations
> with memory are happening at the same time, kind of "full-duplex"
> behavior. And overall, the performance seems to be very far from
> utilizing memory controller 100% there. Not to mention the other WTF
> questions such as why prefetch degraded performance in some cases of
> nearest scaling or why NEON makes memcpy faster. But the latest generation
> chips seem to have improved a lot. Both TI OMAP3630/DM3730 and
> Samsung S5PC110 show that temporary buffer is not a problem anymore,
> Samsung being just a bit faster overall (that's why I like to use it
> for all my latest benchmarks).
>
> So right now, based on the numbers from modern Cortex-A8 devices, I
> would not say that single pass processing is a clear winner anymore.
> Though I also would not say that "fetch -> combine -> store" is the
> way to go now. One example: for this particular bilinear
> over_8888_8888 operation, if we split it into fetch and combine
> stages, then bilinear fetch is going to be CPU limited and combine
> part is going to be memory limited. Such uneven load on memory/cpu may
> make single pass processing a bit more favorable.
>
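
The benchmark quoted above compares a direct copy against a copy that goes
through an intermediate temporary buffer. The attached program is not shown
here, so the following is only a rough sketch of that kind of measurement, not
the original code. The names copy_direct, copy_via_tmp, measure_mbps and the
TMP_PIXELS chunk size are assumptions made for illustration, plain memcpy
stands in for the real fetch and store stages, and prefetching is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define TMP_PIXELS 1024  /* assumed chunk size: 4 KiB of 32 bpp pixels */

/* Single pass: source goes straight to destination. */
static void copy_direct(uint32_t *dst, const uint32_t *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

/* Two passes per chunk: source -> temporary buffer -> destination. */
static void copy_via_tmp(uint32_t *dst, const uint32_t *src, size_t n)
{
    uint32_t tmp[TMP_PIXELS];

    while (n > 0) {
        size_t chunk = n < TMP_PIXELS ? n : TMP_PIXELS;

        memcpy(tmp, src, chunk * sizeof *tmp);  /* "fetch" stage */
        memcpy(dst, tmp, chunk * sizeof *tmp);  /* "store" stage */
        src += chunk;
        dst += chunk;
        n -= chunk;
    }
}

/* Run fn() reps times and return throughput in MB/s (0.0 if the run was
 * too short for clock() to resolve). */
static double measure_mbps(void (*fn)(uint32_t *, const uint32_t *, size_t),
                           uint32_t *dst, const uint32_t *src,
                           size_t n, int reps)
{
    assert(reps > 0);

    clock_t t0 = clock();
    for (int i = 0; i < reps; i++)
        fn(dst, src, n);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    return secs > 0.0 ? (double)n * sizeof *dst * reps / secs / 1e6 : 0.0;
}
```

Dividing the measure_mbps() result for copy_direct by the one for copy_via_tmp
gives a slowdown ratio of the kind listed in the tables. An optimizing compiler
may merge the two memcpy calls, so a real measurement would need to check the
generated code or use buffers much larger than the caches.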

Cairo actually uses many temporary buffers and image objects to make these
problems easier.
Each of these operations calls malloc/free several times, which will cause
serious inter-thread contention.
Although well-implemented memory management and caching can reduce this
overhead, I want to avoid these temporary-buffer approaches if possible.
Temporary buffers for a single scanline (not the entire image) may be
reasonable and easy to fit into a caching mechanism,
but it is not that simple a problem in the multi-threaded case.
From this point of view, single-pass is always preferable when it beats the
general fetch->combine->store pipeline in performance,
because it does not require allocating and freeing temporary buffers.
For the same reason, single-pass can be a good choice for compositing small
images.

Fetch->combine->store is relatively simple to implement by reusing the
existing fast path scanline functions.
It will give us reasonable performance and maximize code reuse.
It can also give us overlap-aware behavior in some simple cases.
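
To make the scanline idea concrete, here is a minimal sketch of a
fetch->combine->store loop that keeps its temporary buffer on the stack, so no
malloc/free is involved. This is not pixman code: every name in it
(SCANLINE_CHUNK, mul_un8x4, fetch_nearest, combine_over, scaled_over_scanline)
is hypothetical, a nearest-neighbour fetch stands in for the bilinear fetcher
to keep it short, and pixels are assumed to be valid premultiplied a8r8g8b8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCANLINE_CHUNK 256  /* assumed chunk size: 1 KiB on the stack */

/* Scale all four 8-bit channels of p by a/255, with rounding. */
static uint32_t mul_un8x4(uint32_t p, uint32_t a)
{
    uint32_t rb = (p & 0x00ff00ffu) * a + 0x00800080u;
    uint32_t ag = ((p >> 8) & 0x00ff00ffu) * a + 0x00800080u;

    rb = ((rb + ((rb >> 8) & 0x00ff00ffu)) >> 8) & 0x00ff00ffu;
    ag = (ag + ((ag >> 8) & 0x00ff00ffu)) & 0xff00ff00u;
    return rb | ag;
}

/* Fetch stage: nearest-neighbour sampling, x and step in 16.16 fixed point. */
static void fetch_nearest(uint32_t *out, const uint32_t *src, size_t n,
                          uint32_t x, uint32_t step)
{
    for (size_t i = 0; i < n; i++, x += step)
        out[i] = src[x >> 16];
}

/* Combine + store stage: dst = src OVER dst. The plain add cannot overflow
 * as long as src is correctly premultiplied. */
static void combine_over(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t s = src[i];
        dst[i] = s + mul_un8x4(dst[i], 255u - (s >> 24));
    }
}

/* One destination scanline, processed in chunks through a stack buffer. */
static void scaled_over_scanline(uint32_t *dst, const uint32_t *src,
                                 size_t width, uint32_t x, uint32_t step)
{
    uint32_t tmp[SCANLINE_CHUNK];

    assert(dst != NULL && src != NULL);
    while (width > 0) {
        size_t n = width < SCANLINE_CHUNK ? width : SCANLINE_CHUNK;

        fetch_nearest(tmp, src, n, x, step);
        combine_over(dst, tmp, n);
        x += (uint32_t)n * step;
        dst += n;
        width -= n;
    }
}
```

Because the buffer is local to the call, each thread gets its own copy and
there is nothing to lock. The cost is the extra pass over tmp that a
single-pass fast path would avoid.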

Anyway, we have to consider both (single-pass vs. general) for various CPUs,
and I will keep that in mind from now on.
Thanks for the advice.

Another interesting thing I recently found is that reading a memory location
that was recently written decreases performance.
pixman_blt(src, dst...) with src != dst (far enough apart) gave me 350
MPix/s on S5PC110 at 32 bpp.
But if src == dst, or src and dst are close together, it dropped to 180 MPix/s.
When src and dst were farther apart, performance increased.
I searched through many ARM documents but couldn't find any clues about this
behavior.
At first I thought the latter case would cause more cache hits and should be
much faster,
but I was wrong.
Maybe write-allocate mode in the L2 cache, or the write buffer, is the reason
for this.
I found this while implementing an overlap-aware blt function.
Any ideas?

I attached my test code.
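
The attachment itself is not reproduced inline. As a rough illustration only,
a measurement of this kind could be sketched as below. This is not the
attached perf.c: blt_overlap_aware and blt_mpix_per_sec are hypothetical
names, and memmove is used as a stand-in for pixman_blt, so absolute numbers
will differ from the ones quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Overlap-aware copy: walk forward when dst is at or below src, backward
 * otherwise, so overlapping source pixels are read before being overwritten. */
static void blt_overlap_aware(uint32_t *dst, const uint32_t *src, size_t n)
{
    if ((uintptr_t)dst <= (uintptr_t)src) {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    } else {
        for (size_t i = n; i-- > 0; )
            dst[i] = src[i];
    }
}

/* Copy n pixels inside one allocation, with dst placed `distance` pixels
 * after src (distance 0 means src == dst). Returns MPix/s, 0.0 if the run
 * was too short to time, or -1.0 if the allocation failed. */
static double blt_mpix_per_sec(size_t n, size_t distance, int reps)
{
    assert(n > 0 && reps > 0);

    uint32_t *buf = calloc(n + distance, sizeof *buf);
    if (buf == NULL)
        return -1.0;

    clock_t t0 = clock();
    for (int i = 0; i < reps; i++)
        memmove(buf + distance, buf, n * sizeof *buf);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    volatile uint32_t sink = buf[distance];  /* keep the copies observable */
    (void)sink;
    free(buf);

    return secs > 0.0 ? (double)n * reps / secs / 1e6 : 0.0;
}
```

Calling blt_mpix_per_sec() with the same n and a range of distance values
(0, a few cache lines, several megabytes) would show whether throughput
depends on how far apart src and dst are on a given device.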

-- 
Best Regards,
Taekyun Kim
-------------- next part --------------
A non-text attachment was scrubbed...
Name: perf.c
Type: text/x-csrc
Size: 1595 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/pixman/attachments/20110405/b92894d4/attachment.c>