[Pixman] [PATCH] ARM: NEON: optimization for bilinear scaled 'over 8888 8888'

Tue Apr 5 05:27:41 PDT 2011

On Tue, Apr 5, 2011 at 9:26 AM, Taekyun Kim <podain77 at gmail.com> wrote:
> 2011/4/4 Siarhei Siamashka <siarhei.siamashka at gmail.com>
>> So right now, based on the numbers from modern Cortex-A8 devices, I
>> would not say that single pass processing is a clear winner anymore.
>> Though I also would not say that "fetch -> combine -> store" is the
>> way to go now. One example: for this particular bilinear
>> over_8888_8888 operation, if we split it into fetch and combine
>> stages, then bilinear fetch is going to be CPU limited and combine
>> part is going to be memory limited. Such uneven load on memory/cpu may
>> make single pass processing a bit more favorable.
>
> Cairo actually uses many temporary buffers and image objects to make
> problems easier.
> Each of these operations calls malloc/free several times, which causes
> serious inter-thread contention.
> Although well-implemented memory management and caching can reduce this
> overhead, I want to avoid these temporary-buffer approaches if possible.
> Temporary buffers for a single scanline (not the entire image) may be
> reasonable and would fit easily into a caching mechanism,
> but the problem is not that simple in the multi-threaded case.
> From this point of view, single-pass is always favorable if it beats the
> general fetch->combine->store pipeline in performance,
> because it does not require alloc/free of temporary buffers.
> For this reason, single-pass can be a good choice for compositing small
> images.

Reasonably small temporary buffers are already allocated on the stack,
so this does not seem to be a big problem.
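
To illustrate the point about stack allocation, here is a minimal sketch of
a fetch -> combine -> store loop that keeps its scanline buffer on the stack
and only falls back to malloc for very wide images. This is not pixman's
actual code: STACK_PIXELS, fetch_scanline(), over_pixel() and
composite_over() are hypothetical names chosen for this example, and the
"fetch" stage is a plain copy standing in for something like bilinear
scaling.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define STACK_PIXELS 1024 /* hypothetical threshold, not pixman's value */

/* Hypothetical "fetch" stage: a plain copy stands in for real scaling. */
static void fetch_scanline(uint32_t *tmp, const uint32_t *src, int width)
{
    memcpy(tmp, src, (size_t)width * sizeof(uint32_t));
}

/* Premultiplied a8r8g8b8 OVER for one pixel: d = s + d * (255 - sa) / 255 */
static uint32_t over_pixel(uint32_t s, uint32_t d)
{
    uint32_t ia = 255 - (s >> 24);
    uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t sc = (s >> shift) & 0xff;
        uint32_t dc = (d >> shift) & 0xff;
        uint32_t t = dc * ia + 128;
        uint32_t c = sc + ((t + (t >> 8)) >> 8); /* rounded dc * ia / 255 */
        if (c > 255)
            c = 255;
        result |= c << shift;
    }
    return result;
}

/* Strides are in pixels. Returns 0 on success, -1 if malloc fails. */
static int composite_over(uint32_t *dst, const uint32_t *src,
                          int width, int height, int stride)
{
    uint32_t stack_buf[STACK_PIXELS];
    uint32_t *tmp = stack_buf;

    assert(width > 0 && height > 0 && stride >= width);

    /* Only scanlines wider than the stack buffer need the heap. */
    if (width > STACK_PIXELS) {
        tmp = malloc((size_t)width * sizeof(uint32_t));
        if (!tmp)
            return -1;
    }

    for (int y = 0; y < height; y++) {
        uint32_t *d = dst + (size_t)y * stride;
        fetch_scanline(tmp, src + (size_t)y * stride, width);
        for (int x = 0; x < width; x++)
            d[x] = over_pixel(tmp[x], d[x]);
    }

    if (tmp != stack_buf)
        free(tmp);
    return 0;
}
```

With this layout, the common case (scanlines up to STACK_PIXELS wide) never
touches the allocator, so there is no malloc/free contention between
threads.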

> Fetch->combine->store is relatively simple to implement by reusing the
> existing fast path scanline functions.
> It gives reasonable performance and maximizes code reuse.
> It can also give us overlap-aware behavior in some simple cases.
> Anyway, we have to consider both (single-pass vs. general) for various CPUs,
> and I have that in mind now.
> Thanks for the advice.

You are welcome :)

> Another interesting thing I recently found is that reading a memory location
> that was recently written causes a performance decrease.
> pixman_blt(src, dst...) with src != dst (far enough apart) gave me 350
> MPix/s on S5PC110 with 32 bpp.
> But if src == dst, or src and dst are close, it dropped to 180 MPix/s.
> When src and dst were farther apart, performance increased.
> I searched many ARM documents but couldn't find any clues about this behavior.
> At first I thought the latter case would cause more cache hits and should be
> much faster.
> But I was wrong.
> Maybe write-allocate mode in the L2 cache, or the write buffer, is the reason
> for this.
> I found this while implementing an overlap-aware blt function.
> Any ideas?
> I attached my test code.

Your test code does not seem to initialize the memory buffer explicitly,
and this is bad. Because of COW (copy-on-write), the whole buffer may
easily end up mapped to a single zero page, distorting benchmark results
due to much better caching.
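
A minimal sketch of one way to avoid this, assuming a benchmark that gets
its buffers from malloc: write a non-zero pattern to every pixel before
timing anything, so that each page is backed by its own physical memory.
The helper name alloc_initialized_pixels() is hypothetical, and the
xorshift generator is just a convenient source of non-zero values.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Allocate `count` 32-bit pixels and write every one of them before the
 * buffer is used in a benchmark. Returns NULL on allocation failure.
 */
static uint32_t *alloc_initialized_pixels(size_t count, uint32_t seed)
{
    uint32_t *buf;
    uint32_t x = seed ? seed : 1; /* xorshift32 state must be non-zero */

    if (count == 0 || count > SIZE_MAX / sizeof(uint32_t))
        return NULL;

    buf = malloc(count * sizeof(uint32_t));
    if (!buf)
        return NULL;

    for (size_t i = 0; i < count; i++) {
        /* xorshift32: never yields zero when started from non-zero state */
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        buf[i] = x;
    }
    return buf;
}
```

Both the source and the destination buffer should be set up this way before
the first timed pixman_blt call; the source matters most, because a buffer
that is only ever read is the one most likely to stay mapped to the zero
page.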

BTW, I also have overlap-aware blt code along with a test program to
check correctness of this functionality:
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped_blt
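
For readers unfamiliar with the term, the basic idea of an overlap-aware
blt can be sketched as follows. This is not the code from the branch above,
only a simplified illustration for a 32 bpp image with source and
destination rectangles in the same buffer; blt_overlap_aware() and its
parameters are hypothetical. Rows are copied bottom-up when the destination
lies below the source, and memmove takes care of overlap within a row.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy a width x height rectangle from (src_x, src_y) to (dst_x, dst_y)
 * inside one 32 bpp image. `stride` is in pixels. Both rectangles are
 * assumed to lie entirely within the image.
 */
static void blt_overlap_aware(uint32_t *bits, int stride,
                              int src_x, int src_y,
                              int dst_x, int dst_y,
                              int width, int height)
{
    assert(bits && stride > 0 && width >= 0 && height >= 0);

    for (int i = 0; i < height; i++) {
        /*
         * If the destination is below the source, a top-down copy would
         * overwrite source rows before they are read, so go bottom-up.
         */
        int row = (dst_y > src_y) ? height - 1 - i : i;

        /* memmove handles horizontal overlap when the rows coincide */
        memmove(bits + (size_t)(dst_y + row) * stride + dst_x,
                bits + (size_t)(src_y + row) * stride + src_x,
                (size_t)width * sizeof(uint32_t));
    }
}
```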

--
Best regards,
Siarhei Siamashka