[Pixman] [PATCH] ARM: NEON: optimization for bilinear scaled 'over 8888 8888'

Mon Apr 4 09:12:36 PDT 2011

2011/4/4 Siarhei Siamashka siarhei.siamashka at gmail.com

>
> Sorry, I was a little bit sick recently and dropped out for a while.
> Hopefully everything can get back on track now.

Oh... Are you OK now? Take care, nothing is more important than health.
Anyway welcome back :-)

> Improvement from doing aligned writes does not actually provide a
> really significant improvement, but it's surely measurable and
> cumulative with other optimizations.
>
I didn't test the performance of unaligned writes on ARM cores.
(I was simply thinking of the destination alignment hint of the vst instructions.)
But on Intel Core2-Quad CPUs, an aligned-write memset32 (for solid
filling) using rep stosd hit the theoretical memory bandwidth limit.
I think the x86 rep stos/movs instructions give the CPU more chance to control
burst write synchronization (it was even faster than the SSE2 version).
Aligned writes are somewhat HW dependent, so I think it has to be carefully
checked whether there is an actual performance gain in spite of the alignment
overhead for scanlines of various lengths.
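
To make the alignment overhead concrete, here is a minimal C sketch of a
solid fill that does its bulk stores only at 16-byte aligned addresses. It
is not code from pixman or from the patch: the name fill32_aligned and the
16-byte boundary are assumptions chosen for illustration. The head and tail
loops are the extra cost that has to be paid back by the aligned body, and
for short scanlines they can dominate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a pixman function): fill n 32-bit pixels. */
void fill32_aligned(uint32_t *dst, uint32_t value, size_t n)
{
    /* A uint32_t pointer is expected to be 4-byte aligned already. */
    assert(((uintptr_t)dst & 3) == 0);

    /* Head: single stores until dst reaches a 16-byte boundary. */
    while (n > 0 && ((uintptr_t)dst & 15) != 0) {
        *dst++ = value;
        n--;
    }

    /* Body: 4 pixels (16 bytes) per iteration, always at aligned addresses.
     * A real implementation would use one wide store here. */
    while (n >= 4) {
        dst[0] = value;
        dst[1] = value;
        dst[2] = value;
        dst[3] = value;
        dst += 4;
        n -= 4;
    }

    /* Tail: the remaining 0..3 pixels. */
    while (n > 0) {
        *dst++ = value;
        n--;
    }
}
```

Benchmarking this against a plain loop over a range of scanline lengths
would be one way to check whether the aligned body is worth the head and
tail handling on a given core.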

>
> +/* Combine function for operator OVER */
> +.macro bilinear_combine_over dst_fmt, numpix
> +       vpush           { q4, q5 }
>
> It's better to do vpush/vpop at the function entry/exit and not in the
> inner loop.
>
According to the ARM EABI calling convention, only the q4-q7 registers need to
be preserved.
I didn't know that at the time.
So I used some other registers to avoid the stack push/pop.

> +/* Destination pixel load functions for bilinear_combine_XXXX */
> +.macro bilinear_load_dst_8888 numpix
> +.if numpix == 4
> +       pld                     [OUT, #64]
>
> It's better to have the same N pixels ahead prefetch distance for both
> source and destination.
>
> The whole prefetch stuff works in the following way. Using PLD
> instruction, you poke some address in memory which is going to be
> accessed soon, and then you need to keep CPU busy doing something else
> for hundred(s) of cycles before the data is fetched into cache. So
> it's not something like "let's prefetch one cache line ahead" because
> such prefetch distance can be easily too small. Optimal prefetch
> distance depends on how heavy are the computations done in the
> function per loop iteration. So for unscaled fast paths or nearest
> scaling we need to prefetch really far ahead. For more CPU heavy
> bilinear interpolation, prefetch distance can be a lot smaller. But
> the rough estimation is the following: if we prefetch N pixels ahead,
> and we use X cycles per pixel, then memory controller has N * X cycles
> time to fetch the needed data into cache before we attempt to access
> this data. Of course everything gets messy when memory bandwidth is
> the bottleneck and we are waiting for memory anyway, also there is
> cache line granularity to consider, etc. But increasing prefetch
> distance in an experimental way until it stops providing performance
> improvement still works well in practice :) Still the point is that we
> select some prefetch distance, and it is N pixels ahead, synchronized
> for both source and destination fetches.
>
I've done various experiments with the PLD instruction.
I removed the cache preload from the NEON fast path functions and then benchmarked;
there was no difference in performance.
I tested some other NEON functions (like memcpy) in a similar way, but saw no
difference at all.
As far as I know, the Cortex-A8 has a preload engine (maybe not, depending on the
SoC integration??) but PLD is just a hint to the HW.
So it is implementation dependent, right?
I need to investigate this further and acquire various targets
with different SoCs.
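
For reference, the "N pixels ahead for both source and destination" idea can
be sketched in C roughly as below. This is only an illustration under stated
assumptions, not the NEON code from the patch: PREFETCH_PIXELS, the 64-byte
cache line, and the function names are all made up here, and the per-channel
saturating add merely stands in for the real per-pixel work.
__builtin_prefetch is the GCC/Clang builtin; like PLD it is only a hint, so
the output is identical whether or not the hardware acts on it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_PIXELS 64  /* assumed distance N; to be tuned by benchmarking */

/* Rough time budget for the memory system: N pixels ahead * X cycles/pixel. */
unsigned prefetch_budget_cycles(unsigned pixels_ahead, unsigned cycles_per_pixel)
{
    return pixels_ahead * cycles_per_pixel;
}

/* Per-channel saturating add of two 8888 pixels (stand-in for real work). */
uint32_t add_8888(uint32_t s, uint32_t d)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t c = ((s >> shift) & 0xff) + ((d >> shift) & 0xff);
        out |= (c > 0xff ? 0xffu : c) << shift;
    }
    return out;
}

/* Hypothetical scanline function: same prefetch distance for src and dst. */
void add_scanline_prefetch(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Issue hints once per 16 pixels (64 bytes, an assumed line size). */
        if ((i & 15) == 0 && i + PREFETCH_PIXELS < n) {
            __builtin_prefetch(&src[i + PREFETCH_PIXELS], 0); /* will be read */
            __builtin_prefetch(&dst[i + PREFETCH_PIXELS], 1); /* will be written */
        }
        dst[i] = add_8888(src[i], dst[i]);
    }
}
```

With N = 64 and, say, 5 cycles per pixel, prefetch_budget_cycles gives 320
cycles for the data to arrive. Changing PREFETCH_PIXELS (or removing the
hints) and re-running a benchmark is the experiment described above.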

Thanks for your kind explanation about prefetching.
I've learned a lot, both knowledge and ways of thinking, from you and your code :-)
I really appreciate that.

I've done some additional work on overlapped blit functions and a bilinear
filter with A8 mask for operators OVER and ADD.
(tightly scheduled work here...)
I have some issues there (related to memory access patterns).
Maybe we can discuss them in the next posting.

-- 
Best Regards,
Taekyun Kim