[Pixman] [PATCH] ARM: NEON: optimization for bilinear scaled 'over 8888 8888'

Wed Mar 16 00:33:15 PDT 2011

Hi,

I'm sorry, I made some mistakes in the previous patch.
I had wrongly assumed that the q4~q7 registers were available to my functions.
The updated patch now passes the pixman scaling tests.

Performance Benchmark Result on ARM Cortex-A8 (scaling-bench)
  before: transl: op=3, src=20028888, mask=- dst=20028888, speed=5.58 MPix/s
  after:  transl: op=3, src=20028888, mask=- dst=20028888, speed=37.84 MPix/s

  performance of nearest scaling over for comparison
          transl: op=3, src=20028888, mask=- dst=20028888, speed=60.73 MPix/s

  performance of bilinear scaling src for comparison
          transl: op=1, src=20028888, mask=- dst=20028888, speed=65.47 MPix/s

On Tue, Mar 15, 2011 at 11:02 AM, Taekyun Kim <podain77 at gmail.com> wrote:
>
> Hi, it's nice to see that you keep looking into improving bilinear
> scaling performance for pixman. I just wonder if you have totally
> given up on non-NEON bilinear optimizations by now? My understanding
> was that this was the area which you originally tried to work on.
>

I have to consider many platforms, with and without SIMD.
Non-NEON bilinear optimizations are still a concern of mine,
but my priorities have changed temporarily for several reasons.

> Also a bit tricky part is that I'm also still working on more pixman
> ARM NEON optimizations and I'm about to submit two additional bilinear
> performance optimizations patchsets, one of them unfortunately
> clashing with your patch. Not to mention that NEON optimized
> 'over_8888_8888' and 'over_8888_565' with bilinear scaled source are
> also part of my plan, even though they are not immediately available
> as of today.
>

I just needed some performance data immediately at that time,
and I'm waiting for your patches for the other bilinear operations to be
released :-)

> There are two pipeline stalls here on ARM Cortex-A8/A9. Most of NEON
> instructions have latency higher than 1 and you can't use the result
> of one instruction immediately in the next cycle without suffering
> from performance penalty. A simple reordering of instructions resolves
> the problem easily at least for this case:
>
> vuzp.8 d0, d1
> vuzp.8 d2, d3
> vuzp.8 d0, d1
> vuzp.8 d2, d3
>
> And unfortunately here we have really a lot of pipeline stalls which
> are a bit difficult to hide. This all does not make your solution bad,
> and it indeed should provide a really good speedup over C code. But it
> surely can be done a bit better.

I cannot find a reordering that avoids the pipeline stalls in the blending
and interleaving code.
The destination registers of the vmul, vadd and vqadd instructions only
become available at the N6 or N4 cycle.
With only four pixels per iteration, it seems hard to avoid pipeline stalls.
I think combining eight pixels at once would be more suitable for SW
pipelining.
I also expect that proper prefetching and aligned writes will
significantly increase the performance.
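
For reference, the per-pixel work of the blending step can be sketched in
scalar C as below. This is only an illustration and not code from the patch:
the function names are made up for this mail, premultiplied a8r8g8b8 pixels
are assumed, and the rounding shown is the usual divide-by-255 approximation,
which I assume is what the multiply and add sequence in the NEON code has to
reproduce per lane.

```c
#include <stdint.h>

/* Rounded (a * b) / 255 for 8-bit channel values (hypothetical helper). */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Saturating 8-bit add, i.e. what vqadd.u8 does for each lane. */
static uint8_t add_sat_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a + b;
    return (uint8_t)(t > 255 ? 255 : t);
}

/* OVER for one premultiplied a8r8g8b8 pixel:
 *   dst = src + dst * (255 - src_alpha) / 255
 * applied to each of the four 8-bit channels. */
static uint32_t over_8888(uint32_t src, uint32_t dst)
{
    uint8_t inv_alpha = (uint8_t)(255 - (src >> 24));
    uint32_t result = 0;

    for (int shift = 0; shift < 32; shift += 8) {
        uint8_t s = (uint8_t)(src >> shift);
        uint8_t d = (uint8_t)(dst >> shift);
        result |= (uint32_t)add_sat_un8(s, mul_un8(d, inv_alpha)) << shift;
    }
    return result;
}
```

Each channel needs a multiply, a rounding step and a saturating add that
depend on one another, which is why a four-pixel block leaves so few
independent instructions to place between them.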

I hope to see your patches soon.
And please leave some comments on my patch.

Thank you.

-- 
Best Regards,
Taekyun Kim
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-ARM-NEON-optimizations-for-bilinear-sclaed-over_8888.patch
Type: text/x-patch
Size: 6695 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/pixman/attachments/20110316/445652a4/attachment.bin>