Hi,<div><br></div><div><div>I'm sorry, I made some mistakes in the previous patch.</div><div>I had mistakenly assumed that the q4~q7 registers were free to use in my functions.</div><div>The patch now passes the pixman scaling tests.</div>
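<div><br></div><div>For reference, q4~q7 (d8~d15) are callee-saved under the ARM procedure call standard, so a function that uses them has to preserve them. A minimal sketch of the usual pattern, assuming GNU assembler syntax (this is not the exact code from my patch, and the function body is only a placeholder):</div>
<div><br>
vpush {d8-d15} /* save the callee-saved NEON registers (q4-q7) */<br>
/* ... function body that may clobber q4-q7 ... */<br>
vpop {d8-d15} /* restore them before returning */<br>
bx lr<br>
<br></div>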
<div><br></div><div>Performance Benchmark Result on ARM Cortex-A8 (scaling-bench)</div><div> before : transl: op=3, src=20028888, mask=- dst=20028888, speed=5.58 MPix/s</div><div> after : transl: op=3, src=20028888, mask=- dst=20028888, speed=37.84 MPix/s</div>
<div> </div><div> Nearest scaling with the over operator, for comparison:</div><div> transl: op=3, src=20028888, mask=- dst=20028888, speed=60.73 MPix/s</div><div><br></div><div> Bilinear scaling with the src operator, for comparison:</div>
<div> transl: op=1, src=20028888, mask=- dst=20028888, speed=65.47 MPix/s</div><div><br></div><div><br></div><div class="gmail_quote"><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div class="im">On Tue, Mar 15, 2011 at 11:02 AM, Taekyun Kim &lt;<a href="mailto:podain77@gmail.com">podain77@gmail.com</a>&gt; wrote:<br><br>
</div>Hi, it's nice to see that you keep looking into improving bilinear<br>
scaling performance for pixman. I just wonder if you have totally<br>
given up on non-NEON bilinear optimizations by now? My understanding<br>
was that this was the area which you originally tried to work on.<br></blockquote><div><br></div><div>I have to consider many platforms with or without SIMD.</div><div>Non-NEON bilinear optimizations are still in my concern.</div>
<div>But the priority has changed temporarily for some reasons.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Also a bit tricky part is that I'm also still working on more pixman<br>
ARM NEON optimizations and I'm about to submit two additional bilinear<br>
performance optimizations patchsets, one of them unfortunately<br>
clashing with your patch. Not to mention that NEON optimized<br>
'over_8888_8888' and 'over_8888_565' with bilinear scaled source are<br>
also part of my plan, even though they are not immediately available<br>
as of today.<br></blockquote><div><br></div><div>I needed some performance data right away at that time,</div><div>and I'm waiting for your patches for the other bilinear operations to be released :-)</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
There are two pipeline stalls here on ARM Cortex-A8/A9. Most of NEON<br>
instructions have latency higher than 1 and you can't use the result<br>
of one instruction immediately in the next cycle without suffering<br>
from performance penalty. A simple reordering of instructions resolves<br>
the problem easily at least for this case:<br>
<div class="im"><br>
vuzp.8 d0, d1<br>
vuzp.8 d2, d3<br>
vuzp.8 d0, d1<br>
vuzp.8 d2, d3<br>
<br></div>And unfortunately here we have really a lot of pipeline stalls which<br>
are a bit difficult to hide. This all does not make your solution bad,<br>
and it indeed should provide a really good speedup over C code. But it<br>
surely can be done a bit better.</blockquote></div><div><br></div>I could not find a reordering that avoids the pipeline stalls in the blending and interleaving steps.</div><div>The destination registers only become available at the N6 or N4 stage for the vmul, vadd and vqadd instructions.</div>
<div>With only four pixels, it seems hard to avoid pipeline stalls.</div><div>I think processing eight pixels at once would be better suited to software pipelining.</div>
<div>I also expect that proper prefetching and aligned writes will significantly improve performance.</div><div><br></div><div>I hope to see your patches soon.</div><div>Please leave some comments on my patch.</div>
<div><br></div><div>Thank you.</div><div><br>-- <br>Best Regards,<div>Taekyun Kim</div><br>
</div>