Hi,<div><br></div><div><div>I'm sorry, I made some mistakes in the previous patch.</div><div>I mistakenly assumed that the q4~q7 registers were freely available to my functions.</div><div>The patch now passes the pixman scaling tests.</div> <div><br></div><div>Performance benchmark results on ARM Cortex-A8 (scaling-bench):</div><div> before : transl: op=3, src=20028888, mask=- dst=20028888, speed=5.58 MPix/s</div><div> after : transl: op=3, src=20028888, mask=- dst=20028888, speed=37.84 MPix/s</div> <div> </div><div> Performance of nearest scaling over, for comparison:</div><div> transl: op=3, src=20028888, mask=- dst=20028888, speed=60.73 MPix/s</div><div><br></div><div> Performance of bilinear scaling src, for comparison:</div> <div> transl: op=1, src=20028888, mask=- dst=20028888, speed=65.47 MPix/s</div><div><br></div><div><br></div><div class="gmail_quote"><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"> <div class="im">On Tue, Mar 15, 2011 at 11:02 AM, Taekyun Kim &lt;<a href="mailto:podain77@gmail.com">podain77@gmail.com</a>&gt; wrote:<br><br> </div>Hi, it's nice to see that you keep looking into improving bilinear<br> scaling performance for pixman. I just wonder if you have totally<br> given up on non-NEON bilinear optimizations by now? My understanding<br> was that this was the area which you originally tried to work on.<br></blockquote><div><br></div><div>I have to consider many platforms, both with and without SIMD.</div><div>Non-NEON bilinear optimizations are still of interest to me,</div> <div>but my priorities have changed temporarily for various reasons.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"> Also a bit tricky part is that I'm also still working on more pixman<br> ARM NEON optimizations and I'm about to submit two additional bilinear<br> performance optimizations patchsets, one of them unfortunately<br> clashing with your patch. 
Not to mention that NEON optimized<br> 'over_8888_8888' and 'over_8888_565' with bilinear scaled source are<br> also part of my plan, even though they are not immediately available<br> as of today.<br></blockquote><div><br></div><div>I just needed some performance data right away at that time,</div><div>and I'm waiting for your patches for the other bilinear operations to be released :-)</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"> There are two pipeline stalls here on ARM Cortex-A8/A9. Most of NEON<br> instructions have latency higher than 1 and you can't use the result<br> of one instruction immediately in the next cycle without suffering<br> from performance penalty. A simple reordering of instructions resolves<br> the problem easily at least for this case:<br> <div class="im"><br> vuzp.8 d0, d1<br> vuzp.8 d2, d3<br> vuzp.8 d0, d1<br> vuzp.8 d2, d3<br> <br></div>And unfortunately here we have really a lot of pipeline stalls which<br> are a bit difficult to hide. This all does not make your solution bad,<br> and it indeed should provide a really good speedup over C code. But it<br> surely can be done a bit better.</blockquote></div><div><br></div>I cannot find a reordering that avoids the pipeline stalls in the blending and interleaving steps.</div><div>For the vmul, vadd, and vqadd instructions, the destination registers only become available at cycle N6 or N4.</div> <div>When processing four pixels at a time, the stalls seem hard to avoid.</div><div>I think combining eight pixels at once would be better suited to software pipelining.</div> <div>I also expect that proper prefetching and aligned writes will significantly increase performance.</div><div><br></div><div>I hope to see your patches soon.</div><div>Please leave some comments on my patch as well.</div> <div><br></div><div>Thank you.</div><div><br>-- <br>Best Regards,<div>Taekyun Kim</div><br> </div>