[Pixman] [PATCH] ARM: NEON: optimization for bilinear scaled 'over 8888 8888'

Mon Apr 4 04:47:52 PDT 2011

On Wed, Mar 16, 2011 at 9:33 AM, Taekyun Kim <podain77 at gmail.com> wrote:
> Hi,
> I'm sorry about that I have made some mistakes in previous patch.
> I have mistaken that q4~q7 registers are available for my functions.

No need to be sorry, no harm has been done :) Yeah, with assembly
optimizations one needs to be careful about every little thing.
Luckily pixman has a test suite which can detect almost all potential
problems with this code. Once the automated tests pass, the code needs
to pass a final round of human review just in case.

> Now it passes pixman scaling tests.

OK, thanks. That's good.

>> Also a bit tricky part is that I'm also still working on more pixman
>> ARM NEON optimizations and I'm about to submit two additional bilinear
>> performance optimizations patchsets, one of them unfortunately
>> clashing with your patch. Not to mention that NEON optimized
>> 'over_8888_8888' and 'over_8888_565' with bilinear scaled source are
>> also part of my plan, even though they are not immediately available
>> as of today.
>
> I just needed some performance data immediately at that time

It's understandable. When you have a tight schedule, it's perfectly
acceptable to hack in some local patches which do the job. And in the
long run we all want to have a clean and fast code in pixman.

> and I'm waiting your patches for other bilinear operations to be released
> :-)

Sorry, I was a little bit sick recently and dropped out for a while.
Hopefully everything can get back on track now.

>> There are two pipeline stalls here on ARM Cortex-A8/A9. Most of NEON
>> instructions have latency higher than 1 and you can't use the result
>> of one instruction immediately in the next cycle without suffering
>> from performance penalty. A simple reordering of instructions resolves
>> the problem easily at least for this case:
>>
>> vuzp.8 d0, d1
>> vuzp.8 d2, d3
>> vuzp.8 d0, d1
>> vuzp.8 d2, d3
>>
>> And unfortunately here we have really a lot of pipeline stalls which
>> are a bit difficult to hide. This all does not make your solution bad,
>> and it indeed should provide a really good speedup over C code. But it
>> surely can be done a bit better.
>
> I cannot find proper reordering to avoid pipeline stalls in blending and
> interleaving.
> The destination registers will be available at N6 or N4 cycle for vmul,
> vadd, vqadd instructions.
> In the case of four pixels, it seems hard to avoid pipeline stalls.

Software pipelining is useful for avoiding stalls, and given enough
efforts it's possible to get rid of all of them (a few may remain
unnoticed though in our non-prefect world).

> I think combining eight pixels at once will be more suitable for SW
> pipelining.

Unrolling to process 8 pixels per loop iteration is more convenient
for compositing operations such as OVER, because it allows to use the
full width of NEON registers more efficiently. So we execute less
instructions for doing the same amount of work. A drawback is that it
roughly doubles the size of code.

> And I also expect that proper prefeching and aligned write will
> significantly increase the performance.

Improvement from doing aligned writes does not actually provide a
really significant improvement, but it's surely measurable and
cumulative with other optimizations.

> I hope to see your patches soon

> And please leave some comments on my patch.

Sure, here they are:

+/* Combine function for operator OVER */
+.macro bilinear_combine_over dst_fmt, numpix
+	vpush		{ q4, q5 }

It's better to do vpush/vpop at the function entry/exit and not in the
inner loop.

+/* Destination pixel load functions for bilinear_combine_XXXX */
+.macro bilinear_load_dst_8888 numpix
+.if numpix == 4
+	pld			[OUT, #64]

It's better to have the same N pixels ahead prefetch distance for both
source and destination.

The whole prefetch stuff works in the following way. Using PLD
instruction, you poke some address in memory which is going to be
accessed soon, and then you need to keep CPU busy doing something else
for hundred(s) of cycles before the data is fetched into cache. So
it's not something like "let's prefetch one cache line ahead" because
such prefetch distance can be easily too small. Optimal prefetch
distance depends on how heavy are the computations done in the
function per loop iteration. So for unscaled fast paths or nearest
scaling we need to prefetch really far ahead. For more CPU heavy
bilinear interpolation, prefetch distance can be a lot smaller. But
the rough estimation is the following: if we prefetch N pixels ahead,
and we use X cycles per pixel, then memory controller has N * X cycles
time to fetch the needed data into cache before we attempt to access
this data. Of course everything gets messy when memory bandwidth is
the bottleneck and we are waiting for memory anyway, also there is
cache line granularity to consider, etc. But increasing prefetch
distance in an experimental way until it stops providing performance
improvement still works well in practice :) Still the point is that we
select some prefetch distance, and it is N pixels ahead, synchronized
for both source and destination fetches.

-- 
Best regards,
Siarhei Siamashka