[Pixman] [PATCH] Add support for aarch64 neon optimization

Thu Apr 7 12:49:11 UTC 2016

Hi Siarhei,

> It is always possible to find time for fixing bugs and reviewing the
> code, which is doing something obviously useful.

I think we have been unlucky because it seems to me that our schedules
haven't aligned very well. I have put time aside on a few occasions over
the last couple of years to make a concentrated effort to get my ARM
patches accepted, but there hasn't been anyone around with the ability to
review them at those times. I can see you're working on it at present, but
for personal reasons I don't have much time right now for detailed
discussion, benchmarking and reworking of my patches!

I still read the list though, so I was naturally very interested to see
AArch64 being discussed. I hate to see effort going to waste though,
whether it's mine or anyone else's, and it looked like Mizuki was unaware
of my work, so I thought I should mention it.

> I did give this branch a try. And I'm not sure if you genuinely did
> not notice this fact, but with your runs of "lowlevel-blt-bench -b"
> tests, you are actually benchmarking a special code path for
> horizontal-only scaling against the code that is doing scaling in both
> directions. That's an obvious flaw in the benchmark itself, which gives
> you misleading results!

In the past I had criticism for posting patch series that were too long to
be easily reviewed, so I have been dribbling them in piece by piece. I see
you also identified the limitations of lowlevel-blt-bench when it comes to
scaled plots - that's why I created affine-bench. That program *has* been
accepted into git, although my separable bilinear fetchers haven't yet,
even though they were the reason why I wrote it! Perhaps you may not have
noticed it? That specifically allows different scaling factors to be
tested. For example, for my ARMv6 separable bilinear filter code, I
recorded benchmarks in my commit log as follows:

   Improvement:
       x increment   0.5     0.75    1.0     1.5     2.0
   y increment
       0.5           +196.7% +183.6% +181.8% +206.6% +198.4%
       0.75          +182.2% +166.2% +164.0% +194.8% +185.8%
       1.0           +271.7% +234.4%         +282.7% +257.9%
       1.5           +154.6% +135.3% +134.3% +173.3% +164.8%
       2.0           +144.1% +124.2% +123.3% +165.6% +155.5%

It looks like I didn't record them in the ARMv7 separable bilinear filter
commit, but some of my private notes indicate that before I wrote the ARMv7
separable bilinear filter, I was measuring my ARMv6 separable bilinear
filter as around 20% faster on Cortex-A7 than the old ARMv7 bilinear fast
path for small y increments, even though it didn't have the advantage of
being able to use NEON instructions. That's how I decided it was worth the
effort of writing the ARMv7 separable filter in the first place.

The reason why I didn't include them is because profiling takes a lot of
time, and because I was intending to dribble the commits to the mailing
list in small batches, I decided to defer the detailed profiling until I
was ready to post those patches, in case any other patches accepted in the
meantime affected the numbers.

It is interesting to notice that the ARMv6 improvements in my table above
agree with your assessment that y factors of 1 are not particularly
representative. It also shows that the separable filters show the most
improvement when the y increment is small. This makes sense because the
buffer containing first-pass scaled data gets more re-use when the y factor
is small. I don't think it's unreasonable to pay close attention to the
performance for small y increments, because that's exactly the case where
you would expect to use bilinear filtering if you care about image quality.
For increments above 1, image quality is best served by a multi-tap filter
instead.

> Your implementations are providing better performance on ARM Cortex-A7
> for the special case of horizontal-only scaling. But when we are doing
> general purpose scaling (for example the same scale factor for the x
> and y axes), the performance is more or less comparable between the
> existing bilinear fast paths and your new implementation.
>
> And if we take a different processor, for example ARM Cortex-A9, then
> everything changes. Now the general purpose scaling is much slower
> with your code. And even the horizontal-only special case is somewhat
> slower than the old fast paths which run general purpose bilinear
> scaling.

OK, that's a bit disappointing, though it should be offset to some extent
by the reduction in calculations required when using scaling factors less
than 1, and of course the fact that the fetchers are applicable to a much
wider range of operations than the fast paths are.

It's possible I've chosen some code sequences that are particularly painful
for Cortex-A9. Since ARM have stopped publishing details of cycle counts,
interlocks etc, it has become quite difficult to hand-schedule code, and
IIRC at least the NEON pipeline in the A9 is in-order, like the A8, A5, A7
and A53, and whilst they can be figured out by experiment, it's very time
consuming and requires access to lots of hardware.

There is the question of whether we would want Pixman to have separate
routines optimized for each core, or just try to choose one routine which
is best all-round. But then who decides which cores are most important to
test with?

> However if we try to do a really fair comparison, then we need to get
> rid of the redundant memcpy for the bilinear src_8888_8888 operation in
> the iterators based implementation. I had some old patches for this:
>     https://lists.freedesktop.org/archives/pixman/2013-September/002879.html

Interesting. That seems a reasonable approach, and levels the playing field
a bit between iterators and fast paths. My pass-2 (vertical interpolation)
code currently assumes that the cacheline alignment is the same between its
input and output buffers, but I can imagine that losing the memcpy that
follows it for SRC operations could make it a net win. And in theory I
could add versions of the pass-2 code to support differing alignments to
improve the situation further.

> I'll try to tune the single-pass bilinear scaling code for Cortex-A53
> to see if it can become more competitive. There are some obvious
> performance problems in it. For example, Cortex-A53 wants the ARM
> and NEON instructions to be interleaved in perfect pairs, so we can't
> have separate chunks of ARM and NEON code anymore

Also interesting. I wasn't aware of that about the A53, although I admit I
haven't gone looking for information about its microarchitecture yet. Are
you aware of anywhere that information about the quirks of this and other
cores are documented?

> But you don't need to worry about this right now. Let's focus at one
> task at a time. So it's probably best to use the current pixman code
> for the initial AArch64 conversion round. And at the same time we can
> try to look at integrating your patches as 32-bit code first.

Thanks for all your detailed analysis, Siarhei. It's a very pleasant change
after months of my patches being ignored or getting bogged down in
discussing minutiae. If you've actually got time to look at some of my code
at present, I could perhaps repost a few of them to get things started,
perhaps focusing on the NEON ones as they're the ones pertinent to the
AArch64 conversion. I can't promise to make major reworks to anything in
the short term, but maybe at least some of them might be straightforward to
get accepted. I'd love to be able to merge my branch fully eventually; at
the moment Raspberry Pi is using my branch, but it'll cause a big headache
if/when they move to AArch64.

Ben