[Pixman] [PATCH] Add support for aarch64 neon optimization

Tue Apr 5 14:00:07 UTC 2016

Hi Ben,

On Mon, 04 Apr 2016 19:53:36 +0100
"Ben Avison" <bavison at riscosopen.org> wrote:

> On Sat, 02 Apr 2016 13:30:58 +0100, Mizuki Asakura <ed6e117f at gmail.com> wrote:
> > This patch only contains STD_FAST_PATH codes, not scaling (nearest,
> > bilinear) codes.  
> 
> Hi Mizuki,
> 
> It looks like you have used an automated process to convert the AArch32
> NEON code to AArch64. Will you be able to repeat that process for other
> code, or at least assist others to repeat your steps?
> 
> The reason I ask is that I have a large number of outstanding patches to
> the ARM NEON support. The process of getting them merged into the
> FreeDesktop git repository has been very slow because there aren't many
> people on this list with the time and ability to review them, however my
> versions are in many cases up to twice the speed of the FreeDesktop
> versions, and it would be a shame if AArch64 couldn't benefit from them.

It is always possible to find time for fixing bugs and reviewing the
code, which is doing something obviously useful.

> If your AArch64 conversion is a one-time thing, it will make make it
> extremely difficult to merge my changes in.

Yes, the way how we are going to keep the 32-bit and 64-bit code in sync
is one of the concerns that have to be addressed.

But you don't need to worry about this right now. Let's focus at one
task at a time. So it's probably best to use the current pixman code
for the initial AArch64 conversion round. And at the same time we can
try to look at integrating your patches as 32-bit code first.

> > After completing optimization this patch, scaling related codes should be done.  
> 
> One of my aims was to implement missing "iter" routines so as to accelerate
> scaled plots for a much wider combination of pixels formats and Porter-Duff
> combiner rules than the existing limited selection of fast paths could
> cover. If you look towards the end of my patch series here:
> 
> https://github.com/bavison/pixman/commits/arm-neon-release1
> 
> you'll see that I discovered that I was actually outperforming Pixman's
> existing bilinear plotters so consistently that I'm advocating removing
> them entirely,

Please hold your horses!

I did give this branch a try. And I'm not sure if you genuinely did
not notice this fact, but with your runs of "lowlevel-blt-bench -b"
tests, you are actually benchmarking a special code path for
horizontal-only scaling against the code that is doing scaling in both
directions. That's an obvious flaw in the benchmark itself, which gives
you misleading results! I have already sent a fix for this problem:

    https://lists.freedesktop.org/archives/pixman/2016-April/004511.html

And I have also pushed your patches with this lowlevel-blt-bench fix to
the following git branch:

    https://cgit.freedesktop.org/~siamashka/pixman/log/?h=20160405-arm-neon-release1-from-bavison

Reverting your removal of the existing bilinear plotters allows us to
benchmark the implementations against each other via setting the
PIXMAN_DISABLE=wholeops environment variable. Here are the results
that I got:

===========================
== ARM Cortex-A7 @1.3GHz ==
===========================

    $ ./lowlevel-blt-bench -b ... (old NEON bilinear fast paths)

           src_8888_8888 =  L1:  51.40  L2:  43.62  M: 48.08 ( 35.48%)  HT: 32.84  VT: 31.14  R: 27.95  RT: 14.12 ( 102Kops/s)
           src_8888_0565 =  L1:  48.61  L2:  44.98  M: 46.56 ( 25.72%)  HT: 32.31  VT: 32.22  R: 26.04  RT: 13.08 (  98Kops/s)
          over_8888_8888 =  L1:  40.44  L2:  34.28  M: 34.44 ( 25.39%)  HT: 22.75  VT: 22.67  R: 19.45  RT:  9.56 (  82Kops/s)

    $ PIXMAN_DISABLE=wholeops ./lowlevel-blt-bench -b ... (new separable NEON iterators)

           src_8888_8888 =  L1:  45.32  L2:  56.91  M: 47.55 ( 34.96%)  HT: 34.17  VT: 29.90  R: 26.83  RT: 10.79 (  80Kops/s)
           src_8888_0565 =  L1:  38.08  L2:  47.27  M: 48.13 ( 26.61%)  HT: 31.18  VT: 25.89  R: 23.72  RT:  8.87 (  70Kops/s)
          over_8888_8888 =  L1:  36.15  L2:  34.68  M: 33.45 ( 24.66%)  HT: 24.91  VT: 20.88  R: 19.38  RT:  8.23 (  68Kops/s)

    $ PIXMAN_DISABLE=wholeops ./lowlevel-blt-bench -bh ... (new separable NEON iterators, only horizontal)

           src_8888_8888 =  L1:  92.51  L2:  74.43  M: 65.66 ( 47.27%)  HT: 40.34  VT: 33.82  R: 30.02  RT: 11.76 (  84Kops/s)
           src_8888_0565 =  L1:  71.68  L2:  63.86  M: 58.99 ( 32.20%)  HT: 36.19  VT: 28.96  R: 26.57  RT:  9.68 (  74Kops/s)
          over_8888_8888 =  L1:  61.80  L2:  44.28  M: 40.05 ( 29.13%)  HT: 27.69  VT: 22.71  R: 21.11  RT:  8.86 (  71Kops/s)

=============================
=== ARM Cortex-A9 @1.4GHz ===
=============================

    $ ./lowlevel-blt-bench -b ... (old NEON bilinear fast paths)

           src_8888_8888 =  L1: 115.82  L2: 113.35  M:107.33 (117.45%)  HT: 57.95  VT: 48.20  R: 39.86  RT: 20.75 ( 140Kops/s)
           src_8888_0565 =  L1: 105.67  L2: 104.18  M: 99.80 ( 82.02%)  HT: 54.40  VT: 46.61  R: 37.24  RT: 19.10 ( 134Kops/s)
          over_8888_8888 =  L1:  80.68  L2:  79.07  M: 75.95 ( 83.09%)  HT: 38.69  VT: 29.25  R: 26.37  RT: 13.79 ( 112Kops/s)

    $ PIXMAN_DISABLE=wholeops ./lowlevel-blt-bench -b ... (new separable NEON iterators)

           src_8888_8888 =  L1:  52.17  L2:  69.91  M: 49.56 ( 54.43%)  HT: 43.70  VT: 36.51  R: 31.96  RT: 16.02 ( 112Kops/s)
           src_8888_0565 =  L1:  43.51  L2:  61.72  M: 47.09 ( 38.76%)  HT: 37.19  VT: 31.19  R: 27.30  RT: 12.64 (  97Kops/s)
          over_8888_8888 =  L1:  44.85  L2:  52.69  M: 25.41 ( 27.85%)  HT: 24.53  VT: 21.33  R: 20.29  RT: 11.73 (  94Kops/s)

    $ PIXMAN_DISABLE=wholeops ./lowlevel-blt-bench -bh ... (new separable NEON iterators, only horizontal)

           src_8888_8888 =  L1:  99.47  L2:  87.14  M: 63.50 ( 69.66%)  HT: 48.91  VT: 40.11  R: 35.05  RT: 17.69 ( 118Kops/s)
           src_8888_0565 =  L1:  81.97  L2:  76.98  M: 61.69 ( 50.79%)  HT: 40.95  VT: 33.84  R: 29.54  RT: 13.74 ( 102Kops/s)
          over_8888_8888 =  L1:  82.53  L2:  62.73  M: 29.01 ( 31.82%)  HT: 26.11  VT: 22.32  R: 21.50  RT: 12.67 (  99Kops/s)

============================

The accuracy of measurements is not perfect, but it is good enough to
easily see the pattern.

Your implementations are providing better performance on ARM Cortex-A7
for the special case of horizontal-only scaling. But when we are doing
general purpose scaling (for example the same scale factor for the x
and y axes), the performance is more or less comparable between the
existing bilinear fast paths and your new implementation.

And if we take a different processor, for example ARM Cortex-A9, then
everything changes. Now the general purpose scaling is much slower
with your code. And even the horizontal-only special case is somewhat
slower than the old fast paths which run general purpose bilinear
scaling.

I would say that rather than dropping the existing bilinear scaling
code, it might be worth trying to implement the horizontal-only and
the vertical-only special cases to see how much faster it can get.

However if we try to do a really fair comparison, then we need to get
rid of the redundant memcpy for the bilinear src_8888_8888 operation in
the iterators based implementation. I had some old patches for this:
    https://lists.freedesktop.org/archives/pixman/2013-September/002879.html

These patches were used in an old git branch:
    https://cgit.freedesktop.org/~siamashka/pixman/log/?h=ssse3-bilinear-fast-path-test

which I used for comparing the fast path based SSE2/SSSE3 bilinear
scaling implementation with the separable SSSE3 implementation on
two x86 processors: Core2 T7300 (Merom) and Core i7 860 (Nehalem):
    https://lists.freedesktop.org/archives/pixman/2013-September/002892.html
    https://people.freedesktop.org/~siamashka/files/20130905/ssse3-scaling-bench.png

It shows that on x86 processors, the straightforward single pass
implementation works faster for downscaling, but becomes worse than
the separable SSSE3 scaling code on upscaling. There is a crossover
point, which may be different for different processors.

Now I have rebased all this old code on a current pixman master
branch, applied your patches with the separable NEON bilinear scaling
implementation and have done pretty much the same benchmarks. This
git branch is here:
    https://cgit.freedesktop.org/~siamashka/pixman/log/?h=20160405-separable-neon-bilinear-test

This particular change makes your implementation faster by getting rid
of the redundant memcpy:
    https://cgit.freedesktop.org/~siamashka/pixman/commit/?h=20160405-separable-neon-bilinear-test&id=c8c12d62098bf5b22828f851a035b6c4ecf0fc0b

And here are the results from different ARM processors (all the
benchmark scripts are also included):
    https://people.freedesktop.org/~siamashka/files/20160405-arm-bilinear/

It appears that the existing bilinear scaling code (src_8888_8888
ARM NEON fast path) works faster for downscaling on Cortex-A8,
Cortex-A9, Cortex-A15 and Qualcomm Krait:
    https://people.freedesktop.org/~siamashka/files/20160405-arm-bilinear/neon-scaling-bench-a8-a9.png
    https://people.freedesktop.org/~siamashka/files/20160405-arm-bilinear/neon-scaling-bench-a15-krait.png

Also your separable implementation is always faster for both
upscaling/downscaling on ARM Cortex-A7, while the crossover point
for ARM Cortex-A53 is currently around ~0.7x scaling factor, which
is not exactly great:
    https://people.freedesktop.org/~siamashka/files/20160405-arm-bilinear/neon-scaling-bench-a7-a53.png

> with the additional advantage that it simplifies the code
> base a lot. So you might want to consider whether it's worth bothering
> converting those to AArch64 in the first place.

I'll try to tune the single-pass bilinear scaling code for Cortex-A53
to see if it can become more competitive. There are some obvious
performance problems in it. For example, Cortex-A53 wants the ARM
and NEON instructions to be interleaved in perfect pairs, so we can't
have separate chunks of ARM and NEON code anymore:
    https://cgit.freedesktop.org/pixman/tree/pixman/pixman-arm-neon-asm.S?id=pixman-0.34.0#n3313

Now the instructions need to be mixed. Cortex-A8 had separate
deep queue for NEON instructions, so it did not mind. As a side
effect, now the "advanced prefetch" instructions sequence in the
32-bit ARM code needs to be carefully interleaved with the NEON
instructions if we want it to run fast on Cortex-A53.

Another problem is that now there is at least 1 cycle stall between
vshll.u16 and vmlsl.u16 here:

        vshll.u16 q2, d20, #BILINEAR_INTERPOLATION_BITS
        vmlsl.u16 q2, d20, d30
        vmlal.u16 q2, d21, d30

Cortex-A8 could run this instructions sequence in 3 cycles without
stalls. Maybe there is something else. I have a sample code here
if anyone cares to experiment:
    https://gist.github.com/ssvb/343379ceeb6d017c0023424b70fc90e2

Anyway, I'll try to make a Cortex-A53 optimized version and also a
special horizontal scaling shortcut. Let's see how fast it can
be :-) 

> I would maybe go so far as to suggest that you try converting all the iters
> first and only add fast paths if you find they do better than the iters.
> One of the drawbacks of using iters is that the prefetch code can't be as
> sophisticated - it can't easily be prefetching the start of the next row
> while it is still working on the end of the current one. But since hardware
> prefetchers are better now and conditional execution is hard in AArch64,
> this will be less of a drawback with AArch64 CPUs.
> 
> I'll also repeat what has been said, that it's very neat the way the
> existing prefetch code sneaks calculations into pipeline stalls, but it was
> only ever really ideal for Cortex-A8. With Cortex-A7 (despite the number,
> actually a much more recent 32-bit core) I noted that it was impossible to
> schedule such complex prefetch code without adding to the cycle count, at
> least when the images were already in the cache.

Yes, Cortex-A7 is a pretty awful processor. It had the NEON unit cut
in half (can process only 64 bits of data per cycle) and also its
automatic hardware prefetcher can track just a single stream, which
is not good enough for pixman. This is an oddball hardware, which
differs a lot from most of the other ARM processors. I have not had
any experience with any Cortex-A5 based device though, probably it
may be roughly the same or even worse.

Regarding your patches in general. It would be great to have more
comments about the algorithm details, registers layout documentation
and other stuff. It would be great not to regress performance on non
Cortex-A7 processors (we probably need runtime detection and multiple
implementations). It would be also great to have your new iterators
framework eventually shared between ARM, x86 and other architectures
to avoid unnecessary code duplication.

Thanks for you work.

-- 
Best regards,
Siarhei Siamashka