[Pixman] [PATCH] Add support for aarch64 neon optimization

Sat Apr 2 05:31:31 UTC 2016

Hi,

On Wed, 30 Mar 2016 20:47:01 +0900
Mizuki Asakura <ed6e117f at gmail.com> wrote:

> Since aarch64 has different neon syntax from aarch32 and has no
> support of (older) arm-simd,
> there are no SIMD accelerations for pixman on aarch64.
> 
> We need new implementations.

Thanks for starting this work. That's an excellent news. Some very
affordable and quite capable 64-bit ARM development boards (PINE64
and ODROID-C2) became available since the beginning of March 2016.
Their users clearly need pixman optimizations for running Linux
desktop faster on 64-bit systems. If Raspberry Pi 3 gets 64-bit
software support, then their users will need these optimizations too.

For the last week or so I have been also thinking about what would be
the best way to add 64-bit ARM assembly support to pixman. Indeed the
NEON assembly syntax has changed quite radically in binutils compared
to the 32-bit ARM variant. While the NEON instructions themselves
remain mostly the same, with just a doubled number of registers and
some incompatible changes affecting a few instructions.

There were also some recurring requests asking for iOS support
in the past. The assembler in iOS happens to have a somewhat
incompatible syntax too:
    https://lists.freedesktop.org/archives/pixman/2011-March/001110.html

And the ILP32 ABI variant for AArch64 (something similar to the x32
ABI) may potentially make our life harder in the future.

The main concern is whether we can get support for all these
assembly syntax flavours, calling conventions and minor instructions
differences, while keeping the code maintainable. Preferably we should
avoid doing a lot of copy-paste job. Replicating the same changes to
multiple source files is rather inconvenient.

Yes, the compiler people are always advocating the use of intrinsics
as a universal solution. But the compiler generated code is typically
not very good and some performance is lost.

I'm currently investigating the possibility of maybe "inventing" a
common unified syntax for some preprocessor script to take care of
both 32-bit and 64-bit ARM code. But we still can't be sure if this
approach is practical (or even doable) and whether it can solve more
problems than introduce new ones...

Still as a short term solution (and we know that "temporary" solutions
sometimes end up being rather permanent), introducing separate assembly
source files for AArch64 is quite reasonable now. So your patch is very
handy and comes exactly at the right time.

> Added: https://bugs.freedesktop.org/show_bug.cgi?id=94758
> 
> 
> The patch is too large (yes, 10k lines of asm are added), I cannot
> attach the patch to this mail.
> Please find proposed patch from the above bug ticket, or just check:
> https://bugs.freedesktop.org/attachment.cgi?id=122634

Yes, the code review may be a bit difficult in this particular case.
Because it is not a completely new implementation, but a direct
conversion of the existing 32-bit ARM NEON code, it seems to be most
useful to see the AArch64 code side by side with the current 32-bit
ARM code. To do this, I have just copied the existing source files:

  pixman-arm-neon-asm-bilinear.S -> pixman-arma64-neon-asm-bilinear.S
  pixman-arm-neon-asm.S          -> pixman-arma64-neon-asm.S
  pixman-arm-neon-asm.h          -> pixman-arma64-neon-asm.h

And then applied your modifications on top of these files:
    https://cgit.freedesktop.org/~siamashka/pixman/commit/?h=20160331-arm64-review&id=1a1c6b2259bf5b3ecda96a2e2ef0e8b8b0a26c57&context=5&ignorews=1&ss=1

In general the patch looks quite good, even though some redundant MOV
instructions are added in a lot of places in this first revision.

One thing offhand. This particular fix probably belongs to a separate
small patch:

     /* stride is always multiple of 32bit units in pixman */
-    uint32_t byte_stride = stride * sizeof(uint32_t);
+    int32_t byte_stride = stride * sizeof(uint32_t);

That's a good catch. The stride may be indeed negative. And the same
fix needs to be applied to the pixman-arm-simd.c code.

> Benchmark results are attached above ticket.
> Typical result is:
> 
> normal:
>             src_n_8_x888 =  L1:  38.33  L2:  40.58  M: 39.91 ( 11.87%)
>  HT: 31.31  VT: 30.42  R: 29.14  RT: 18.14 ( 171Kops/s)
>             src_n_8_8888 =  L1:  38.37  L2:  40.61  M: 39.92 ( 11.87%)
>  HT: 31.30  VT: 30.41  R: 29.14  RT: 18.11 ( 171Kops/s)
> 
> neon:
>             src_n_8_x888 =  L1: 344.76  L2: 348.59  M:275.93 ( 80.42%)
>  HT:116.32  VT:109.72  R: 92.61  RT: 40.25 ( 348Kops/s)
>             src_n_8_8888 =  L1: 346.17  L2: 348.63  M:276.15 ( 80.48%)
>  HT:116.43  VT:109.72  R: 92.48  RT: 40.28 ( 348Kops/s)

That's a good and very much expected performance improvement. The
generic C code does not stand a chance and there is not doubt that
we want to have NEON optimizations for AArch64 too :-)

But it is much more interesting to compare the performance of this
patch with the existing 32-bit ARM NEON code. You can build the
32-bit tests programs (including lowlevel-blt-bench) using a 32-bit
ARM crosscompiler either on your desktop PC or on the ARM board:

   ./configure --host=arm-linux-gnueabihf --enable-static-testprogs \
               --disable-libpng --disable-gtk
   make

> I've only tested the code with Qualcomm DragonBoard 410c (Cortex-A53 *
> 4, 1.2GHz).
> Can anyone test it on other aarch64 platform ?

Yes, I recently got a PINE64+ board myself. It is a very similar quad
Cortex-A53, running at 1.15GHz.

When comparing the 32-bit and 64-bit NEON assembly, we need to pay
special attention to the performance of handling data sets that
are larger than CPU caches (the 'M:' numbers in the lowlevel-blt-bench
log). For simple non-scaled fast paths, the processor usually can do a
lot more number crunching calculations than the memory subsystem can
fetch and store from/to DRAM. The CPU is just doing nothing and waiting
for the data from memory.

Benchmarks on ARM Cortex-A53 @1.15GHz (PINE64 board):
  == test with PIXMAN_DISABLE=arm-neon (memcpy from glibc) ==
           src_8888_8888 =  L1:1127.00  L2: 924.24  M:215.66

  == 64-bit implementation ==
           src_8888_8888 =  L1: 523.44  L2: 451.15  M:156.74
          over_8888_8888 =  L1: 200.53  L2: 181.26  M: 89.01
            src_n_8_8888 =  L1: 329.63  L2: 332.80  M:230.29

  == 32-bit implementation ==
           src_8888_8888 =  L1: 558.72  L2: 491.83  M:236.01
          over_8888_8888 =  L1: 221.18  L2: 212.86  M:144.39
            src_n_8_8888 =  L1: 367.30  L2: 399.11  M:292.95

The 'src_8888_8888' fast path is a simple copy operation,
which is implemented using memcpy by default. And
setting the PIXMAN_DISABLE=arm-neon environment variable
can disable the NEON code for testing purposes. Based on the
benchmark numbers above, it looks like the memcpy is much
faster than the 64-bit NEON code in all cases and slightly
slower than the 32-bit NEON code for the 'M:' benchmark.

The old 32-bit processors used to have a lot of problems
with memory performance. They only supported explicit software
prefetch and needed special tricks to utilize the most out of
the memory bandwidth. Some explanations for the Cortex-A8
prefetch behaviour are available in the code comments here:
    https://cgit.freedesktop.org/pixman/tree/pixman/pixman-arm-neon-asm.h?id=pixman-0.34.0#n367

The 'advanced' prefetch type is implemented by having some
branchless ARM code, doing address calculations and running
the PLD based prefetch some distance ahead of the real
memory accesses. It can cross scanline boundaries too. And
it tries to use the LDR instruction instead of PLD in the
beginning of each new scanline, because Cortex-A8 ignored
PLD prefetches on TLB misses.

Well, nowadays modern ARM processors got automatic hardware
prefetchers, which greatly reduce or eliminate the need for
doing prefetch in software. Here is a test patch for pixman,
which is disabling software prefetch:
    https://cgit.freedesktop.org/~siamashka/pixman/commit/?h=20160401-arm64-review&id=085afbc5350269ed43dcbd0562dc762406bd26e0

And we can get some better performance with it:

  == after the patch (no prefetch) ==
           src_8888_8888 =  L1: 819.05  L2: 659.31  M:212.36
          over_8888_8888 =  L1: 232.38  L2: 210.32  M:108.26
            src_n_8_8888 =  L1: 429.21  L2: 426.69  M:300.12

Admittedly, the software prefetch would work better, if it
used a better prefetch hint ('l1keep' instead of 'l2stream').
Still the old 32-bit branchless code can't be branchless
anymore in the 64-bit mode because we lost conditional
instructions execution support.

So it looks like the prefetch either needs to be dropped in
the 64-bit code, or we can still experiment with other prefetch
strategies a bit more and find something that is more suitable.

==========

The bilinear scaling optimizations need more processing power and
are limited by the speed of the calculations done by the CPU.
That's why it is important to remove all the redundant MOV
instructions for that part of code. Comparing
    "./lowlevel-blt-bench -b src_8888_888"
results between the 32-bit and 64-bit implementations:

  == 64-bit NEON code ==
    bilinear src_8888_8888 =  L1:  53.32  L2:  52.25  M: 50.97

  == 32-bit NEON code ==
    bilinear src_8888_8888 =  L1:  78.85  L2:  77.69  M: 75.86

As the numbers above show, in your current patch the bilinear
src_8888_8888 fast path is ~1.5x slower than the old 32-bit
NEON code. This happens because of a lot of redundant instructions,
which can be removed:
    https://cgit.freedesktop.org/~siamashka/pixman/commit/?h=20160401-arm64-review&id=2f8c71416232bb714bb2420440333496b36fbfae
    https://cgit.freedesktop.org/~siamashka/pixman/commit/?h=20160401-arm64-review&id=76ad1ba645489e6f987da72a7e7f9fa3ef72141c

The redundant MOV instructions can be eliminated by making use
of UMLAL2/SHRN2 instructions (or any other instructions with
the '2' suffix in their names).

The redundant ARM shift instructions are also not necessary.
Moreover, this code can be the same in both 32-bit and 64-bit
code. I have just submitted a patch for the 32-bit NEON code,
which can reduce the difference between the 32-bit and 64-bit
code:
    https://lists.freedesktop.org/archives/pixman/2016-April/004489.html

Applying these MOV and shifts removal patches on top of
your patch makes the bilinear src_8888_8888 approximately
as fast, as the 32-bit implementation. Benchmarks on
ARM Cortex-A53 @1.15GHz (PINE64 board):
  == before the patches ==
    bilinear src_8888_8888 =  L1:  53.32  L2:  52.25  M: 50.97

  == after the patches ==
    bilinear src_8888_8888 =  L1:  77.49  L2:  75.12  M: 73.04

  == 32-bit implementation for comparison ==
    bilinear src_8888_8888 =  L1:  78.85  L2:  77.69  M: 75.86

Looks much better. Now below is the link comparing the original
32-bit NEON code (+more compatible shifts syntax) with your 64-bit
implementation (+my extra patches for redundant instructions removal):
    https://cgit.freedesktop.org/~siamashka/pixman/diff/?h=20160401-arm64-review&id2=e134f2f002cffd0e&id=20160401-arm64-review&context=5&ignorews=0&ss=1

The 'bilinear_interpolate_four_pixels_8888_8888_tail_head' macro
there shows that we can use exactly the same number of instructions
in both the 32-bit and 64-bit implementations of bilinear src_8888_8888.

===========================================

Now let's see what can be done next.

I would suggest to first take the 'pixman-arma64-neon-asm-bilinear.S'
file out of your initial patch submission in order to reduce its size
and make the review process more manageable. Such reduced patch can be
probably small enough to reach the mailing list.

At least right now it makes sense to minimize the differences between
the 32-bit and 64-bit assembly sources to keep them more in sync. And
minimizing the difference can be done from both sides. That's what the
    https://lists.freedesktop.org/archives/pixman/2016-April/004489.html
patch is trying to do on the 32-bit side.

I also see a lot of label renames in your 64-bit code. Were all of them
really necessary? Maybe we should also rename the labels in the 32-bit
assembly source file if such renames are justified?

The 64-bit patch should be probably applied on top of the 32-bit
assembly source copy, similar to how it is done in my test branch:
    https://cgit.freedesktop.org/~siamashka/pixman/log/?h=20160401-arm64-review

In the end, we probably want the 64-bit NEON code to be as fast as
the 32-bit NEON code. The lowlevel-blt-bench can be used to identify
the most significant differences. The problematic functions can be
inspected, redundant MOV and/or other instructions removed. Then
rinse and repeat...

After we are done with 'pixman-arma64-neon-asm.S', we can then
move to 'pixman-arma64-neon-asm-bilinear.S' and do something
about it too. I will not go into the details right now (we can
discuss it later), but this 'pixman-arma64-neon-asm-bilinear.S'
file is much less important and we probably don't care much
about optimizing it well.

This post has already got rather long, so I'll stop here for now.

Does anyone have any better plan about how to deal with AArch64 support?
Comments and suggestions are welcome.

-- 
Best regards,
Siarhei Siamashka