[Pixman] [PATCH] Add support for aarch64 neon optimization

Sat Apr 2 11:28:44 UTC 2016

Thanks for your reply.

> The main concern is whether we can get support for all these
> assembly syntax flavours, calling conventions and minor instructions
> differences, while keeping the code maintainable.

In the aarch64 point of view, I believe we should only support ARM's original
syntax and calling convensions. It would be most portable.
In iOS case, they may be able to create some converter script that only
translate instruction name to their own since the architcture is same
(as opposed aarch32-neon vs aarch64-neon is (almost) same, but different).

# How about SSE2 on OS X ? Is it same syntax and calling convension ?

>      /* stride is always multiple of 32bit units in pixman */
> -    uint32_t byte_stride = stride * sizeof(uint32_t);
> +    int32_t byte_stride = stride * sizeof(uint32_t);

It caused horribly crashes in aarch64. Why it works on aarch32 ? :)
(32-bit integer overflow may help them...)

> I'm currently investigating the possibility of maybe "inventing" a
> common unified syntax for some preprocessor script to take care of
> both 32-bit and 64-bit ARM code. But we still can't be sure if this
> approach is practical (or even doable) and whether it can solve more
> problems than introduce new ones...

At first, I've also tried to create a "converter script" from
pixman-arm-neon-asm*.S to
aarch64 neon codes just translating instruction, register names and so on.
But I've found that this script won't work on pixman's code because of:

* conflicting register names: both Qn and Dn are used in same place
* highly optimized for aarch32-neon register structure. assumes d29,
d30 is low / high of q15
* assumes integer register is 32-bit width implicitly

Once the script run correctly and generate something, I have to modify
all of the result.
So I concluded asm sources should be separated.

And I'm also afraid that, to generate aarch64 codes, original
pixman-arm-neon-asm*.S
may be modified, and it cause regressions to existing aarch32 worlds.

The other point of view,
aarch32-neon is "fixed" because ARM won't add new instructions for aarch32
and existing pixman-arm-neon-asm*.S would be completed and stable. So we should
leave it as is.
Then aarch64 may continue to add new features and we should follow
them. So I think
aarch32-neon and aarch64-neon should be separated.
(Cortex-A32 ? Hmm...)

> So it looks like the prefetch either needs to be dropped in
> the 64-bit code, or we can still experiment with other prefetch
> strategies a bit more and find something that is more suitable.

It's a point that I cannot decide that, should we use prefetch (none
or advance),
which memory to use (L1, L2, L3). Can we use L1 for this purpose ?
All of aarch64 have L2 cache ?
I think it should be configurable for each target CPUs.

> + umlsl v0.4s, v2.4h, v15.h[0]
> + umlal2 v0.4s, v2.8h, v15.h[0]

!!!
I don't know the syntax. Yes, it can reduce many unnecessaly register movings.

> Moreover, this code can be the same in both 32-bit and 64-bit
> code. I have just submitted a patch for the 32-bit NEON code,
> which can reduce the difference between the 32-bit and 64-bit
> code:
>     https://lists.freedesktop.org/archives/pixman/2016-April/004489.html

Yes, these "common" and "no-side-effect (to aarch32)" modification would be OK.
But, I'm afraid again, a patch to gain aarch64 performance, make regression
for aarch32 must not be applied.

> I would suggest to first take the 'pixman-arma64-neon-asm-bilinear.S'
> file out of your initial patch submission in order to reduce its size
> and make the review process more manageable. Such reduced patch can be
> probably small enough to reach the mailing list.

OK. I agree.
At first, we should start minimal set of codes. I'll also planned to disable
"nearest" codes, is it OK ? (or anyone use it ? How about "0565" ?)

> I also see a lot of label renames in your 64-bit code. Were all of them
> really necessary? Maybe we should also rename the labels in the 32-bit
> assembly source file if such renames are justified?

I've met some compiler (gnu as) bug.
In nested macro, "gnu-as" failed to deternine the label and it causes
infinite loop.

Since aarch64 have removed many "conditional" syntax, I have to add many
branch instructions especially for prefetching. So the bug occurs (or
I've failed
to follow the gnu-as rule).
I don't want to be annoyed such difficult-to-find behavior, all label names were
renamed not to cause confliction.

Anyway, I'll post new (simplified) patch again.

Regards.

On 2 April 2016 at 14:31, Siarhei Siamashka <siarhei.siamashka at gmail.com> wrote:
> Hi,
>
> On Wed, 30 Mar 2016 20:47:01 +0900
> Mizuki Asakura <ed6e117f at gmail.com> wrote:
>
>> Since aarch64 has different neon syntax from aarch32 and has no
>> support of (older) arm-simd,
>> there are no SIMD accelerations for pixman on aarch64.
>>
>> We need new implementations.
>
> Thanks for starting this work. That's an excellent news. Some very
> affordable and quite capable 64-bit ARM development boards (PINE64
> and ODROID-C2) became available since the beginning of March 2016.
> Their users clearly need pixman optimizations for running Linux
> desktop faster on 64-bit systems. If Raspberry Pi 3 gets 64-bit
> software support, then their users will need these optimizations too.
>
> For the last week or so I have been also thinking about what would be
> the best way to add 64-bit ARM assembly support to pixman. Indeed the
> NEON assembly syntax has changed quite radically in binutils compared
> to the 32-bit ARM variant. While the NEON instructions themselves
> remain mostly the same, with just a doubled number of registers and
> some incompatible changes affecting a few instructions.
>
> There were also some recurring requests asking for iOS support
> in the past. The assembler in iOS happens to have a somewhat
> incompatible syntax too:
>     https://lists.freedesktop.org/archives/pixman/2011-March/001110.html
>
> And the ILP32 ABI variant for AArch64 (something similar to the x32
> ABI) may potentially make our life harder in the future.
>
> The main concern is whether we can get support for all these
> assembly syntax flavours, calling conventions and minor instructions
> differences, while keeping the code maintainable. Preferably we should
> avoid doing a lot of copy-paste job. Replicating the same changes to
> multiple source files is rather inconvenient.
>
> Yes, the compiler people are always advocating the use of intrinsics
> as a universal solution. But the compiler generated code is typically
> not very good and some performance is lost.
>
> I'm currently investigating the possibility of maybe "inventing" a
> common unified syntax for some preprocessor script to take care of
> both 32-bit and 64-bit ARM code. But we still can't be sure if this
> approach is practical (or even doable) and whether it can solve more
> problems than introduce new ones...
>
> Still as a short term solution (and we know that "temporary" solutions
> sometimes end up being rather permanent), introducing separate assembly
> source files for AArch64 is quite reasonable now. So your patch is very
> handy and comes exactly at the right time.
>
>> Added: https://bugs.freedesktop.org/show_bug.cgi?id=94758
>>
>>
>> The patch is too large (yes, 10k lines of asm are added), I cannot
>> attach the patch to this mail.
>> Please find proposed patch from the above bug ticket, or just check:
>> https://bugs.freedesktop.org/attachment.cgi?id=122634
>
> Yes, the code review may be a bit difficult in this particular case.
> Because it is not a completely new implementation, but a direct
> conversion of the existing 32-bit ARM NEON code, it seems to be most
> useful to see the AArch64 code side by side with the current 32-bit
> ARM code. To do this, I have just copied the existing source files:
>
>   pixman-arm-neon-asm-bilinear.S -> pixman-arma64-neon-asm-bilinear.S
>   pixman-arm-neon-asm.S          -> pixman-arma64-neon-asm.S
>   pixman-arm-neon-asm.h          -> pixman-arma64-neon-asm.h
>
> And then applied your modifications on top of these files:
>     https://cgit.freedesktop.org/~siamashka/pixman/commit/?h=20160331-arm64-review&id=1a1c6b2259bf5b3ecda96a2e2ef0e8b8b0a26c57&context=5&ignorews=1&ss=1
>
> In general the patch looks quite good, even though some redundant MOV
> instructions are added in a lot of places in this first revision.
>
> One thing offhand. This particular fix probably belongs to a separate
> small patch:
>
>      /* stride is always multiple of 32bit units in pixman */
> -    uint32_t byte_stride = stride * sizeof(uint32_t);
> +    int32_t byte_stride = stride * sizeof(uint32_t);
>
> That's a good catch. The stride may be indeed negative. And the same
> fix needs to be applied to the pixman-arm-simd.c code.
>
>> Benchmark results are attached above ticket.
>> Typical result is:
>>
>> normal:
>>             src_n_8_x888 =  L1:  38.33  L2:  40.58  M: 39.91 ( 11.87%)
>>  HT: 31.31  VT: 30.42  R: 29.14  RT: 18.14 ( 171Kops/s)
>>             src_n_8_8888 =  L1:  38.37  L2:  40.61  M: 39.92 ( 11.87%)
>>  HT: 31.30  VT: 30.41  R: 29.14  RT: 18.11 ( 171Kops/s)
>>
>> neon:
>>             src_n_8_x888 =  L1: 344.76  L2: 348.59  M:275.93 ( 80.42%)
>>  HT:116.32  VT:109.72  R: 92.61  RT: 40.25 ( 348Kops/s)
>>             src_n_8_8888 =  L1: 346.17  L2: 348.63  M:276.15 ( 80.48%)
>>  HT:116.43  VT:109.72  R: 92.48  RT: 40.28 ( 348Kops/s)
>
> That's a good and very much expected performance improvement. The
> generic C code does not stand a chance and there is not doubt that
> we want to have NEON optimizations for AArch64 too :-)
>
> But it is much more interesting to compare the performance of this
> patch with the existing 32-bit ARM NEON code. You can build the
> 32-bit tests programs (including lowlevel-blt-bench) using a 32-bit
> ARM crosscompiler either on your desktop PC or on the ARM board:
>
>    ./configure --host=arm-linux-gnueabihf --enable-static-testprogs \
>                --disable-libpng --disable-gtk
>    make
>
>> I've only tested the code with Qualcomm DragonBoard 410c (Cortex-A53 *
>> 4, 1.2GHz).
>> Can anyone test it on other aarch64 platform ?
>
> Yes, I recently got a PINE64+ board myself. It is a very similar quad
> Cortex-A53, running at 1.15GHz.
>
> When comparing the 32-bit and 64-bit NEON assembly, we need to pay
> special attention to the performance of handling data sets that
> are larger than CPU caches (the 'M:' numbers in the lowlevel-blt-bench
> log). For simple non-scaled fast paths, the processor usually can do a
> lot more number crunching calculations than the memory subsystem can
> fetch and store from/to DRAM. The CPU is just doing nothing and waiting
> for the data from memory.
>
> Benchmarks on ARM Cortex-A53 @1.15GHz (PINE64 board):
>   == test with PIXMAN_DISABLE=arm-neon (memcpy from glibc) ==
>            src_8888_8888 =  L1:1127.00  L2: 924.24  M:215.66
>
>   == 64-bit implementation ==
>            src_8888_8888 =  L1: 523.44  L2: 451.15  M:156.74
>           over_8888_8888 =  L1: 200.53  L2: 181.26  M: 89.01
>             src_n_8_8888 =  L1: 329.63  L2: 332.80  M:230.29
>
>   == 32-bit implementation ==
>            src_8888_8888 =  L1: 558.72  L2: 491.83  M:236.01
>           over_8888_8888 =  L1: 221.18  L2: 212.86  M:144.39
>             src_n_8_8888 =  L1: 367.30  L2: 399.11  M:292.95
>
>
> The 'src_8888_8888' fast path is a simple copy operation,
> which is implemented using memcpy by default. And
> setting the PIXMAN_DISABLE=arm-neon environment variable
> can disable the NEON code for testing purposes. Based on the
> benchmark numbers above, it looks like the memcpy is much
> faster than the 64-bit NEON code in all cases and slightly
> slower than the 32-bit NEON code for the 'M:' benchmark.
>
> The old 32-bit processors used to have a lot of problems
> with memory performance. They only supported explicit software
> prefetch and needed special tricks to utilize the most out of
> the memory bandwidth. Some explanations for the Cortex-A8
> prefetch behaviour are available in the code comments here:
>     https://cgit.freedesktop.org/pixman/tree/pixman/pixman-arm-neon-asm.h?id=pixman-0.34.0#n367
>
> The 'advanced' prefetch type is implemented by having some
> branchless ARM code, doing address calculations and running
> the PLD based prefetch some distance ahead of the real
> memory accesses. It can cross scanline boundaries too. And
> it tries to use the LDR instruction instead of PLD in the
> beginning of each new scanline, because Cortex-A8 ignored
> PLD prefetches on TLB misses.
>
> Well, nowadays modern ARM processors got automatic hardware
> prefetchers, which greatly reduce or eliminate the need for
> doing prefetch in software. Here is a test patch for pixman,
> which is disabling software prefetch:
>     https://cgit.freedesktop.org/~siamashka/pixman/commit/?h=20160401-arm64-review&id=085afbc5350269ed43dcbd0562dc762406bd26e0
>
> And we can get some better performance with it:
>
>   == after the patch (no prefetch) ==
>            src_8888_8888 =  L1: 819.05  L2: 659.31  M:212.36
>           over_8888_8888 =  L1: 232.38  L2: 210.32  M:108.26
>             src_n_8_8888 =  L1: 429.21  L2: 426.69  M:300.12
>
> Admittedly, the software prefetch would work better, if it
> used a better prefetch hint ('l1keep' instead of 'l2stream').
> Still the old 32-bit branchless code can't be branchless
> anymore in the 64-bit mode because we lost conditional
> instructions execution support.
>
> So it looks like the prefetch either needs to be dropped in
> the 64-bit code, or we can still experiment with other prefetch
> strategies a bit more and find something that is more suitable.
>
> ==========
>
> The bilinear scaling optimizations need more processing power and
> are limited by the speed of the calculations done by the CPU.
> That's why it is important to remove all the redundant MOV
> instructions for that part of code. Comparing
>     "./lowlevel-blt-bench -b src_8888_888"
> results between the 32-bit and 64-bit implementations:
>
>   == 64-bit NEON code ==
>     bilinear src_8888_8888 =  L1:  53.32  L2:  52.25  M: 50.97
>
>   == 32-bit NEON code ==
>     bilinear src_8888_8888 =  L1:  78.85  L2:  77.69  M: 75.86
>
> As the numbers above show, in your current patch the bilinear
> src_8888_8888 fast path is ~1.5x slower than the old 32-bit
> NEON code. This happens because of a lot of redundant instructions,
> which can be removed:
>     https://cgit.freedesktop.org/~siamashka/pixman/commit/?h=20160401-arm64-review&id=2f8c71416232bb714bb2420440333496b36fbfae
>     https://cgit.freedesktop.org/~siamashka/pixman/commit/?h=20160401-arm64-review&id=76ad1ba645489e6f987da72a7e7f9fa3ef72141c
>
> The redundant MOV instructions can be eliminated by making use
> of UMLAL2/SHRN2 instructions (or any other instructions with
> the '2' suffix in their names).
>
> The redundant ARM shift instructions are also not necessary.
> Moreover, this code can be the same in both 32-bit and 64-bit
> code. I have just submitted a patch for the 32-bit NEON code,
> which can reduce the difference between the 32-bit and 64-bit
> code:
>     https://lists.freedesktop.org/archives/pixman/2016-April/004489.html
>
> Applying these MOV and shifts removal patches on top of
> your patch makes the bilinear src_8888_8888 approximately
> as fast, as the 32-bit implementation. Benchmarks on
> ARM Cortex-A53 @1.15GHz (PINE64 board):
>   == before the patches ==
>     bilinear src_8888_8888 =  L1:  53.32  L2:  52.25  M: 50.97
>
>   == after the patches ==
>     bilinear src_8888_8888 =  L1:  77.49  L2:  75.12  M: 73.04
>
>   == 32-bit implementation for comparison ==
>     bilinear src_8888_8888 =  L1:  78.85  L2:  77.69  M: 75.86
>
>
> Looks much better. Now below is the link comparing the original
> 32-bit NEON code (+more compatible shifts syntax) with your 64-bit
> implementation (+my extra patches for redundant instructions removal):
>     https://cgit.freedesktop.org/~siamashka/pixman/diff/?h=20160401-arm64-review&id2=e134f2f002cffd0e&id=20160401-arm64-review&context=5&ignorews=0&ss=1
>
> The 'bilinear_interpolate_four_pixels_8888_8888_tail_head' macro
> there shows that we can use exactly the same number of instructions
> in both the 32-bit and 64-bit implementations of bilinear src_8888_8888.
>
> ===========================================
>
> Now let's see what can be done next.
>
> I would suggest to first take the 'pixman-arma64-neon-asm-bilinear.S'
> file out of your initial patch submission in order to reduce its size
> and make the review process more manageable. Such reduced patch can be
> probably small enough to reach the mailing list.
>
> At least right now it makes sense to minimize the differences between
> the 32-bit and 64-bit assembly sources to keep them more in sync. And
> minimizing the difference can be done from both sides. That's what the
>     https://lists.freedesktop.org/archives/pixman/2016-April/004489.html
> patch is trying to do on the 32-bit side.
>
> I also see a lot of label renames in your 64-bit code. Were all of them
> really necessary? Maybe we should also rename the labels in the 32-bit
> assembly source file if such renames are justified?
>
> The 64-bit patch should be probably applied on top of the 32-bit
> assembly source copy, similar to how it is done in my test branch:
>     https://cgit.freedesktop.org/~siamashka/pixman/log/?h=20160401-arm64-review
>
> In the end, we probably want the 64-bit NEON code to be as fast as
> the 32-bit NEON code. The lowlevel-blt-bench can be used to identify
> the most significant differences. The problematic functions can be
> inspected, redundant MOV and/or other instructions removed. Then
> rinse and repeat...
>
> After we are done with 'pixman-arma64-neon-asm.S', we can then
> move to 'pixman-arma64-neon-asm-bilinear.S' and do something
> about it too. I will not go into the details right now (we can
> discuss it later), but this 'pixman-arma64-neon-asm-bilinear.S'
> file is much less important and we probably don't care much
> about optimizing it well.
>
> This post has already got rather long, so I'll stop here for now.
>
> Does anyone have any better plan about how to deal with AArch64 support?
> Comments and suggestions are welcome.
>
> --
> Best regards,
> Siarhei Siamashka