[Pixman] [PATCH] Add support for aarch64 neon optimization

Thu Apr 7 10:45:03 UTC 2016

> Do you have a more specific example of a code fragment that needs
> conversion?

In the original pixman-arm-neon-asm.S:

.macro pixman_composite_over_8888_8_0565_process_pixblock_head
...
vsli.u16    q2,  q2, #5
...
vraddhn.u16 d2,  q6,  q10
...
vshrn.u16   d30, q2, #2

If every register is simply renamed to the Vn with the same number, it becomes:

.macro pixman_composite_over_8888_8_0565_process_pixblock_head
...
sli    v2.8h,  v2.8h, #5
...
raddhn v2.8b,  v6.8h,  v10.8h
...
shrn   v30.8b, v2.8h, #2

The raddhn (the second instruction) overwrites v2, so the following
shrn v30.8b, v2.8h, #2 no longer reads the value it expects.

I have run into many other conflicts like this.
I could not find anything in ARM's documentation saying that
Dn can be the lower half of V(n/2).
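
For comparison, the mapping suggested in the quoted message below can be
sketched in Python. This is only an illustration, not part of the patch:
the helper names (parse, halves, overlaps, to_aarch64) are made up, and it
assumes the AArch32 layout where d(2n) and d(2n+1) are the low and high
halves of q(n), with each qN kept in vN on AArch64.

```python
def parse(reg):
    """Split an AArch32 NEON register name such as 'q2' or 'd30'."""
    kind, index = reg[0], int(reg[1:])
    limit = {"q": 16, "d": 32}.get(kind)
    if limit is None or not 0 <= index < limit:
        raise ValueError(f"not an AArch32 NEON q/d register: {reg}")
    return kind, index


def halves(reg):
    """Return the 64-bit halves (numbered like AArch32 d registers) covered by reg."""
    kind, index = parse(reg)
    return {2 * index, 2 * index + 1} if kind == "q" else {index}


def overlaps(a, b):
    """True if two AArch32 register names share storage."""
    return bool(halves(a) & halves(b))


def to_aarch64(reg):
    """AArch64 operand holding the same bits, assuming qN is kept in vN."""
    kind, index = parse(reg)
    if kind == "q":
        return f"v{index}.16b"
    if index % 2 == 0:
        # Low half of q(index / 2): usable as a 64-bit vector operand.
        return f"v{index // 2}.8b"
    # High half: AArch64 has no 64-bit vector name for it, only an element.
    return f"v{index // 2}.d[1]"


if __name__ == "__main__":
    for name in ("q2", "d2", "d30"):
        print(name, "->", to_aarch64(name))
    print("q2 overlaps d2:", overlaps("q2", "d2"))
```

Under these assumptions q2 and d2 do not share storage in the 32-bit code,
and d2 and d30 come out as v1.8b and v15.8b. Odd-numbered d registers have
no 64-bit vector name of their own in AArch64, so they need separate
handling rather than a plain rename.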

On 7 April 2016 at 16:31, Siarhei Siamashka <siarhei.siamashka at gmail.com> wrote:
> On Tue, 5 Apr 2016 20:20:54 +0900
> Mizuki Asakura <ed6e117f at gmail.com> wrote:
>
>> > This code is not just there for prefetching. It is an example of
>> > using software pipelining:
>>
>> OK. I understand.
>> But the code is very hard to maintain... I've run into too many register
>> conflicts.
>
> The *_tail_head variant has exactly the same code as the individual
> *_head and *_tail macros, but the instructions are just reordered.
> There should be no additional register clashes if you are doing the
> exact 1-to-1 conversion.
>
> If you are considering modifying the algorithm, then for now it's better
> not to touch these problematic parts of code and keep them the way
> they were in your first patch.
>
>> # q2 and d2 were used in the same sequence. That cannot exist in aarch64-neon.
>
> Why not? The registers mapping is something like this:
>    q2 -> v2.16b
>    d2 -> v1.8b  (because d2 is the lower 64-bit half of the 128-bit q1 register)
>
> Do you have a more specific example of a code fragment that needs
> conversion?
>
>> Anyway, I'll try to remove unnecessary register copies as you've suggested.
>> After that, I'll also try to run benchmarks comparing
>> * advance vs none
>> * L1 / L2 / L3 (Cortex-A53 doesn't have), keep / strm
>> to find the better configuration.
>
> OK, thanks. Just don't overwork yourself. What I suggested was only
> fixing a few very obvious and trivial things in the next revision of
> your patch. Then I could have a look at what is remaining and maybe
> have some ideas about how to fix it (or maybe not).
>
> The benchmark against the 32-bit code is useful for prioritizing
> this work (pay more attention to the things that have slowed down
> the most).
>
> But I think that your patch is already almost good enough. And it is
> definitely very useful for the users of AArch64 hardware, so we
> probably want to have it applied and released as soon as possible.
>
>> But that is only a result for the Cortex-A53 (which you and I have). Can anyone
>> test on other (expensive :) aarch64 environments?
>> (Cortex-Axx, Apple Ax, NVidia Denver, etc, etc...)
>
> I have ssh access to Cavium ThunderX and APM X-Gene. Both of these
> are "server" ARM processors and taking care of graphics/multimedia is
> not their primary task.
>
> The ThunderX is a 48-core processor with very small and simple
> individual cores, optimized for reducing the number of transistors.
> It does not even implement the 32-bit mode at all. So we can't
> compare the performance of the 32-bit and the 64-bit pixman code
> on it. ThunderX has a 64-bit NEON data path. Moreover, it has a
> particularly bad microcoded implementation of some NEON instructions,
> for example the TBL instruction needs 320 (!) cycles:
>     https://gcc.gnu.org/ml/gcc-patches/2015-06/msg01676.html
>
> The X-Gene is a reasonably fast out-of-order processor with a wide
> instruction decoder, which runs ordinary code well.
> However it also only has a 64-bit NEON data path.
>
> A low-power but more multimedia-oriented Cortex-A53 with a full
> 128-bit NEON data path is faster than either of these when running
> NEON code.
>
> I can try to run the lowlevel-blt-bench on X-Gene and provide the
> 32-bit and 64-bit logs. However I'm not the only user of that
> machine and running the benchmark undisturbed may be problematic.
>
> Either way, we are very likely just going to see that reducing the
> number of redundant instructions has a positive impact on performance,
> in much the same way as on Cortex-A53.
>
> --
> Best regards,
> Siarhei Siamashka