[Pixman] [PATCH] Add support for aarch64 neon optimization

Thu Apr 7 07:31:02 UTC 2016

On Tue, 5 Apr 2016 20:20:54 +0900
Mizuki Asakura <ed6e117f at gmail.com> wrote:

> > This code is not just there for prefetching. It is an example of
> > using software pipelining:  
> 
> OK. I understand.
> But the code is very hard to maintain... I've run into too many register
> conflicts.

The *_tail_head variant has exactly the same code as the individual
*_head and *_tail macros, but the instructions are just reordered.
There should be no additional register clashes if you are doing the
exact 1-to-1 conversion.
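
To illustrate why the reordering by itself should not create new
clashes, here is a minimal sketch in Python (not pixman code). The
head() and tail() functions and their arithmetic are hypothetical
stand-ins for the *_head and *_tail macros; only the loop structure
matters:

```python
def head(pixel):
    # Hypothetical first stage (stand-in for a *_head macro).
    return pixel * 2

def tail(partial):
    # Hypothetical second stage (stand-in for a *_tail macro).
    return partial + 1

def process_simple(pixels):
    # Straightforward loop: head immediately followed by tail.
    out = []
    for p in pixels:
        out.append(tail(head(p)))
    return out

def process_pipelined(pixels):
    # Software-pipelined loop: the body plays the role of *_tail_head.
    out = []
    if not pixels:
        return out
    partial = head(pixels[0])      # prologue: head only
    for p in pixels[1:]:
        # tail of the previous block, then head of the next block.
        # In the assembly version these two instruction streams are
        # interleaved, but they are the same instructions.
        out.append(tail(partial))
        partial = head(p)
    out.append(tail(partial))      # epilogue: tail only
    return out
```

Both functions perform exactly the same operations on the same
values, only in a different order, so they return identical results.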

If you are considering modifying the algorithm, then for now it's
better not to touch these problematic parts of the code and to keep
them the way they were in your first patch.

> # q2 and d2 were used in the same sequence. That cannot exist in aarch64-neon.

Why not? The register mapping is something like this:
   q2 -> v2.16b
   d2 -> v1.8b  (because d2 is the lower 64-bit half of the 128-bit q1 register)
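
As a rough sketch of that rule in Python (the helper names
a32_to_a64() and d_regs() are made up for this illustration, and the
byte-lane arrangement is assumed, as in the mapping above). In the
32-bit register file qN aliases d(2N) and d(2N+1), so an even dM is
the lower half of v(M/2), and an odd dM is the upper half, which has
no register name of its own in AArch64 and is written here in
element notation:

```python
def a32_to_a64(reg):
    # Translate a 32-bit NEON register name ("qN" or "dN") into the
    # AArch64 vector register that holds the same data.
    kind, index = reg[0], int(reg[1:])
    if kind == "q":
        return f"v{index}.16b"
    if kind == "d":
        v = index // 2
        if index % 2 == 0:
            return f"v{v}.8b"      # lower 64 bits of vN
        return f"v{v}.d[1]"        # upper 64 bits, element notation
    raise ValueError(f"unsupported register name: {reg}")

def d_regs(reg):
    # Set of 32-bit D register indices that a register name covers.
    kind, index = reg[0], int(reg[1:])
    if kind == "q":
        return {2 * index, 2 * index + 1}
    if kind == "d":
        return {index}
    raise ValueError(f"unsupported register name: {reg}")

def overlaps(a, b):
    # True if two 32-bit NEON register names alias the same storage.
    return bool(d_regs(a) & d_regs(b))
```

With these helpers, overlaps("q2", "d2") is False (q2 covers d4 and
d5), while overlaps("q1", "d2") is True.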

Do you have a more specific example of a code fragment that needs
conversion?

> Anyway, I'll try to remove unnecessary register copies as you've suggested.
> After that, I'll also try to make benchmarks that
> * advance vs none
> * L1 / L2 / L3 (Cortex-A53 doesn't have), keep / strm
> to find the better configuration.

OK, thanks. Just don't overwork yourself. What I suggested was only
fixing a few very obvious and trivial things in the next revision of
your patch. Then I could have a look at what is remaining and maybe
have some ideas about how to fix it (or maybe not).

The benchmark against the 32-bit code is useful for prioritizing
this work (pay more attention to the things that have slowed down
the most).

But I think that your patch is already almost good enough. And it is
definitely very useful for the users of AArch64 hardware, so we
probably want to have it applied and released as soon as possible.

> But it is only a result of Cortex-A53 (that you and I have). Can anyone
> test other (expensive :) aarch64 environments?
> (Cortex-Axx, Apple Ax, NVidia Denver, etc, etc...)

I have ssh access to Cavium ThunderX and APM X-Gene. Both of these
are "server" ARM processors, and taking care of graphics/multimedia is
not their primary task.

The ThunderX is a 48-core processor with very small and simple
individual cores, optimized for reducing the number of transistors.
It does not even implement the 32-bit mode at all. So we can't
compare the performance of the 32-bit and the 64-bit pixman code
on it. ThunderX has a 64-bit NEON data path. Moreover, it has a
particularly bad microcoded implementation of some NEON instructions,
for example the TBL instruction needs 320 (!) cycles:
    https://gcc.gnu.org/ml/gcc-patches/2015-06/msg01676.html

The X-Gene is a reasonably fast out-of-order processor with a wide
instruction decoder, which can run normal code reasonably fast.
However, it also only has a 64-bit NEON data path.

A low-power but more multimedia-oriented Cortex-A53 with a full
128-bit NEON data path is faster than either of these when running
NEON code.

I can try to run lowlevel-blt-bench on X-Gene and provide the
32-bit and 64-bit logs. However, I'm not the only user of that
machine, and running the benchmark undisturbed may be problematic.

Either way, we are very likely just going to see that reducing the
number of redundant instructions has a positive impact on performance,
in much the same way as on Cortex-A53.

-- 
Best regards,
Siarhei Siamashka