[Pixman] [PATCH] Add support for aarch64 neon optimization

Sun Apr 3 11:22:14 UTC 2016

On Sat, 2 Apr 2016 20:28:44 +0900
Mizuki Asakura <ed6e117f at gmail.com> wrote:

> Thanks for your reply.
> 
> > The main concern is whether we can get support for all these
> > assembly syntax flavours, calling conventions and minor instructions
> > differences, while keeping the code maintainable.  
> 
> In the aarch64 point of view, I believe we should only support ARM's original
> syntax and calling convensions. It would be most portable.

The ILP32 is an alternative ABI, which is being endorsed or even
developed by ARM:
    http://infocenter.arm.com/help/topic/com.arm.doc.dai0490a/index.html 

When/if somebody tries to use pixman in an AArch64 ILP32 system, this
may cause some difficulties/inconveniences for us.

> In iOS case, they may be able to create some converter script that only
> translate instruction name to their own since the architcture is same
> (as opposed aarch32-neon vs aarch64-neon is (almost) same, but different).

Yes. The converters exists, but AFAIK the pixman assembly code is
way too difficult for them to parse. Again, the problem is technically
solvable, just nobody has invested efforts into solving it yet. Either
by contributing the needed code or by sponsoring this work.

I mentioned this only because the unified 32-bit/64-bit ARM assembly
preprocessor could also potentially take care of this iOS issue as a
side effect.

> # How about SSE2 on OS X ? Is it same syntax and calling convension ?

I'm not familiar with OS X myself. But the SSE2 code in pixman is
currently using intrinsics, so this makes everything quite simple
as far as portability is concerned.

If we were to use hand tuned assembly optimizations for x86, then we
would have to take care of the 32-bit Linux/Windows, 64-bit Linux with
x32 ABI, 64-bit Linux and 64-bit Windows cases separately. Windows
arrived late to the x86-64 scene, but they still decided to invent
their own incompatible calling conventions. All of this is fine for
intrinsics, but may clearly cause problems for assembly implementations.

Hand written assembly has a relatively high maintenance cost. But it
pays off if we want to have good performance.

I still see a large potential for improving pixman performance on
x86 (replace intrinsics with hand written assembly and make a better
use of SSSE3, AVX, etc.). But somebody has to shoulder the expenses
for this to happen. BTW, I would be happy to take this task if
somebody is willing to pay :-)

> >      /* stride is always multiple of 32bit units in pixman */
> > -    uint32_t byte_stride = stride * sizeof(uint32_t);
> > +    int32_t byte_stride = stride * sizeof(uint32_t);  
> 
> It caused horribly crashes in aarch64. Why it works on aarch32 ? :)
> (32-bit integer overflow may help them...)

OK, thanks for this information. What I mean is that this should be
preferably a separate patch.

> > I'm currently investigating the possibility of maybe "inventing" a
> > common unified syntax for some preprocessor script to take care of
> > both 32-bit and 64-bit ARM code. But we still can't be sure if this
> > approach is practical (or even doable) and whether it can solve more
> > problems than introduce new ones...  
> 
> At first, I've also tried to create a "converter script" from
> pixman-arm-neon-asm*.S to
> aarch64 neon codes just translating instruction, register names and so on.

I see. It is nice to know that we basically came up with the same
solution at least as the starting point :-)

> But I've found that this script won't work on pixman's code because of:
> 
> * conflicting register names: both Qn and Dn are used in same place
> * highly optimized for aarch32-neon register structure. assumes d29,
> d30 is low / high of q15

It's all mostly just a syntax sugar. The Dn registers still do exist
in a somewhat hidden way. For example:

    vshrn.u32   d0,    q15,    #1
    vshrn.u32   d1,    q15,    #1

directly translates into

    shrn        v0.4h, v15.4s, #1
    shrn2       v0.8h, v15.4s, #1

The selection of SHRN or SHRN2 instruction variant decides whether
the lower or upper half of the 128-bit destination register is updated.

The 64-bit source operands (Dn registers) are handled in the very
same way:

    vmull.u16   q1,    d0,    d0
    vmull.u16   q2,    d1,    d1

directly translates into

    umull       v1.4s, v0.4h, v0.4h
    umull2      v2.4s, v0.8h, v0.8h

There are of course some limitations. Mixing odd and even
numbered d-registers as the source operands is not supported
in AArch64. This is a bit inconvenient, but the AArch64 code
still needs to take this into account either way regardless
of the syntax.

I'm considering to add something like the following 'umull12'
fake instruction:

    umull12     v1.4s, v0.4h, v0.8h

which directly translates into

    vmull.u16   q1,    d0,    d1

So that we can write 32-bit assembly code in the AArch64 style.
And use then use conditional compilation (#ifdef __aarch64__)
for a few parts of code where we really need to have separate
code paths for 32-bit and 64-bit mode.

And such fake instructions are easy to grep in the source code
if we do automatic 32-bit -> 64-bit conversion.

The most challenging differences are the VZIP/VUZP instructions
and also structure loads.

Frankly speaking, I like the VZIP/VUZP behaviour change in AArch64.
The 32-bit way of handling them was a 'destructive' update of the
registers in-place. And for software pipelining we want to have
a 'non-destructive' form to move the intermediate results between
registers instead of updating them in-place. Fortunately pixman
does not use these instructions too much. They are just used for
conversion between 'planar' and 'packed' form via a sequence of
4 instructions as explained at:
    https://cgit.freedesktop.org/pixman/tree/pixman/pixman-arm-neon-asm.S?id=pixman-0.34.0#n128

This sequence of 4 instructions needs an AArch64 equivalent. Now we
have macros 'pixdeinterleave' and 'pixinterleave' macros:
   https://cgit.freedesktop.org/pixman/tree/pixman/pixman-arm-neon-asm.h?id=pixman-0.34.0#n347

For the sake of allowing better instructions scheduling, we may
split them into 4 separate macros (and abstract the 32-bit/64-bit
differences) via something like this:
   pixdeinterleave_step1 ...
   pixdeinterleave_step2 ...
   pixdeinterleave_step3 ...
   pixdeinterleave_step4 ...
Then these different steps can be interleaved with other
instructions in the code.

The structure load/store instructions in AArch64 handle low
64-bit register halves (even numbered Dn registers). And in
32-bit mode, the Dn registers are contiguous. This introduces
the need to re-allocate the registers a bit between the
32-bit and 64-bit modes. This is actually the biggest challenge
for the unified syntax converter. If I can get this resolved
in a simple and clean way, then the job is basically done.

> * assumes integer register is 32-bit width implicitly

All the integer registers in the NEON assembly code have named aliases.
So the 32-bit and 64-bit differences are already fully abstracted. We
have already solved this particular problem long before the AArch64
even came to existence :-)

> Once the script run correctly and generate something, I have to modify
> all of the result.
> So I concluded asm sources should be separated.

Yes, if fully automated conversion does not work yet, then at least
partially automated conversion makes sense. I have noticed that even
the code enclosed under "#if 0" has also changed, so I already
suspected that you used some scripts.

> And I'm also afraid that, to generate aarch64 codes, original
> pixman-arm-neon-asm*.S
> may be modified, and it cause regressions to existing aarch32 worlds.

Yes, the changes may be needed. But I don't expect them to be too
disruptive.

> The other point of view,
> aarch32-neon is "fixed" because ARM won't add new instructions for aarch32
> and existing pixman-arm-neon-asm*.S would be completed and stable. So we should
> leave it as is.
> Then aarch64 may continue to add new features and we should follow
> them. So I think
> aarch32-neon and aarch64-neon should be separated.
> (Cortex-A32 ? Hmm...)

There are still a lot of 32-bit ARM processors in use. And the
current 32-bit assembly code is far from complete. We can still
improve it. And will probably do this.

> > So it looks like the prefetch either needs to be dropped in
> > the 64-bit code, or we can still experiment with other prefetch
> > strategies a bit more and find something that is more suitable.  
> 
> It's a point that I cannot decide that, should we use prefetch (none
> or advance),
> which memory to use (L1, L2, L3). Can we use L1 for this purpose ?
> All of aarch64 have L2 cache ?
> I think it should be configurable for each target CPUs.

The pldl2strm prefetch modifier appears to be harmful for Cortex-A53.
The pldl1keep works better. You can try to run

    https://github.com/ssvb/tinymembench

on your device to get some memory related benchmark numbers (and
probably share this information with us). On the PINE64 board I
get this:

    https://github.com/ssvb/tinymembench/wiki/PINE64-(Allwinner-A64)

The benchmark should show which prefetch method works better. I have
tried to design the code so that we can support different prefetch
methods relatively easily. The advanced prefetch is probably a bad
fit for AArch64 because the condition execution is not supported
anymore. We can also replace the 'advanced' prefetch with some other
implementation on AArch64. You can try the simple-distance-ahead
and same-pixel-on-next-scanline prefetch strategies.

> > + umlsl v0.4s, v2.4h, v15.h[0]
> > + umlal2 v0.4s, v2.8h, v15.h[0]  
> 
> !!!
> I don't know the syntax. Yes, it can reduce many unnecessaly register movings.

The ARMv8 architecture manual has documentation for these instructions.
As for the GNU assembler syntax, I had to look into binutils sources in
order to figure out what would be the appropriate form.

> > Moreover, this code can be the same in both 32-bit and 64-bit
> > code. I have just submitted a patch for the 32-bit NEON code,
> > which can reduce the difference between the 32-bit and 64-bit
> > code:
> >     https://lists.freedesktop.org/archives/pixman/2016-April/004489.html  
> 
> Yes, these "common" and "no-side-effect (to aarch32)" modification would be OK.

Thanks! Can I have your Acked-by: or Reviewed-by: for this patch?

> But, I'm afraid again, a patch to gain aarch64 performance, make regression
> for aarch32 must not be applied.

Don't worry, we are not going to regress the AArch32 support.

> > I would suggest to first take the 'pixman-arma64-neon-asm-bilinear.S'
> > file out of your initial patch submission in order to reduce its size
> > and make the review process more manageable. Such reduced patch can be
> > probably small enough to reach the mailing list.  
> 
> OK. I agree.
> At first, we should start minimal set of codes.

Well, I don't want to make it more difficult than necessary. Submitting
the whole pixman-arm-neon-asm.S file in one go is fine for me. You can
squash my patches into your code and simply mention this in the commit
message with a reference to the mailing list archive. You should
probably also add your own copyright notice to the new assembly files.

If you have difficulties getting rid of some redundant instructions,
just ask and I will try to help. We really want to have a performance
parity between 32-bit and 64-bit assembly code. Because this is a
deciding factor for people when they are considering to switch from
the 32-bit to 64-bit userland. The Raspberry Pi 3 is yet to make this
decision. And I don't want to see people starting to make claims that
the 64-bit mode is slow :-)

If we don't get rid of the redundant instructions right now, then
32-bit and 64-bit assembly sources may diverge and it will become
much more difficult to do this job later.

> I'll also planned to disable "nearest" codes, is it OK ? (or anyone use
> it ? How about "0565" ?)

It is a part of the functionality, provided by the pixman API. There is
no reason to drop nearest scaling or rgb565 support. I would say that we
should still have these optimizations available for AArch64.

As I said above, if some of the rgb565 code is difficult to make as
efficient as the 32-bit variant, then I can try to help. Just fix
whatever you can.

> > I also see a lot of label renames in your 64-bit code. Were all of them
> > really necessary? Maybe we should also rename the labels in the 32-bit
> > assembly source file if such renames are justified?  
> 
> I've met some compiler (gnu as) bug.
> In nested macro, "gnu-as" failed to deternine the label and it causes
> infinite loop.

It might be nice to report this bug to binutils (if you haven't done
this yet).

> Since aarch64 have removed many "conditional" syntax, I have to add many
> branch instructions especially for prefetching. So the bug occurs (or
> I've failed
> to follow the gnu-as rule).
> I don't want to be annoyed such difficult-to-find behavior, all label names were
> renamed not to cause confliction.

We still can do the same labels rename in the 32-bit assembly source.
This will allow us to reduce the difference between 32-bit and 64-bit
assembly sources.

> Anyway, I'll post new (simplified) patch again.

Thanks. But maybe first try to remove as many redundant instructions
as possible. You can create a repository at github and temporarily
push the preliminary fixes there. Send a cleaned version to the mailing
list when you think that it is ready or substantially improved.

You can use the http://meldmerge.org/ tool to compare the differences
between the 32-bit assembly source and the 64-bit assembly source on
your computer locally. And also edit the source code directly in it.

-- 
Best regards,
Siarhei Siamashka