[Pixman] [PATCH] Add support for aarch64 neon optimization

Mon Apr 4 10:32:31 UTC 2016

> But it is much more interesting to compare the performance of this
> patch with the existing 32-bit ARM NEON code. You can build the
> 32-bit tests programs (including lowlevel-blt-bench) using a 32-bit
> ARM crosscompiler either on your desktop PC or on the ARM board:
>
>   ./configure --host=arm-linux-gnueabihf --enable-static-testprogs \
>               --disable-libpng --disable-gtk
>   make

Ok. I succeed to install cross-build toolchain in my Linux PC and
create armeabihf version of pixman.

And attached the benchmark result to the bug ticket. Please find:
https://bugs.freedesktop.org/attachment.cgi?id=122695

At least, I have to catch up with this result on aarch64...

On 4 April 2016 at 19:28, Mizuki Asakura <ed6e117f at gmail.com> wrote:
>> OK, thanks for this information. What I mean is that this should be
>> preferably a separate patch.
>
> I've sent it on the separated patch.
>
> On 3 April 2016 at 20:22, Siarhei Siamashka <siarhei.siamashka at gmail.com> wrote:
>> On Sat, 2 Apr 2016 20:28:44 +0900
>> Mizuki Asakura <ed6e117f at gmail.com> wrote:
>>
>>> Thanks for your reply.
>>>
>>> > The main concern is whether we can get support for all these
>>> > assembly syntax flavours, calling conventions and minor instructions
>>> > differences, while keeping the code maintainable.
>>>
>>> In the aarch64 point of view, I believe we should only support ARM's original
>>> syntax and calling convensions. It would be most portable.
>>
>> The ILP32 is an alternative ABI, which is being endorsed or even
>> developed by ARM:
>>     http://infocenter.arm.com/help/topic/com.arm.doc.dai0490a/index.html
>>
>> When/if somebody tries to use pixman in an AArch64 ILP32 system, this
>> may cause some difficulties/inconveniences for us.
>>
>>> In iOS case, they may be able to create some converter script that only
>>> translate instruction name to their own since the architcture is same
>>> (as opposed aarch32-neon vs aarch64-neon is (almost) same, but different).
>>
>> Yes. The converters exists, but AFAIK the pixman assembly code is
>> way too difficult for them to parse. Again, the problem is technically
>> solvable, just nobody has invested efforts into solving it yet. Either
>> by contributing the needed code or by sponsoring this work.
>>
>> I mentioned this only because the unified 32-bit/64-bit ARM assembly
>> preprocessor could also potentially take care of this iOS issue as a
>> side effect.
>>
>>> # How about SSE2 on OS X ? Is it same syntax and calling convension ?
>>
>> I'm not familiar with OS X myself. But the SSE2 code in pixman is
>> currently using intrinsics, so this makes everything quite simple
>> as far as portability is concerned.
>>
>> If we were to use hand tuned assembly optimizations for x86, then we
>> would have to take care of the 32-bit Linux/Windows, 64-bit Linux with
>> x32 ABI, 64-bit Linux and 64-bit Windows cases separately. Windows
>> arrived late to the x86-64 scene, but they still decided to invent
>> their own incompatible calling conventions. All of this is fine for
>> intrinsics, but may clearly cause problems for assembly implementations.
>>
>> Hand written assembly has a relatively high maintenance cost. But it
>> pays off if we want to have good performance.
>>
>> I still see a large potential for improving pixman performance on
>> x86 (replace intrinsics with hand written assembly and make a better
>> use of SSSE3, AVX, etc.). But somebody has to shoulder the expenses
>> for this to happen. BTW, I would be happy to take this task if
>> somebody is willing to pay :-)
>>
>>> >      /* stride is always multiple of 32bit units in pixman */
>>> > -    uint32_t byte_stride = stride * sizeof(uint32_t);
>>> > +    int32_t byte_stride = stride * sizeof(uint32_t);
>>>
>>> It caused horribly crashes in aarch64. Why it works on aarch32 ? :)
>>> (32-bit integer overflow may help them...)
>>
>> OK, thanks for this information. What I mean is that this should be
>> preferably a separate patch.
>>
>>> > I'm currently investigating the possibility of maybe "inventing" a
>>> > common unified syntax for some preprocessor script to take care of
>>> > both 32-bit and 64-bit ARM code. But we still can't be sure if this
>>> > approach is practical (or even doable) and whether it can solve more
>>> > problems than introduce new ones...
>>>
>>> At first, I've also tried to create a "converter script" from
>>> pixman-arm-neon-asm*.S to
>>> aarch64 neon codes just translating instruction, register names and so on.
>>
>> I see. It is nice to know that we basically came up with the same
>> solution at least as the starting point :-)
>>
>>> But I've found that this script won't work on pixman's code because of:
>>>
>>> * conflicting register names: both Qn and Dn are used in same place
>>> * highly optimized for aarch32-neon register structure. assumes d29,
>>> d30 is low / high of q15
>>
>> It's all mostly just a syntax sugar. The Dn registers still do exist
>> in a somewhat hidden way. For example:
>>
>>     vshrn.u32   d0,    q15,    #1
>>     vshrn.u32   d1,    q15,    #1
>>
>> directly translates into
>>
>>     shrn        v0.4h, v15.4s, #1
>>     shrn2       v0.8h, v15.4s, #1
>>
>> The selection of SHRN or SHRN2 instruction variant decides whether
>> the lower or upper half of the 128-bit destination register is updated.
>>
>> The 64-bit source operands (Dn registers) are handled in the very
>> same way:
>>
>>     vmull.u16   q1,    d0,    d0
>>     vmull.u16   q2,    d1,    d1
>>
>> directly translates into
>>
>>     umull       v1.4s, v0.4h, v0.4h
>>     umull2      v2.4s, v0.8h, v0.8h
>>
>> There are of course some limitations. Mixing odd and even
>> numbered d-registers as the source operands is not supported
>> in AArch64. This is a bit inconvenient, but the AArch64 code
>> still needs to take this into account either way regardless
>> of the syntax.
>>
>> I'm considering to add something like the following 'umull12'
>> fake instruction:
>>
>>     umull12     v1.4s, v0.4h, v0.8h
>>
>> which directly translates into
>>
>>     vmull.u16   q1,    d0,    d1
>>
>> So that we can write 32-bit assembly code in the AArch64 style.
>> And use then use conditional compilation (#ifdef __aarch64__)
>> for a few parts of code where we really need to have separate
>> code paths for 32-bit and 64-bit mode.
>>
>> And such fake instructions are easy to grep in the source code
>> if we do automatic 32-bit -> 64-bit conversion.
>>
>>
>> The most challenging differences are the VZIP/VUZP instructions
>> and also structure loads.
>>
>> Frankly speaking, I like the VZIP/VUZP behaviour change in AArch64.
>> The 32-bit way of handling them was a 'destructive' update of the
>> registers in-place. And for software pipelining we want to have
>> a 'non-destructive' form to move the intermediate results between
>> registers instead of updating them in-place. Fortunately pixman
>> does not use these instructions too much. They are just used for
>> conversion between 'planar' and 'packed' form via a sequence of
>> 4 instructions as explained at:
>>     https://cgit.freedesktop.org/pixman/tree/pixman/pixman-arm-neon-asm.S?id=pixman-0.34.0#n128
>>
>> This sequence of 4 instructions needs an AArch64 equivalent. Now we
>> have macros 'pixdeinterleave' and 'pixinterleave' macros:
>>    https://cgit.freedesktop.org/pixman/tree/pixman/pixman-arm-neon-asm.h?id=pixman-0.34.0#n347
>>
>> For the sake of allowing better instructions scheduling, we may
>> split them into 4 separate macros (and abstract the 32-bit/64-bit
>> differences) via something like this:
>>    pixdeinterleave_step1 ...
>>    pixdeinterleave_step2 ...
>>    pixdeinterleave_step3 ...
>>    pixdeinterleave_step4 ...
>> Then these different steps can be interleaved with other
>> instructions in the code.
>>
>> The structure load/store instructions in AArch64 handle low
>> 64-bit register halves (even numbered Dn registers). And in
>> 32-bit mode, the Dn registers are contiguous. This introduces
>> the need to re-allocate the registers a bit between the
>> 32-bit and 64-bit modes. This is actually the biggest challenge
>> for the unified syntax converter. If I can get this resolved
>> in a simple and clean way, then the job is basically done.
>>
>>
>>> * assumes integer register is 32-bit width implicitly
>>
>> All the integer registers in the NEON assembly code have named aliases.
>> So the 32-bit and 64-bit differences are already fully abstracted. We
>> have already solved this particular problem long before the AArch64
>> even came to existence :-)
>>
>>> Once the script run correctly and generate something, I have to modify
>>> all of the result.
>>> So I concluded asm sources should be separated.
>>
>> Yes, if fully automated conversion does not work yet, then at least
>> partially automated conversion makes sense. I have noticed that even
>> the code enclosed under "#if 0" has also changed, so I already
>> suspected that you used some scripts.
>>
>>> And I'm also afraid that, to generate aarch64 codes, original
>>> pixman-arm-neon-asm*.S
>>> may be modified, and it cause regressions to existing aarch32 worlds.
>>
>> Yes, the changes may be needed. But I don't expect them to be too
>> disruptive.
>>
>>> The other point of view,
>>> aarch32-neon is "fixed" because ARM won't add new instructions for aarch32
>>> and existing pixman-arm-neon-asm*.S would be completed and stable. So we should
>>> leave it as is.
>>> Then aarch64 may continue to add new features and we should follow
>>> them. So I think
>>> aarch32-neon and aarch64-neon should be separated.
>>> (Cortex-A32 ? Hmm...)
>>
>> There are still a lot of 32-bit ARM processors in use. And the
>> current 32-bit assembly code is far from complete. We can still
>> improve it. And will probably do this.
>>
>>> > So it looks like the prefetch either needs to be dropped in
>>> > the 64-bit code, or we can still experiment with other prefetch
>>> > strategies a bit more and find something that is more suitable.
>>>
>>> It's a point that I cannot decide that, should we use prefetch (none
>>> or advance),
>>> which memory to use (L1, L2, L3). Can we use L1 for this purpose ?
>>> All of aarch64 have L2 cache ?
>>> I think it should be configurable for each target CPUs.
>>
>> The pldl2strm prefetch modifier appears to be harmful for Cortex-A53.
>> The pldl1keep works better. You can try to run
>>
>>     https://github.com/ssvb/tinymembench
>>
>> on your device to get some memory related benchmark numbers (and
>> probably share this information with us). On the PINE64 board I
>> get this:
>>
>>     https://github.com/ssvb/tinymembench/wiki/PINE64-(Allwinner-A64)
>>
>> The benchmark should show which prefetch method works better. I have
>> tried to design the code so that we can support different prefetch
>> methods relatively easily. The advanced prefetch is probably a bad
>> fit for AArch64 because the condition execution is not supported
>> anymore. We can also replace the 'advanced' prefetch with some other
>> implementation on AArch64. You can try the simple-distance-ahead
>> and same-pixel-on-next-scanline prefetch strategies.
>>
>>> > + umlsl v0.4s, v2.4h, v15.h[0]
>>> > + umlal2 v0.4s, v2.8h, v15.h[0]
>>>
>>> !!!
>>> I don't know the syntax. Yes, it can reduce many unnecessaly register movings.
>>
>> The ARMv8 architecture manual has documentation for these instructions.
>> As for the GNU assembler syntax, I had to look into binutils sources in
>> order to figure out what would be the appropriate form.
>>
>>> > Moreover, this code can be the same in both 32-bit and 64-bit
>>> > code. I have just submitted a patch for the 32-bit NEON code,
>>> > which can reduce the difference between the 32-bit and 64-bit
>>> > code:
>>> >     https://lists.freedesktop.org/archives/pixman/2016-April/004489.html
>>>
>>> Yes, these "common" and "no-side-effect (to aarch32)" modification would be OK.
>>
>> Thanks! Can I have your Acked-by: or Reviewed-by: for this patch?
>>
>>> But, I'm afraid again, a patch to gain aarch64 performance, make regression
>>> for aarch32 must not be applied.
>>
>> Don't worry, we are not going to regress the AArch32 support.
>>
>>> > I would suggest to first take the 'pixman-arma64-neon-asm-bilinear.S'
>>> > file out of your initial patch submission in order to reduce its size
>>> > and make the review process more manageable. Such reduced patch can be
>>> > probably small enough to reach the mailing list.
>>>
>>> OK. I agree.
>>> At first, we should start minimal set of codes.
>>
>> Well, I don't want to make it more difficult than necessary. Submitting
>> the whole pixman-arm-neon-asm.S file in one go is fine for me. You can
>> squash my patches into your code and simply mention this in the commit
>> message with a reference to the mailing list archive. You should
>> probably also add your own copyright notice to the new assembly files.
>>
>> If you have difficulties getting rid of some redundant instructions,
>> just ask and I will try to help. We really want to have a performance
>> parity between 32-bit and 64-bit assembly code. Because this is a
>> deciding factor for people when they are considering to switch from
>> the 32-bit to 64-bit userland. The Raspberry Pi 3 is yet to make this
>> decision. And I don't want to see people starting to make claims that
>> the 64-bit mode is slow :-)
>>
>> If we don't get rid of the redundant instructions right now, then
>> 32-bit and 64-bit assembly sources may diverge and it will become
>> much more difficult to do this job later.
>>
>>> I'll also planned to disable "nearest" codes, is it OK ? (or anyone use
>>> it ? How about "0565" ?)
>>
>> It is a part of the functionality, provided by the pixman API. There is
>> no reason to drop nearest scaling or rgb565 support. I would say that we
>> should still have these optimizations available for AArch64.
>>
>> As I said above, if some of the rgb565 code is difficult to make as
>> efficient as the 32-bit variant, then I can try to help. Just fix
>> whatever you can.
>>
>>> > I also see a lot of label renames in your 64-bit code. Were all of them
>>> > really necessary? Maybe we should also rename the labels in the 32-bit
>>> > assembly source file if such renames are justified?
>>>
>>> I've met some compiler (gnu as) bug.
>>> In nested macro, "gnu-as" failed to deternine the label and it causes
>>> infinite loop.
>>
>> It might be nice to report this bug to binutils (if you haven't done
>> this yet).
>>
>>> Since aarch64 have removed many "conditional" syntax, I have to add many
>>> branch instructions especially for prefetching. So the bug occurs (or
>>> I've failed
>>> to follow the gnu-as rule).
>>> I don't want to be annoyed such difficult-to-find behavior, all label names were
>>> renamed not to cause confliction.
>>
>> We still can do the same labels rename in the 32-bit assembly source.
>> This will allow us to reduce the difference between 32-bit and 64-bit
>> assembly sources.
>>
>>> Anyway, I'll post new (simplified) patch again.
>>
>> Thanks. But maybe first try to remove as many redundant instructions
>> as possible. You can create a repository at github and temporarily
>> push the preliminary fixes there. Send a cleaned version to the mailing
>> list when you think that it is ready or substantially improved.
>>
>> You can use the http://meldmerge.org/ tool to compare the differences
>> between the 32-bit assembly source and the 64-bit assembly source on
>> your computer locally. And also edit the source code directly in it.
>>
>> --
>> Best regards,
>> Siarhei Siamashka