[Pixman] [PATCH] Add support for aarch64 neon optimization

Mon Apr 4 10:28:54 UTC 2016

> OK, thanks for this information. What I mean is that this should be
> preferably a separate patch.

I've sent it on the separated patch.

On 3 April 2016 at 20:22, Siarhei Siamashka <siarhei.siamashka at gmail.com> wrote:
> On Sat, 2 Apr 2016 20:28:44 +0900
> Mizuki Asakura <ed6e117f at gmail.com> wrote:
>
>> Thanks for your reply.
>>
>> > The main concern is whether we can get support for all these
>> > assembly syntax flavours, calling conventions and minor instructions
>> > differences, while keeping the code maintainable.
>>
>> In the aarch64 point of view, I believe we should only support ARM's original
>> syntax and calling convensions. It would be most portable.
>
> The ILP32 is an alternative ABI, which is being endorsed or even
> developed by ARM:
>     http://infocenter.arm.com/help/topic/com.arm.doc.dai0490a/index.html
>
> When/if somebody tries to use pixman in an AArch64 ILP32 system, this
> may cause some difficulties/inconveniences for us.
>
>> In iOS case, they may be able to create some converter script that only
>> translate instruction name to their own since the architcture is same
>> (as opposed aarch32-neon vs aarch64-neon is (almost) same, but different).
>
> Yes. The converters exists, but AFAIK the pixman assembly code is
> way too difficult for them to parse. Again, the problem is technically
> solvable, just nobody has invested efforts into solving it yet. Either
> by contributing the needed code or by sponsoring this work.
>
> I mentioned this only because the unified 32-bit/64-bit ARM assembly
> preprocessor could also potentially take care of this iOS issue as a
> side effect.
>
>> # How about SSE2 on OS X ? Is it same syntax and calling convension ?
>
> I'm not familiar with OS X myself. But the SSE2 code in pixman is
> currently using intrinsics, so this makes everything quite simple
> as far as portability is concerned.
>
> If we were to use hand tuned assembly optimizations for x86, then we
> would have to take care of the 32-bit Linux/Windows, 64-bit Linux with
> x32 ABI, 64-bit Linux and 64-bit Windows cases separately. Windows
> arrived late to the x86-64 scene, but they still decided to invent
> their own incompatible calling conventions. All of this is fine for
> intrinsics, but may clearly cause problems for assembly implementations.
>
> Hand written assembly has a relatively high maintenance cost. But it
> pays off if we want to have good performance.
>
> I still see a large potential for improving pixman performance on
> x86 (replace intrinsics with hand written assembly and make a better
> use of SSSE3, AVX, etc.). But somebody has to shoulder the expenses
> for this to happen. BTW, I would be happy to take this task if
> somebody is willing to pay :-)
>
>> >      /* stride is always multiple of 32bit units in pixman */
>> > -    uint32_t byte_stride = stride * sizeof(uint32_t);
>> > +    int32_t byte_stride = stride * sizeof(uint32_t);
>>
>> It caused horribly crashes in aarch64. Why it works on aarch32 ? :)
>> (32-bit integer overflow may help them...)
>
> OK, thanks for this information. What I mean is that this should be
> preferably a separate patch.
>
>> > I'm currently investigating the possibility of maybe "inventing" a
>> > common unified syntax for some preprocessor script to take care of
>> > both 32-bit and 64-bit ARM code. But we still can't be sure if this
>> > approach is practical (or even doable) and whether it can solve more
>> > problems than introduce new ones...
>>
>> At first, I've also tried to create a "converter script" from
>> pixman-arm-neon-asm*.S to
>> aarch64 neon codes just translating instruction, register names and so on.
>
> I see. It is nice to know that we basically came up with the same
> solution at least as the starting point :-)
>
>> But I've found that this script won't work on pixman's code because of:
>>
>> * conflicting register names: both Qn and Dn are used in same place
>> * highly optimized for aarch32-neon register structure. assumes d29,
>> d30 is low / high of q15
>
> It's all mostly just a syntax sugar. The Dn registers still do exist
> in a somewhat hidden way. For example:
>
>     vshrn.u32   d0,    q15,    #1
>     vshrn.u32   d1,    q15,    #1
>
> directly translates into
>
>     shrn        v0.4h, v15.4s, #1
>     shrn2       v0.8h, v15.4s, #1
>
> The selection of SHRN or SHRN2 instruction variant decides whether
> the lower or upper half of the 128-bit destination register is updated.
>
> The 64-bit source operands (Dn registers) are handled in the very
> same way:
>
>     vmull.u16   q1,    d0,    d0
>     vmull.u16   q2,    d1,    d1
>
> directly translates into
>
>     umull       v1.4s, v0.4h, v0.4h
>     umull2      v2.4s, v0.8h, v0.8h
>
> There are of course some limitations. Mixing odd and even
> numbered d-registers as the source operands is not supported
> in AArch64. This is a bit inconvenient, but the AArch64 code
> still needs to take this into account either way regardless
> of the syntax.
>
> I'm considering to add something like the following 'umull12'
> fake instruction:
>
>     umull12     v1.4s, v0.4h, v0.8h
>
> which directly translates into
>
>     vmull.u16   q1,    d0,    d1
>
> So that we can write 32-bit assembly code in the AArch64 style.
> And use then use conditional compilation (#ifdef __aarch64__)
> for a few parts of code where we really need to have separate
> code paths for 32-bit and 64-bit mode.
>
> And such fake instructions are easy to grep in the source code
> if we do automatic 32-bit -> 64-bit conversion.
>
>
> The most challenging differences are the VZIP/VUZP instructions
> and also structure loads.
>
> Frankly speaking, I like the VZIP/VUZP behaviour change in AArch64.
> The 32-bit way of handling them was a 'destructive' update of the
> registers in-place. And for software pipelining we want to have
> a 'non-destructive' form to move the intermediate results between
> registers instead of updating them in-place. Fortunately pixman
> does not use these instructions too much. They are just used for
> conversion between 'planar' and 'packed' form via a sequence of
> 4 instructions as explained at:
>     https://cgit.freedesktop.org/pixman/tree/pixman/pixman-arm-neon-asm.S?id=pixman-0.34.0#n128
>
> This sequence of 4 instructions needs an AArch64 equivalent. Now we
> have macros 'pixdeinterleave' and 'pixinterleave' macros:
>    https://cgit.freedesktop.org/pixman/tree/pixman/pixman-arm-neon-asm.h?id=pixman-0.34.0#n347
>
> For the sake of allowing better instructions scheduling, we may
> split them into 4 separate macros (and abstract the 32-bit/64-bit
> differences) via something like this:
>    pixdeinterleave_step1 ...
>    pixdeinterleave_step2 ...
>    pixdeinterleave_step3 ...
>    pixdeinterleave_step4 ...
> Then these different steps can be interleaved with other
> instructions in the code.
>
> The structure load/store instructions in AArch64 handle low
> 64-bit register halves (even numbered Dn registers). And in
> 32-bit mode, the Dn registers are contiguous. This introduces
> the need to re-allocate the registers a bit between the
> 32-bit and 64-bit modes. This is actually the biggest challenge
> for the unified syntax converter. If I can get this resolved
> in a simple and clean way, then the job is basically done.
>
>
>> * assumes integer register is 32-bit width implicitly
>
> All the integer registers in the NEON assembly code have named aliases.
> So the 32-bit and 64-bit differences are already fully abstracted. We
> have already solved this particular problem long before the AArch64
> even came to existence :-)
>
>> Once the script run correctly and generate something, I have to modify
>> all of the result.
>> So I concluded asm sources should be separated.
>
> Yes, if fully automated conversion does not work yet, then at least
> partially automated conversion makes sense. I have noticed that even
> the code enclosed under "#if 0" has also changed, so I already
> suspected that you used some scripts.
>
>> And I'm also afraid that, to generate aarch64 codes, original
>> pixman-arm-neon-asm*.S
>> may be modified, and it cause regressions to existing aarch32 worlds.
>
> Yes, the changes may be needed. But I don't expect them to be too
> disruptive.
>
>> The other point of view,
>> aarch32-neon is "fixed" because ARM won't add new instructions for aarch32
>> and existing pixman-arm-neon-asm*.S would be completed and stable. So we should
>> leave it as is.
>> Then aarch64 may continue to add new features and we should follow
>> them. So I think
>> aarch32-neon and aarch64-neon should be separated.
>> (Cortex-A32 ? Hmm...)
>
> There are still a lot of 32-bit ARM processors in use. And the
> current 32-bit assembly code is far from complete. We can still
> improve it. And will probably do this.
>
>> > So it looks like the prefetch either needs to be dropped in
>> > the 64-bit code, or we can still experiment with other prefetch
>> > strategies a bit more and find something that is more suitable.
>>
>> It's a point that I cannot decide that, should we use prefetch (none
>> or advance),
>> which memory to use (L1, L2, L3). Can we use L1 for this purpose ?
>> All of aarch64 have L2 cache ?
>> I think it should be configurable for each target CPUs.
>
> The pldl2strm prefetch modifier appears to be harmful for Cortex-A53.
> The pldl1keep works better. You can try to run
>
>     https://github.com/ssvb/tinymembench
>
> on your device to get some memory related benchmark numbers (and
> probably share this information with us). On the PINE64 board I
> get this:
>
>     https://github.com/ssvb/tinymembench/wiki/PINE64-(Allwinner-A64)
>
> The benchmark should show which prefetch method works better. I have
> tried to design the code so that we can support different prefetch
> methods relatively easily. The advanced prefetch is probably a bad
> fit for AArch64 because the condition execution is not supported
> anymore. We can also replace the 'advanced' prefetch with some other
> implementation on AArch64. You can try the simple-distance-ahead
> and same-pixel-on-next-scanline prefetch strategies.
>
>> > + umlsl v0.4s, v2.4h, v15.h[0]
>> > + umlal2 v0.4s, v2.8h, v15.h[0]
>>
>> !!!
>> I don't know the syntax. Yes, it can reduce many unnecessaly register movings.
>
> The ARMv8 architecture manual has documentation for these instructions.
> As for the GNU assembler syntax, I had to look into binutils sources in
> order to figure out what would be the appropriate form.
>
>> > Moreover, this code can be the same in both 32-bit and 64-bit
>> > code. I have just submitted a patch for the 32-bit NEON code,
>> > which can reduce the difference between the 32-bit and 64-bit
>> > code:
>> >     https://lists.freedesktop.org/archives/pixman/2016-April/004489.html
>>
>> Yes, these "common" and "no-side-effect (to aarch32)" modification would be OK.
>
> Thanks! Can I have your Acked-by: or Reviewed-by: for this patch?
>
>> But, I'm afraid again, a patch to gain aarch64 performance, make regression
>> for aarch32 must not be applied.
>
> Don't worry, we are not going to regress the AArch32 support.
>
>> > I would suggest to first take the 'pixman-arma64-neon-asm-bilinear.S'
>> > file out of your initial patch submission in order to reduce its size
>> > and make the review process more manageable. Such reduced patch can be
>> > probably small enough to reach the mailing list.
>>
>> OK. I agree.
>> At first, we should start minimal set of codes.
>
> Well, I don't want to make it more difficult than necessary. Submitting
> the whole pixman-arm-neon-asm.S file in one go is fine for me. You can
> squash my patches into your code and simply mention this in the commit
> message with a reference to the mailing list archive. You should
> probably also add your own copyright notice to the new assembly files.
>
> If you have difficulties getting rid of some redundant instructions,
> just ask and I will try to help. We really want to have a performance
> parity between 32-bit and 64-bit assembly code. Because this is a
> deciding factor for people when they are considering to switch from
> the 32-bit to 64-bit userland. The Raspberry Pi 3 is yet to make this
> decision. And I don't want to see people starting to make claims that
> the 64-bit mode is slow :-)
>
> If we don't get rid of the redundant instructions right now, then
> 32-bit and 64-bit assembly sources may diverge and it will become
> much more difficult to do this job later.
>
>> I'll also planned to disable "nearest" codes, is it OK ? (or anyone use
>> it ? How about "0565" ?)
>
> It is a part of the functionality, provided by the pixman API. There is
> no reason to drop nearest scaling or rgb565 support. I would say that we
> should still have these optimizations available for AArch64.
>
> As I said above, if some of the rgb565 code is difficult to make as
> efficient as the 32-bit variant, then I can try to help. Just fix
> whatever you can.
>
>> > I also see a lot of label renames in your 64-bit code. Were all of them
>> > really necessary? Maybe we should also rename the labels in the 32-bit
>> > assembly source file if such renames are justified?
>>
>> I've met some compiler (gnu as) bug.
>> In nested macro, "gnu-as" failed to deternine the label and it causes
>> infinite loop.
>
> It might be nice to report this bug to binutils (if you haven't done
> this yet).
>
>> Since aarch64 have removed many "conditional" syntax, I have to add many
>> branch instructions especially for prefetching. So the bug occurs (or
>> I've failed
>> to follow the gnu-as rule).
>> I don't want to be annoyed such difficult-to-find behavior, all label names were
>> renamed not to cause confliction.
>
> We still can do the same labels rename in the 32-bit assembly source.
> This will allow us to reduce the difference between 32-bit and 64-bit
> assembly sources.
>
>> Anyway, I'll post new (simplified) patch again.
>
> Thanks. But maybe first try to remove as many redundant instructions
> as possible. You can create a repository at github and temporarily
> push the preliminary fixes there. Send a cleaned version to the mailing
> list when you think that it is ready or substantially improved.
>
> You can use the http://meldmerge.org/ tool to compare the differences
> between the 32-bit assembly source and the 64-bit assembly source on
> your computer locally. And also edit the source code directly in it.
>
> --
> Best regards,
> Siarhei Siamashka