[Pixman] [PATCH] Add support for aarch64 neon optimization

Tue Apr 5 10:28:19 UTC 2016

> It looks like you have used an automated process to convert the AArch32
> NEON code to AArch64. Will you be able to repeat that process for other
> code, or at least assist others to repeat your steps?

Sorry, but as I wrote before, all of the patches were converted by hand.
The "converter script" didn't work correctly.
# But the script was very helpful for me to understand the difference
# between aarch32 and aarch64 :)

> The reason I ask is that I have a large number of outstanding patches to
> the ARM NEON support.

Hmm...
How should we proceed with the implementation?

I've seen a comment that the current pixman-arm-neon-asm*.S files (which I
based my work on) were optimized for the older Cortex-A8, and your new
patches seem to work well on the latest Cortex chips.
If so, we should first apply your latest patches to master, and then
someone (or I?) should do the conversion to aarch64 again. That would be
good for both the aarch32 and aarch64 worlds.

# FYI: I spent 1 week converting all of the code,
# and 2 weeks getting it to pass all tests.

On 5 April 2016 at 03:53, Ben Avison <bavison at riscosopen.org> wrote:
> On Sat, 02 Apr 2016 13:30:58 +0100, Mizuki Asakura <ed6e117f at gmail.com>
> wrote:
>>
>> This patch only contains STD_FAST_PATH codes, not scaling (nearest,
>> bilinear) codes.
>
>
> Hi Mizuki,
>
> It looks like you have used an automated process to convert the AArch32
> NEON code to AArch64. Will you be able to repeat that process for other
> code, or at least assist others to repeat your steps?
>
> The reason I ask is that I have a large number of outstanding patches to
> the ARM NEON support. The process of getting them merged into the
> FreeDesktop git repository has been very slow because there aren't many
> people on this list with the time and ability to review them, however my
> versions are in many cases up to twice the speed of the FreeDesktop
> versions, and it would be a shame if AArch64 couldn't benefit from them.
> If your AArch64 conversion is a one-time thing, it will make it
> extremely difficult to merge my changes in.
>
>> After completing optimization of this patch, the scaling-related code
>> should be done.
>
>
> One of my aims was to implement missing "iter" routines so as to accelerate
> scaled plots for a much wider combination of pixel formats and Porter-Duff
> combiner rules than the existing limited selection of fast paths could
> cover. If you look towards the end of my patch series here:
>
> https://github.com/bavison/pixman/commits/arm-neon-release1
>
> you'll see that I discovered that I was actually outperforming Pixman's
> existing bilinear plotters so consistently that I'm advocating removing
> them entirely, with the additional advantage that it simplifies the code
> base a lot. So you might want to consider whether it's worth bothering
> converting those to AArch64 in the first place.
>
> I would maybe go so far as to suggest that you try converting all the iters
> first and only add fast paths if you find they do better than the iters.
> One of the drawbacks of using iters is that the prefetch code can't be as
> sophisticated - it can't easily be prefetching the start of the next row
> while it is still working on the end of the current one. But since hardware
> prefetchers are better now and conditional execution is hard in AArch64,
> this will be less of a drawback with AArch64 CPUs.
>
> I'll also repeat what has been said, that it's very neat the way the
> existing prefetch code sneaks calculations into pipeline stalls, but it was
> only ever really ideal for Cortex-A8. With Cortex-A7 (despite the number,
> actually a much more recent 32-bit core) I noted that it was impossible to
> schedule such complex prefetch code without adding to the cycle count, at
> least when the images were already in the cache.
>
> Ben