[Pixman] [PATCH] Add support for aarch64 neon optimization

Ben Avison bavison at riscosopen.org
Mon Apr 4 18:53:36 UTC 2016

On Sat, 02 Apr 2016 13:30:58 +0100, Mizuki Asakura <ed6e117f at gmail.com> wrote:
> This patch only contains STD_FAST_PATH codes, not scaling (nearest,
> bilinear) codes.

Hi Mizuki,

It looks like you have used an automated process to convert the AArch32
NEON code to AArch64. Will you be able to repeat that process for other
code, or at least assist others to repeat your steps?

The reason I ask is that I have a large number of outstanding patches to
the ARM NEON support. The process of getting them merged into the
FreeDesktop git repository has been very slow because there aren't many
people on this list with the time and ability to review them. However, my
versions are in many cases up to twice the speed of the FreeDesktop
versions, and it would be a shame if AArch64 couldn't benefit from them.
If your AArch64 conversion is a one-time thing, it will make it extremely
difficult to merge my changes in.

> After completing optimization this patch, scaling related codes should be done.

One of my aims was to implement missing "iter" routines so as to accelerate
scaled plots for a much wider combination of pixel formats and Porter-Duff
combiner rules than the existing limited selection of fast paths could
cover. If you look towards the end of my patch series here:


you'll see that I discovered that I was actually outperforming Pixman's
existing bilinear plotters so consistently that I'm advocating removing
them entirely, with the additional advantage that it simplifies the code
base a lot. So you might want to consider whether it's worth bothering
converting those to AArch64 in the first place.

I would maybe go so far as to suggest that you try converting all the iters
first and only add fast paths if you find they do better than the iters.
One of the drawbacks of using iters is that the prefetch code can't be as
sophisticated - it can't easily be prefetching the start of the next row
while it is still working on the end of the current one. But since hardware
prefetchers are better now and conditional execution is hard in AArch64,
this will be less of a drawback with AArch64 CPUs.
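
To make the iter idea concrete, here is a rough sketch in plain C rather
than NEON. It is not Pixman's actual iter interface: every name in it
(fetch_r5g6b5, combine_over, composite_row_0565_over_8888, SCRATCH_PIXELS)
is hypothetical, and the r5g6b5 source and OVER operator are just one
assumed combination. The point is that one fetcher per pixel format and
one combiner per Porter-Duff rule can be paired freely, where a fast path
has to be written for each combination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names here are hypothetical; this is not Pixman's real iter API. */
#define SCRATCH_PIXELS 1024

/* Fetch stage: expand one row of r5g6b5 pixels to a8r8g8b8. */
static void fetch_r5g6b5(const uint16_t *src, uint32_t *out, size_t width)
{
    for (size_t i = 0; i < width; i++) {
        uint32_t p = src[i];
        uint32_t r = (p >> 11) & 0x1f, g = (p >> 5) & 0x3f, b = p & 0x1f;
        /* Replicate the top bits into the low bits to fill 8 bits. */
        r = (r << 3) | (r >> 2);
        g = (g << 2) | (g >> 4);
        b = (b << 3) | (b >> 2);
        out[i] = 0xff000000u | (r << 16) | (g << 8) | b;
    }
}

/* Rounded x * a / 255 for 8-bit values. */
static uint32_t mul_un8(uint32_t x, uint32_t a)
{
    uint32_t t = x * a + 0x80;
    return (t + (t >> 8)) >> 8;
}

/* Combine stage: premultiplied OVER, dest = src + dest * (1 - src alpha). */
static void combine_over(uint32_t *dest, const uint32_t *src, size_t width)
{
    for (size_t i = 0; i < width; i++) {
        uint32_t s = src[i], d = dest[i], ia = 255 - (s >> 24);
        uint32_t out = 0;
        for (int shift = 0; shift < 32; shift += 8) {
            uint32_t c = ((s >> shift) & 0xff)
                       + mul_un8((d >> shift) & 0xff, ia);
            if (c > 255)
                c = 255; /* saturate, in case the input was not premultiplied */
            out |= c << shift;
        }
        dest[i] = out;
    }
}

/* One row of the composite: fetch into scratch, then combine into dest.
 * Each stage sees only the current row, which is why it cannot easily
 * prefetch the start of the next one. */
static void composite_row_0565_over_8888(const uint16_t *src, uint32_t *dest,
                                         size_t width)
{
    uint32_t scratch[SCRATCH_PIXELS];
    assert(width <= SCRATCH_PIXELS);
    fetch_r5g6b5(src, scratch, width);
    combine_over(dest, scratch, width);
}
```

A caller would invoke composite_row_0565_over_8888 once per scanline. A
fast path would fuse the two loops and skip the scratch buffer, which is
the saving to weigh against the extra code.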

I'll also repeat what has been said before: it's very neat the way the
existing prefetch code sneaks calculations into pipeline stalls, but it was
only ever really ideal for Cortex-A8. With Cortex-A7 (despite the number,
actually a much more recent 32-bit core) I noted that it was impossible to
schedule such complex prefetch code without adding to the cycle count, at
least when the images were already in the cache.

