[Pixman] [PATCH 0/2] A couple of ARM improvements

Siarhei Siamashka siarhei.siamashka at gmail.com
Tue Apr 5 07:14:11 PDT 2011

On Mon, Apr 4, 2011 at 6:33 PM, Søren Sandmann Pedersen
<sandmann at cs.au.dk> wrote:
> I got myself an ARM netbook which by default is using subpixel
> antialiased text in 565 mode, so fast_composite_over_n_8888_0565_ca()
> is showing up on profiles.

Thanks. I'm a bit surprised that I have never encountered the use of
this function in the wild. But I guess it's another case of a
chicken/egg problem: somebody has cleverly disabled subpixel
antialiasing in default configuration on the systems I worked with
because it was slow, and that's why I did not have a chance to see the
problem and fix it.

Anyway, with more people hacking pixman on ARM devices we have more
chances to get all the relevant use cases identified and properly

> The following is an implementation of that
> operation in NEON assembly along with a tiny improvement to the
> 8888_8888_ca version, where a single quad-word instruction could be
> used instead of two double-word instructions.
> Some notes:
> - The combine_mask_ca replacement could be shared between 8888_ca and
>  0565_ca, except that the 565 version can ignore the mask alpha
>  because it doesn't need to compute a destination alpha.

One of the problems with tuning NEON optimizations for performance is
that it happens to be quite the opposite of what you are suggesting -
we are forced not to split the code into nice maintainable independent
parts, but have to combine multiple operations into larger blocks of
code so that the instructions can be freely reordered inside of them.
Omitting one or two instructions based on a macro argument is useful,
but pipeline stalls are still generally a bigger problem. There are
even supplementary macros such as 'convert_0565_to_8888',
'convert_0565_to_x888', 'convert_8888_to_0565',
'convert_four_0565_to_x888_packed' defined in the neon header file.
And they are used in some of the fast paths. But such fast paths are
not really fully optimized (see below for some rationale).
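
For readers unfamiliar with what these conversion macros compute, here
is a minimal scalar sketch in C. It is not the NEON code from the
header file; the function names 'sketch_0565_to_x888' and
'sketch_8888_to_0565' are hypothetical, and the bit replication used
for expansion is an assumption about a reasonable rounding choice
rather than a statement about what the macros do.

```c
#include <stdint.h>

/* Hypothetical scalar helper: expand one r5g6b5 pixel to x8r8g8b8.
 * The top bits of each channel are replicated into the low bits so
 * that a full-scale 5- or 6-bit value maps to 0xff. */
static uint32_t sketch_0565_to_x888(uint16_t p)
{
    uint32_t r = (p >> 11) & 0x1f;
    uint32_t g = (p >> 5) & 0x3f;
    uint32_t b = p & 0x1f;

    r = (r << 3) | (r >> 2);
    g = (g << 2) | (g >> 4);
    b = (b << 3) | (b >> 2);
    return (r << 16) | (g << 8) | b;
}

/* Hypothetical scalar helper: pack x8r8g8b8 (or a8r8g8b8) to r5g6b5
 * by dropping the low bits of each channel. Alpha is ignored. */
static uint16_t sketch_8888_to_0565(uint32_t p)
{
    return (uint16_t)(((p >> 8) & 0xf800) |
                      ((p >> 5) & 0x07e0) |
                      ((p >> 3) & 0x001f));
}
```

Every r5g6b5 operation pays for this unpack and repack on top of the
actual blending, which is the extra CPU work being discussed here.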

Getting the last remaining 20-30% of performance may easily make the
source code much bigger and less readable. One example is my latest
patchset with the final tunings for NEON bilinear scaling:
It really is a huge bloody mess for the pipelined main loop body, but
any cycle saved there clearly improves performance for scaling images
(no matter if the images fit into cache or not).

So the main question is where to put the most effort. I basically used
the test/lowlevel-blt-bench program to see whether the primary bottleneck
for any particular unscaled fast path is CPU or memory performance. If
the bottleneck was CPU, then it simply got more attention. Some sample
results from lowlevel-blt-bench (measured in MPix/s) from S5PC110, ARM
Cortex-A8 @1GHz are below. The operations having similar memory access
pattern are grouped together:

add_8888_8888       =  L1: 484.13  L2: 447.15  M: 95.44
over_8888_8888      =  L1: 338.44  L2: 306.22  M: 94.21
over_n_8888_8888_ca =  L1: 170.14  L2: 170.28  M: 90.22

over_8888_0565      =  L1: 267.83  L2: 222.30  M: 132.16
over_n_8888_0565_ca =  L1: 133.95  L2: 130.34  M: 118.65

over_n_8_0565       =  L1: 176.52  L2: 165.19  M: 155.15

As can be seen, 'over_n_8888_8888_ca' has a lot of headroom when
comparing L1 cache and memory performance numbers. It means that the
NEON unit already crunches data much faster than the memory controller
can provide it. But as another example, 'over_n_8_0565' is different -
it puts much less pressure on the memory controller (only 1+2 bytes to
read and 2 bytes to write per pixel), but additionally has to convert
from and to r5g6b5, which means more work for the NEON unit. So I
mostly ignored 'over_n_8888_8888_ca' and put a bit more effort into
properly optimizing 'over_n_8_0565' in order to avoid unnecessary
stalls:

I would generally recommend having a look at the code from
'over_n_8_0565', because it is a better example of NEON optimizations
than 'over_n_8888_8888_ca' which you took as a base for
'over_n_8888_0565_ca' code. The main difference between
'over_n_8888_8888_ca' and 'over_n_8888_0565_ca' is that the former has
a lot of headroom while the latter seems to be exactly on the
borderline, which makes more optimizations somewhat desirable, though
they are hardly going to improve anything significantly.
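
The headroom comparison can be written down as simple arithmetic. The
following C sketch uses the L1 and M numbers from the table above; the
struct, the function names and the 'bytes_per_pixel' values for the
two '_ca' operations are my assumptions (mask read + destination read
+ destination write), only the 5 bytes for 'over_n_8_0565' are stated
in the text.

```c
#include <assert.h>

/* One lowlevel-blt-bench result row (hypothetical struct). */
struct bench_row {
    const char *name;
    double l1_mpix;        /* MPix/s with the working set in L1 cache */
    double mem_mpix;       /* MPix/s with the working set in memory   */
    int    bytes_per_pixel; /* assumed reads + writes per pixel       */
};

static const struct bench_row bench_rows[] = {
    { "over_n_8888_8888_ca", 170.14,  90.22, 12 }, /* 4 + 4 read, 4 write */
    { "over_n_8888_0565_ca", 133.95, 118.65,  8 }, /* 4 + 2 read, 2 write */
    { "over_n_8_0565",       176.52, 155.15,  5 }, /* 1 + 2 read, 2 write */
};

/* Ratio of cached to uncached throughput. A value well above 1.0
 * suggests the operation is limited by memory, not by the CPU. */
static double headroom(const struct bench_row *r)
{
    assert(r->mem_mpix > 0.0);
    return r->l1_mpix / r->mem_mpix;
}

/* Approximate memory traffic in MB/s at the uncached rate. */
static double mem_traffic_mb_s(const struct bench_row *r)
{
    return r->mem_mpix * r->bytes_per_pixel;
}
```

With these numbers 'over_n_8888_8888_ca' comes out close to 1.9, while
the two r5g6b5 operations stay below 1.2, so for them any extra CPU
cycles show up almost directly in the uncached results.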

I guess we can take optimizations for any fast path functions, more
comments and code cleanups. But I just did not see a strong need to
get every last bit of performance for the functions which would not
benefit much unless working with perfectly cached data. Very
common functions like 'over_8888_8888' are notable exceptions; they
got full optimizations anyway :)

>  It may make sense to split this code into a new macro that takes an
>  "opaque_destination" argument. This would allow the same
>  optimization to be used for x8r8g8b8 destinations.

I also considered this earlier. For example 'over_8888_x888' could
clearly use fewer instructions than 'over_8888_8888'. But looking at
how 'over_8888_8888' is already excessively fast, I decided not to add
more such function variants. Unrolling and pipelining are already a bit
bad for compiled code size.

> - Further optimization is possible if we know that the solid source is
>  opaque, which is very often the case. When it is, there is no need
>  to multiply it onto the mask. An "opaque_source" argument could be
>  added to the macro as well and then separate x888 versions of the
>  function could be generated.

Yes, this could be used.
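
To make the saving concrete, here is a scalar C sketch of one channel
of component-alpha OVER with an optional opaque source shortcut. This
is not the NEON code; 'over_ca_channel' and its 'opaque_source'
parameter are hypothetical names, and 'mul_un8' is written here as the
usual rounded 8-bit multiply (a * b / 255), which I am assuming rather
than quoting from pixman.

```c
#include <stdint.h>

/* Rounded a * b / 255 for 8-bit normalized values. */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    unsigned t = (unsigned)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* One channel of component-alpha OVER with a premultiplied solid
 * source:
 *     dst' = src * mask + dst * (1 - src_alpha * mask)
 * If the caller knows src_alpha == 0xff, then src_alpha * mask is
 * just mask, so one multiply can be dropped. */
static uint8_t over_ca_channel(uint8_t src, uint8_t src_alpha,
                               uint8_t mask, uint8_t dst,
                               int opaque_source)
{
    uint8_t in_color = mul_un8(src, mask);
    uint8_t in_alpha = opaque_source ? mask : mul_un8(src_alpha, mask);
    unsigned out = in_color + mul_un8(dst, (uint8_t)(255 - in_alpha));

    return (uint8_t)(out > 255 ? 255 : out); /* saturate */
}
```

With this rounding, mul_un8(0xff, m) returns m for every m, so the
shortcut gives the same result as the general path when the source
alpha really is 0xff.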

> - It might be a win to optimize out destination access whenever the
>  src * mask is 0 or 1. However, this might require some more involved
>  changes to the code generation framework.

This is actually rather tricky for Cortex-A8 for many reasons:

And I think this can only be applied to functions with a lot of CPU
performance headroom, like 'over_8888_8888'.
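
For reference, the decision being proposed looks roughly like this in
scalar C. It only classifies a single source/mask pair; it says
nothing about how to do this efficiently in NEON, which is the hard
part. The enum, the function name 'classify_dst_access' and the
a8r8g8b8 premultiplied layout are assumptions made for illustration.

```c
#include <stdint.h>

enum dst_access {
    DST_SKIP,       /* src * mask is 0: destination is unchanged      */
    DST_WRITE_ONLY, /* src_alpha * mask is 1: destination is replaced */
    DST_READ_WRITE  /* general case: read, blend, write back          */
};

/* Rounded a * b / 255 for 8-bit normalized values. */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    unsigned t = (unsigned)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Classify one pixel of a component-alpha OVER with a premultiplied
 * a8r8g8b8 solid source and an a8r8g8b8 mask. */
static enum dst_access classify_dst_access(uint32_t src, uint32_t mask)
{
    uint8_t src_alpha = (uint8_t)(src >> 24);
    int all_zero = 1, all_one = 1;

    for (int shift = 0; shift < 32; shift += 8) {
        uint8_t s = (uint8_t)(src >> shift);
        uint8_t m = (uint8_t)(mask >> shift);
        uint8_t in_color = mul_un8(s, m);
        uint8_t in_alpha = mul_un8(src_alpha, m);

        if (in_color != 0 || in_alpha != 0)
            all_zero = 0;
        if (in_alpha != 0xff)
            all_one = 0;
    }
    if (all_zero)
        return DST_SKIP;
    if (all_one)
        return DST_WRITE_ONLY;
    return DST_READ_WRITE;
}
```

In a vectorized loop this test would have to cover a whole block of
pixels at once, and the resulting branch has its own cost.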

Best regards,
Siarhei Siamashka
