[Pixman] [PATCH 00/10]

Siarhei Siamashka siarhei.siamashka at gmail.com
Wed Nov 3 16:22:15 PDT 2010

From: Siarhei Siamashka <siarhei.siamashka at nokia.com>

This patchset adds ARM NEON optimized fast paths for a few types of
compositing operations where the source image is scaled with a nearest
filter. The operations to optimize were selected based on what the
'scaling-test' program currently supports, but any existing ARM NEON
optimizations can be relatively easily reused and extended to support
nearest scaling of the source image.
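To illustrate the idea, here is a minimal scalar model of nearest-filter
source fetching using 16.16 fixed-point coordinate stepping (the scheme
pixman uses for its coordinates); the function and variable names are
illustrative, not the actual pixman identifiers, and details such as the
initial coordinate bias are omitted:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar sketch of a nearest-scaled scanline fetch: 'vx' and 'unit_x'
 * are 16.16 fixed-point values; the integer part of 'vx' selects the
 * nearest source pixel for each destination pixel. */
static void
fetch_nearest_scanline (const uint32_t *src, uint32_t *dst, int width,
                        int32_t vx, int32_t unit_x)
{
    while (width--)
    {
        *dst++ = src[vx >> 16];  /* pick the nearest source pixel */
        vx += unit_x;            /* advance by the scaling step    */
    }
}
```

For example, with unit_x == 0x8000 (0.5 in 16.16 fixed point, i.e. 2x
upscaling) and vx starting at 0, every source pixel is emitted twice.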

The code from these patches builds on the main loop template from the
header file 'pixman-fast-path.h', which was introduced earlier in:

This template still has a few shortcomings. The NONE repeat still causes
problems for some operations. REFLECT repeat support should also be added
in order to finally replace and retire the NEAREST_FAST_PATH macro from
'pixman-fast-path.c'. Moreover, it looks like support for scaled fast
paths with an a8 mask is also needed for some practical use cases. I'm
experimenting with generalizing and further improving the scaled fast
path template in the following branch (it's not fully ready yet):

Regarding the performance of this code: the pixel fetcher supporting
nearest scaling for NEON fast paths is still not fully tuned. Things to
try are:
* experiment with prefetch instructions to see if they can improve performance
  (they seem to be useless for the scaled source, but the prefetch effect for
  mask and destination still can be investigated)
* deinterleaving of color components for the scaled source images is currently
  expensive, so selectively using it for the destination and mask while
  disabling it for the source may be tried
* alternative scaled pixel fetching methods may be tried:
  a) use VLD1 instructions to load pixel data into NEON (as is done now)
  b) use LDR and ORR instructions on the ARM side to construct the pixel
     data in ARM registers and then move it to the NEON side using VMOV
     (more work for the ARM pipeline, less work for NEON)
  c) try using a single SMLATB instruction instead of a MOV + ADD pair
     (SMLATB is slow, takes 2 cycles and has much higher latency, but may
     reduce pressure on the instruction decoder if that is the bottleneck)
  d) try splitting the scaled pixel fetcher into parts in order to better
     interleave ARM and NEON instructions and avoid possible bubbles in
     the NEON pipeline
  in all cases, instruction scheduling might still be improved further

The patches definitely improve performance. Still, in some cases the
improvement from these new fast paths over doing a separate scaling pass
followed by a separate composite operation is not that large. One case
that works somewhat better than the others is src_0565_8888 (all the
performance numbers are in the commit messages). It achieves
76.98 MPix/s, while separate scaling runs at 94.91 MPix/s and separate
565->8888 conversion at 137.78 MPix/s. So the throughput of doing scaling
and then compositing separately can be estimated as:
    1 / (1 / 94.91 + 1 / 137.78) = 56.2 MPix/s
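The estimate above simply adds the per-pixel time of the two passes, so
the rates combine harmonically. A tiny helper makes this arithmetic
explicit (the function name is mine, just for illustration):

```c
#include <assert.h>
#include <math.h>

/* Two sequential passes: time per pixel adds, so the combined rate is
 * the reciprocal of the sum of reciprocals of the individual rates. */
static double
combined_throughput (double rate_a, double rate_b)
{
    return 1.0 / (1.0 / rate_a + 1.0 / rate_b);
}
```

combined_throughput (94.91, 137.78) gives about 56.2 MPix/s, so the
single-pass 76.98 MPix/s figure is roughly a 37% improvement over the
estimated two-pass approach.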
These numbers are for a scaling factor very close to 1x. The estimate
also does not take into account data locality (the intermediate temporary
data could stay in the L1 cache). But my older experiments with memory
performance on OMAP3 chips (ARM Cortex-A8 CPU) showed that, contrary to
expectations, copying in separate steps like
    "source buffer -> L1 cache" and then "L1 cache -> destination buffer"
is about twice as slow as a direct copy
    "source buffer -> destination buffer".
I have some theories about why this may happen. One is that the best
total memory throughput may be achieved when memory is read and written
at the same time (a kind of full-duplex operation). Another is that the
data in the intermediate buffer may be evicted from the L1 cache too
often (due to its random replacement policy). These are just speculations
(they may have nothing in common with the real cause), but the end result
is the same: the default "fetch" -> "combine" -> "store" pixman pipeline
is slow on this type of hardware.
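For reference, the two access patterns being compared look roughly like
this (an illustrative model, not the actual benchmark code; the TILE size
and function names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TILE 2048  /* hypothetical L1-cache-sized intermediate buffer */

/* Direct copy: a single pass over memory. */
static void
copy_direct (uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy (dst, src, n);
}

/* Two-step copy through a small intermediate buffer, modelling the
 * temporary scanline storage of a "fetch -> combine -> store" pipeline.
 * The result is identical to copy_direct; only the memory access
 * pattern differs. */
static void
copy_two_step (uint8_t *dst, const uint8_t *src, size_t n)
{
    uint8_t tmp[TILE];
    while (n)
    {
        size_t chunk = n < TILE ? n : TILE;
        memcpy (tmp, src, chunk);  /* source buffer -> L1-resident tmp */
        memcpy (dst, tmp, chunk);  /* tmp -> destination buffer        */
        src += chunk;
        dst += chunk;
        n   -= chunk;
    }
}
```

Both functions produce the same bytes in the destination; the observed
2x slowdown of the two-step variant is purely a memory-system effect.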

So naturally, when implementing scaled fast paths, I'm currently also
favouring single-pass processing for everything. Hopefully the
performance can still be improved further by tweaking the scaled pixel
fetcher. But of course, it may turn out that scaling (scattered small
memory accesses) puts more pressure on the load/store unit and changes
the whole picture significantly. Let's see.

These patches are also available here:

Siarhei Siamashka (10):
  ARM: fix 'vld1.8'->'vld1.32' typo in add_8888_8888 NEON fast path
  ARM: NEON: source image pixel fetcher can be overrided now
  ARM: nearest scaling support for NEON scanline compositing functions
  ARM: macro template in C code to simplify using scaled fast paths
  ARM: performance tuning of NEON nearest scaled pixel fetcher
  ARM: NEON optimization for scaled over_8888_8888 with nearest filter
  ARM: NEON optimization for scaled over_8888_0565 with nearest filter
  ARM: NEON optimization for scaled src_8888_0565 with nearest filter
  ARM: NEON optimization for scaled src_0565_8888 with nearest filter
  ARM: optimization for scaled src_0565_0565 with nearest filter

 pixman/pixman-arm-common.h   |   40 ++++++++
 pixman/pixman-arm-neon-asm.S |  100 ++++++++++++++-----
 pixman/pixman-arm-neon-asm.h |  216 +++++++++++++++++++++++++++++++++++++++---
 pixman/pixman-arm-neon.c     |   30 ++++++
 pixman/pixman-arm-simd-asm.S |   70 ++++++++++++++
 pixman/pixman-arm-simd.c     |    7 ++
 6 files changed, 423 insertions(+), 40 deletions(-)

