[Pixman] [PATCH 0/7] Faster pipelined ARM NEON bilinear scalers: 'src_8888_8888' and 'src_8888_0565'

Siarhei Siamashka siarhei.siamashka at gmail.com
Mon Apr 4 20:41:53 PDT 2011


From: Siarhei Siamashka <siarhei.siamashka at nokia.com>

This patch set consists of two parts. The first 5 patches add minor
optimizations and tweaks to the bilinear scaler template macro, such
as aligned writes to the destination, unrolling by both 4 and 8 pixels
per iteration, and support for software pipelining. The last two patches
introduce optimized versions of the 'src_8888_8888' and 'src_8888_0565'
bilinear scaling functions.

The result is a ~25% faster bilinear scaled 'src_8888_8888' operation and
a 35-40% faster bilinear scaled 'src_8888_0565' operation on different
ARM Cortex-A8 processors. Approximately 5% of this speedup is contributed
by the performance tweaks from the first 5 patches. Compared to
pixman-0.21.6, these functions are now going to be more than 12x faster:

Benchmark on ARM Cortex-A8 r1p3 @600MHz, 32-bit LPDDR @166MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  0.21.6: op=1, src=20028888, dst=10020565, speed=3.74 MPix/s
          op=1, src=20028888, dst=20028888, speed=3.86 MPix/s

  before: op=1, src=20028888, dst=10020565, speed=32.39 MPix/s
          op=1, src=20028888, dst=20028888, speed=38.52 MPix/s

  after:  op=1, src=20028888, dst=10020565, speed=46.25 MPix/s
          op=1, src=20028888, dst=20028888, speed=48.47 MPix/s

Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  0.21.6: op=1, src=20028888, dst=10020565, speed=6.39 MPix/s
          op=1, src=20028888, dst=20028888, speed=6.70 MPix/s

  before: op=1, src=20028888, dst=10020565, speed=61.95 MPix/s
          op=1, src=20028888, dst=20028888, speed=75.10 MPix/s

  after:  op=1, src=20028888, dst=10020565, speed=84.22 MPix/s
          op=1, src=20028888, dst=20028888, speed=93.11 MPix/s
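The ratios behind the summary above can be recomputed directly from the
tables. The following is only a small sketch for checking the arithmetic;
the dictionary layout and the function names are made up for this
illustration and are not part of pixman or its benchmark tools. The
throughput values are copied verbatim from the tables (dst=10020565
corresponds to 'src_8888_0565', dst=20028888 to 'src_8888_8888').

```python
# Throughput figures (MPix/s) copied from the benchmark tables above.
# The structure and names here are illustrative only.
RESULTS = {
    "Cortex-A8 r1p3 @600MHz": {
        "src_8888_0565": {"0.21.6": 3.74, "before": 32.39, "after": 46.25},
        "src_8888_8888": {"0.21.6": 3.86, "before": 38.52, "after": 48.47},
    },
    "Cortex-A8 r2p2 @1GHz": {
        "src_8888_0565": {"0.21.6": 6.39, "before": 61.95, "after": 84.22},
        "src_8888_8888": {"0.21.6": 6.70, "before": 75.10, "after": 93.11},
    },
}


def speedup(new, old):
    """Return how many times faster 'new' is than 'old'."""
    return new / old


def summarize(results):
    """Return (cpu, operation, speedup vs 'before', speedup vs 0.21.6) rows."""
    rows = []
    for cpu, ops in results.items():
        for op, r in ops.items():
            rows.append((cpu, op,
                         speedup(r["after"], r["before"]),
                         speedup(r["after"], r["0.21.6"])))
    return rows


if __name__ == "__main__":
    for cpu, op, vs_before, vs_old in summarize(RESULTS):
        print(f"{cpu:24s} {op}: {vs_before:.2f}x vs before, "
              f"{vs_old:.1f}x vs 0.21.6")
```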

These patches don't include any optimizations for the bilinear scaled
'src_0565_0565' fast path yet, even though it is quite important
for embedded systems. The problem is that a simple single-pass
implementation performs too many r5g6b5->x8r8g8b8 conversions
(4 times the number of destination pixels, because each destination
pixel is interpolated from 4 source pixels). So one more option
is to perform a two-pass conversion, with the first pass converting
source image scanlines to x8r8g8b8 format in temporary buffers
and the second pass using the already well optimized bilinear
'src_8888_0565' scanline function. This is going to result in
fewer r5g6b5->x8r8g8b8 conversions for any upscaling and also for
any downscaling with a scale factor less than 2x. And anyway,
downscaling by more than 2x gets into the high quality downscaling
domain. An additional use for the 'src_8888_0565' bilinear scanline
function is potential support for fast scaled YUV->RGB conversion. It
may be beneficial to first perform an unscaled YUV->x8r8g8b8 conversion
and then scale the result (this can easily be an overall performance
win for upscaling).
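The conversion count argument above can be illustrated with a rough
sketch. Everything below is hypothetical helper code written for this
explanation, not pixman code: the function names are invented, the
pixel expansion uses plain bit replication (an assumption about how
the conversion is done), and the two-pass count assumes that every
source pixel is converted exactly once.

```python
def expand_0565_to_x888(p):
    """Expand one r5g6b5 pixel to x8r8g8b8 (unused 'x' byte left as 0).

    Uses bit replication so that 0x1F maps to 0xFF; this is an assumed
    conversion method chosen for illustration.
    """
    r5 = (p >> 11) & 0x1F
    g6 = (p >> 5) & 0x3F
    b5 = p & 0x1F
    r8 = (r5 << 3) | (r5 >> 2)
    g8 = (g6 << 2) | (g6 >> 4)
    b8 = (b5 << 3) | (b5 >> 2)
    return (r8 << 16) | (g8 << 8) | b8


def conversions_single_pass(dst_w, dst_h):
    # Each destination pixel blends a 2x2 block of source pixels,
    # and each of those 4 pixels is converted on the fly.
    return 4 * dst_w * dst_h


def conversions_two_pass(src_w, src_h):
    # Assumption: the first pass converts each source pixel once
    # into a temporary x8r8g8b8 scanline buffer.
    return src_w * src_h


if __name__ == "__main__":
    src_w = src_h = 2000
    for dst in (4000, 2000, 1000, 500):  # 2x up, 1x, 2x down, 4x down
        print(dst,
              conversions_single_pass(dst, dst),
              conversions_two_pass(src_w, src_h))
```

Under these assumptions the two counts become equal at exactly 2x
downscaling in both dimensions, which matches the break-even point
mentioned above.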


The same patches are also available here:
  http://cgit.freedesktop.org/~siamashka/pixman/log/?h=sent/20110405-pipelined-neon-bilinear

Siarhei Siamashka (7):
  ARM: tweaked horizontal weights update in NEON bilinear scaling code
  ARM: use aligned memory writes in NEON bilinear scaling code
  ARM: support for software pipelining in bilinear macros
  ARM: use less ARM instructions in NEON bilinear scaling code
  ARM: support different levels of loop unrolling in bilinear scaler
  ARM: pipelined NEON implementation of bilinear scaled 'src_8888_8888'
  ARM: pipelined NEON implementation of bilinear scaled 'src_8888_0565'

 pixman/pixman-arm-neon-asm.S |  613 +++++++++++++++++++++++++++++++++++++-----
 1 files changed, 548 insertions(+), 65 deletions(-)

-- 
1.7.3.4


