[Pixman] [PATCH 0/7] Faster pipelined ARM NEON bilinear scalers: 'src_8888_8888' and 'src_8888_0565'

Sun Apr 10 10:00:54 PDT 2011

On Tue, Apr 5, 2011 at 6:41 AM, Siarhei Siamashka
<siarhei.siamashka at gmail.com> wrote:
> The result is ~25% faster bilinear scaled 'src_8888_8888' operation and
> 35-40% faster bilinear scaled 'src_8888_0565' operation on different
> ARM Cortex-A8 processors. Approximately 5% from this speedup is contributed
> by the performance tweaks from the first 5 patches. Comparing to
> pixman-0.21.6, these functions are now going to be more than 12x faster:
>
> Benchmark on ARM Cortex-A8 r1p3 @600MHz, 32-bit LPDDR @166MHz:
>  Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
>  0.21.6: op=1, src=20028888, dst=10020565, speed=3.74 MPix/s
>          op=1, src=20028888, dst=20028888, speed=3.86 MPix/s
>
>  before: op=1, src=20028888, dst=10020565, speed=32.39 MPix/s
>          op=1, src=20028888, dst=20028888, speed=38.52 MPix/s
>
>  after:  op=1, src=20028888, dst=10020565, speed=46.25 MPix/s
>          op=1, src=20028888, dst=20028888, speed=48.47 MPix/s
>
> Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
>  Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
>  0.21.6: op=1, src=20028888, dst=10020565, speed=6.39 MPix/s
>          op=1, src=20028888, dst=20028888, speed=6.70 MPix/s
>
>  before: op=1, src=20028888, dst=10020565, speed=61.95 MPix/s
>          op=1, src=20028888, dst=20028888, speed=75.10 MPix/s
>
>  after:  op=1, src=20028888, dst=10020565, speed=84.22 MPix/s
>          op=1, src=20028888, dst=20028888, speed=93.11 MPix/s

And actually now I think that we can do better than this, especially
when using x8r8g8b8 or r5g6b5 as a destination. I'm currently
investigating splitting color components before doing interpolation
and working with a "planar" intermediate rgb representation. This
causes the following changes:
+ we may skip interpolation for the unneeded alpha channel
+ horizontal weights need to be updated just once per 8 pixels
instead of once per each 2 pixels, which reduces the number of
instructions
+ planar intermediate rgb format is easier to convert to r5g6b5 if it
is used for destination
- we actually need to add code to split color components, which adds a
bunch of VUZP instructions; fortunately some of them may potentially
dual-issue on Cortex-A8
- more challenging register allocation
- handling of leading/trailing 1/2/4 pixels before/after the main loop
is going to be a little bit more tricky
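
To illustrate the data layout only, here is a minimal scalar C sketch of
the "planar" idea: one block of 8 packed pixels is split into separate
r, g and b planes (alpha is dropped), and the planes are then packed to
r5g6b5. The function names and the BLOCK constant are hypothetical, the
interpolation step itself is left out, and the real NEON code would do
the split with VUZP rather than with a loop.

```c
#include <stdint.h>

#define BLOCK 8 /* pixels per step, mirroring "once per 8 pixels" above */

/* Pack one pixel from separate 8-bit components into r5g6b5. */
uint16_t pack_0565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r & 0xF8) << 8) | ((g & 0xFC) << 3) | (b >> 3));
}

/* Hypothetical helper: split BLOCK packed 32-bit pixels (alpha or
 * unused byte on top, then r, g, b) into three planes. The top byte
 * is simply ignored, so no work is spent on it later. */
void split_planar(const uint32_t *src, uint8_t *r, uint8_t *g, uint8_t *b)
{
    for (int i = 0; i < BLOCK; i++) {
        r[i] = (uint8_t)(src[i] >> 16);
        g[i] = (uint8_t)(src[i] >> 8);
        b[i] = (uint8_t)src[i];
    }
}

/* Hypothetical helper: convert BLOCK planar pixels to r5g6b5. Each
 * plane is read sequentially, which is what makes this step simple. */
void planar_to_0565(const uint8_t *r, const uint8_t *g, const uint8_t *b,
                    uint16_t *dst)
{
    for (int i = 0; i < BLOCK; i++)
        dst[i] = pack_0565(r[i], g[i], b[i]);
}
```

Calling split_planar() and then planar_to_0565() on a block of 8 source
pixels gives the same r5g6b5 values as converting each packed pixel
individually; only the intermediate layout differs.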

And even with the current code, one improvement would be possible. We
currently do the following for each 2 pixels:
  vshr.u16  q15, q12, #8  /* get vector of 'distx' values from 'x' vector */
  vadd.u16  q12, q12, q13 /* update vector of 'x' values */
  vshll.u16 q0, d2, #8    /* tmp = tl * 256 */
  vmlsl.u16 q0, d2, d30   /* tmp -= tl * distx */
  vmlal.u16 q0, d3, d30   /* tmp += tr * distx */
  vshll.u16 q10, d22, #8  /* ... similar for the second pixel using */
  vmlsl.u16 q10, d22, d31 /*     the other half of 'distx' vector */
  vmlal.u16 q10, d23, d31
But actually the two VSHLL instructions can be replaced by one
calculation of a 128-bit vector of '256 - distx' values, replacing
VSHLL+VMLSL+VMLAL+VSHLL+VMLSL+VMLAL with VSUB+VMLAL+VMLAL+VMLAL+VMLAL
and saving 0.5 cycles per pixel. This would require two extra 128-bit
NEON registers though. Fortunately they are readily available for the
src_8888_8888 fast path.
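
The rewrite relies on tl * 256 - tl * distx + tr * distx being equal to
tl * (256 - distx) + tr * distx. A minimal scalar C model of both forms
follows; the function names are hypothetical, and it models one 16-bit
lane with 'distx' in the range 0..255, as produced by the shift by 8
above. It says nothing about instruction scheduling.

```c
#include <stdint.h>

/* Current form: tmp = tl * 256; tmp -= tl * distx; tmp += tr * distx. */
uint32_t horiz_current(uint16_t tl, uint16_t tr, uint16_t distx)
{
    uint32_t tmp = (uint32_t)tl << 8;
    tmp -= (uint32_t)tl * distx;
    tmp += (uint32_t)tr * distx;
    return tmp;
}

/* Proposed form: compute '256 - distx' once, then two multiplies. */
uint32_t horiz_proposed(uint16_t tl, uint16_t tr, uint16_t distx)
{
    uint16_t inv = (uint16_t)(256 - distx);
    uint32_t tmp = (uint32_t)tl * inv;
    tmp += (uint32_t)tr * distx;
    return tmp;
}

/* Returns 1 if both forms agree for every distx in 0..255 and for a
 * grid of 256 x 256 sample values of tl and tr covering 0..65535. */
int horiz_forms_agree(void)
{
    for (uint32_t tl = 0; tl <= 65535; tl += 257)
        for (uint32_t tr = 0; tr <= 65535; tr += 257)
            for (uint32_t distx = 0; distx < 256; distx++)
                if (horiz_current((uint16_t)tl, (uint16_t)tr, (uint16_t)distx) !=
                    horiz_proposed((uint16_t)tl, (uint16_t)tr, (uint16_t)distx))
                    return 0;
    return 1;
}
```

Because 'distx' never exceeds 255, the subtraction in the current form
cannot underflow, so the two forms are interchangeable in unsigned
32-bit arithmetic.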

One more possible optimization is to reduce interpolation precision
from 8 bit to 4 bit as suggested earlier by Taekyun Kim:
http://lists.freedesktop.org/archives/pixman/2011-February/001044.html
This can make final shifting and color components packing faster,
saving 1 instruction per pixel. Considering that the bilinear code
currently needs ~10 cycles per pixel, this can provide ~10% speedup.
That's the last desperate measure, but given that src_8888_8888
bilinear scaling utilizes ~58% of memory bandwidth (~93 MPix/s vs.
~160 MPix/s for nearest scaling and vs. ~650 MB/s for memcpy) on my
ARM device, anything might be useful. I just wonder about the current
filter types used in pixman. They are:
    PIXMAN_FILTER_FAST,
    PIXMAN_FILTER_GOOD,
    PIXMAN_FILTER_BEST,
    PIXMAN_FILTER_NEAREST,
    PIXMAN_FILTER_BILINEAR
I wonder what would happen if we introduce some 'low precision'
bilinear filter and map it to PIXMAN_FILTER_GOOD? PIXMAN_FILTER_BEST
and PIXMAN_FILTER_BILINEAR would still remain the same as the existing
bilinear. PIXMAN_FILTER_FAST and PIXMAN_FILTER_NEAREST would still
remain as the existing nearest. Of course this may be only useful if
such lower precision bilinear filter proves to provide much better
image quality than nearest filter and noticeably better performance
than real bilinear filter.
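
As a rough scalar illustration of what such a 'low precision' filter
would compute, here is a C sketch of bilinear interpolation for a
single 8-bit color component with a configurable number of weight bits.
The function names are hypothetical and the result is truncated rather
than rounded; this is an assumption of the sketch, not a statement
about pixman's code. One thing it shows is that with 4-bit weights the
largest intermediate value fits into 16 bits, while 8-bit weights need
more than 16 bits.

```c
#include <stdint.h>

/* Bilinear interpolation of one 8-bit component.
 * distx and disty are weights in the range 0..(1 << bits) - 1. */
uint8_t bilinear_channel(uint8_t tl, uint8_t tr, uint8_t bl, uint8_t br,
                         unsigned distx, unsigned disty, unsigned bits)
{
    unsigned one = 1u << bits;
    uint32_t top = tl * (one - distx) + tr * distx;    /* top row    */
    uint32_t bot = bl * (one - distx) + br * distx;    /* bottom row */
    uint32_t acc = top * (one - disty) + bot * disty;  /* vertical   */
    return (uint8_t)(acc >> (2 * bits));               /* truncate   */
}

/* Low precision variant: takes the usual 8-bit weights and keeps only
 * their upper 4 bits. */
uint8_t bilinear_channel_4bit(uint8_t tl, uint8_t tr, uint8_t bl, uint8_t br,
                              unsigned distx8, unsigned disty8)
{
    return bilinear_channel(tl, tr, bl, br, distx8 >> 4, disty8 >> 4, 4);
}

/* Largest possible accumulator value for a given weight precision. */
uint32_t max_accumulator(unsigned bits)
{
    return 255u << (2 * bits);
}
```

Comparing bilinear_channel(..., 8) against bilinear_channel_4bit() on
real images would be one way to judge whether the quality loss is
acceptable.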

-- 
Best regards,
Siarhei Siamashka