[Pixman] [PATCH 0/2] 7-bit bilinear interpolation precision

Fri Jul 6 00:04:45 PDT 2012

On Thu, Jul 5, 2012 at 10:22 AM, Siarhei Siamashka
<siarhei.siamashka at gmail.com> wrote:
> Separable scaling is good idea, but it is not a silver bullet.
> Downscaling is still a valid use case, and separable scaling would
> provide no reduction for the number of arithmetic operations for it.
> Also x86 SSSE3 and ARM NEON add some extra challenges:
> * Using 8-bit multiplications for horizontal interpolation is
> difficult as the weight factors need to be updated for each pixel.
> Single pass scaling can easily use 8-bit multiplications for vertical
> interpolation as the weight factors are pre-calculated before
> entering loop.
> * Separable scaling needs extra load/store instructions to save
> temporary data between passes
> * When we are approaching the memory speed barrier, the separation of
> operations into passes may result in uneven usage of memory subsystem.

Regarding the last point, uneven memory bandwidth usage: the
firefox-fishtank trace is a good example. This trace uses the
bilinear over_8888_8_8888 operation, which now has a special
single-pass fast path for ARM NEON. But we can also decompose it
into "bilinear src_8888_8888" + "unscaled over_8888_8_8888" and get
the same result in two passes with a tweaked variant of
    http://hg.mozilla.org/mozilla-central/rev/87dbb95cde7d
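
To make the decomposition concrete, here is a minimal scalar sketch in C
of the 2-pass idea. It is not the code from the changeset above and not
the NEON fast path: the function names, the 8-bit weights, the 16.16
fixed-point x coordinate and the TMP_WIDTH chunk size are all assumptions
chosen for illustration. Pass 1 scales one chunk of a scanline into a
small temporary buffer, pass 2 composites that buffer through the a8 mask
onto the destination.

```c
#include <assert.h>
#include <stdint.h>

#define TMP_WIDTH 256 /* hypothetical chunk size, small enough to stay in L1 */

/* Multiply each channel of a packed 32-bit pixel by a/255, with rounding. */
static uint32_t mul_un8x4(uint32_t p, uint32_t a)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t t = ((p >> shift) & 0xff) * a + 0x80;
        out |= (((t + (t >> 8)) >> 8) & 0xff) << shift;
    }
    return out;
}

/* Bilinear blend of four pixels; wx and wy are weights in [0, 256). */
static uint32_t bilinear_pixel(uint32_t tl, uint32_t tr, uint32_t bl,
                               uint32_t br, uint32_t wx, uint32_t wy)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t top = ((tl >> shift) & 0xff) * (256 - wx)
                     + ((tr >> shift) & 0xff) * wx;
        uint32_t bot = ((bl >> shift) & 0xff) * (256 - wx)
                     + ((br >> shift) & 0xff) * wx;
        out |= ((top * (256 - wy) + bot * wy) >> 16) << shift;
    }
    return out;
}

/* Pass 1: "bilinear src_8888_8888" into the temporary buffer.
 * x and dx are 16.16 fixed point; the caller guarantees that
 * top[] and bot[] are readable at index (x >> 16) + 1. */
static void pass1_bilinear_src(uint32_t *tmp, const uint32_t *top,
                               const uint32_t *bot, int n,
                               uint32_t x, uint32_t dx, uint32_t wy)
{
    for (int i = 0; i < n; i++, x += dx) {
        uint32_t xi = x >> 16, wx = (x >> 8) & 0xff;
        tmp[i] = bilinear_pixel(top[xi], top[xi + 1],
                                bot[xi], bot[xi + 1], wx, wy);
    }
}

/* Pass 2: "unscaled over_8888_8_8888", premultiplied alpha assumed. */
static void pass2_over_mask(uint32_t *dst, const uint32_t *tmp,
                            const uint8_t *mask, int n)
{
    for (int i = 0; i < n; i++) {
        uint32_t s = mul_un8x4(tmp[i], mask[i]);
        uint32_t d = mul_un8x4(dst[i], 255 - (s >> 24));
        uint32_t out = 0;
        for (int shift = 0; shift < 32; shift += 8) {
            uint32_t c = ((s >> shift) & 0xff) + ((d >> shift) & 0xff);
            out |= (c > 255 ? 255 : c) << shift; /* saturating add */
        }
        dst[i] = out;
    }
}

/* One destination scanline, processed chunk by chunk in two passes. */
void two_pass_scanline(uint32_t *dst, const uint8_t *mask,
                       const uint32_t *top, const uint32_t *bot,
                       int width, uint32_t x, uint32_t dx, uint32_t wy)
{
    uint32_t tmp[TMP_WIDTH];
    assert(width >= 0 && wy < 256);
    while (width > 0) {
        int n = width < TMP_WIDTH ? width : TMP_WIDTH;
        pass1_bilinear_src(tmp, top, bot, n, x, dx, wy);
        pass2_over_mask(dst, tmp, mask, n);
        x += dx * (uint32_t)n;
        dst += n;
        mask += n;
        width -= n;
    }
}
```

Processing in chunks keeps the temporary buffer cache resident, which is
the property the 2-pass variant relies on.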

=== ARM Cortex-A8 1GHz (DM3730), no monitor connected ===

lowlevel-blt-bench (the numbers are MPix/s):
  unscaled src_8888_8888    =  L1: 648.29  L2: 371.80  M:127.12
  bilinear src_8888_8888    =  L1: 102.88  L2:  91.11  M: 80.57
  unscaled over_8888_8_8888 =  L1: 167.87  L2: 157.61  M: 59.70

  1-pass bilinear over_8888_8_8888 =  L1:  55.27  L2:  50.00  M: 42.42
  2-pass bilinear over_8888_8_8888 =  L1:  63.50  L2:  58.10  M: 35.33

cairo-perf-trace (the numbers are seconds):
  1-pass:  image      firefox-fishtank  343.751  344.139 0.08%    6/6
  2-pass:  image      firefox-fishtank  362.394  364.273 0.31%    6/6

=== ARM Cortex-A8 1GHz (DM3730), 1280x1024-32 at 57Hz monitor ===

lowlevel-blt-bench (the numbers are MPix/s):
  unscaled src_8888_8888    =  L1: 650.22  L2: 240.76  M: 84.80
  bilinear src_8888_8888    =  L1: 102.11  L2:  90.44  M: 70.74
  unscaled over_8888_8_8888 =  L1: 168.50  L2: 154.84  M: 46.92

  1-pass bilinear over_8888_8_8888 =  L1:  55.17  L2:  49.95  M: 37.61
  2-pass bilinear over_8888_8_8888 =  L1:  63.50  L2:  58.51  M: 31.35

cairo-perf-trace (the numbers are seconds):
  1-pass:  image      firefox-fishtank  404.228  405.542 0.16%    6/6
  2-pass:  image      firefox-fishtank  420.228  423.530 0.62%    6/6

==============================================================

The difference between "no monitor" and "1280x1024-32 at 57Hz" is that in
the latter case, framebuffer scanout, which sends the picture over DVI
57 times per second, drains some of the precious memory bandwidth.
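
For a sense of scale, the scanout traffic can be estimated from the mode
alone. The sketch below assumes 4 bytes per pixel (reading the "-32" in
the mode name as 32 bits per pixel) and ignores blanking intervals and
any bus overhead, so it is only a rough figure. The function name is
made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes the display controller has to read from memory each second. */
uint64_t scanout_bytes_per_second(uint32_t width, uint32_t height,
                                  uint32_t bits_per_pixel,
                                  uint32_t refresh_hz)
{
    assert(bits_per_pixel % 8 == 0);
    return (uint64_t)width * height * (bits_per_pixel / 8) * refresh_hz;
}
```

For 1280x1024 at 32 bits per pixel and 57Hz this gives 298844160 bytes
per second, i.e. roughly 300 MB/s that is not available to the CPU.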

As can be seen from the lowlevel-blt-bench numbers, 2-pass processing
is faster in terms of CPU cycles (~15% speedup) when all the data is in
L1/L2 cache. Even though we need to save/reload data from the temporary
buffer, splitting the processing in two parts has other performance
benefits: more flexibility in selecting the level of unrolling for each
part separately, and easier register allocation. However, when the
working set exceeds the cache sizes (the M: number in
lowlevel-blt-bench), 2-pass processing loses: the 1-pass fast path is
~20% faster there.

Next, let's look at the cairo-perf-trace results as a more realistic
benchmark. 2-pass processing is also a performance loss there, but
only by ~4-5%.
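
The percentages can be re-derived from the tables with a small helper.
This is just arithmetic on the numbers posted above; the helper name is
made up for this example, and the first column of the cairo-perf-trace
output is used as the time. Note that the result for the M: case depends
on the baseline: 2-pass throughput is about 17% lower than 1-pass, which
is the same thing as 1-pass being about 20% faster than 2-pass.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change of `after` versus `before`, in percent. */
double percent_change(double before, double after)
{
    assert(before > 0.0);
    return (after / before - 1.0) * 100.0;
}

/* Print the 1-pass vs 2-pass comparison for the "no monitor" run. */
void print_no_monitor_summary(void)
{
    printf("L1 throughput, 2-pass vs 1-pass: %+.1f%%\n",
           percent_change(55.27, 63.50));
    printf("L2 throughput, 2-pass vs 1-pass: %+.1f%%\n",
           percent_change(50.00, 58.10));
    printf("M  throughput, 2-pass vs 1-pass: %+.1f%%\n",
           percent_change(42.42, 35.33));
    printf("M  throughput, 1-pass vs 2-pass: %+.1f%%\n",
           percent_change(35.33, 42.42));
    printf("firefox-fishtank time, 2-pass vs 1-pass: %+.1f%%\n",
           percent_change(343.751, 362.394));
}
```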

The root cause of the problem in this particular case is that the two
passes load the memory subsystem unevenly. "bilinear src_8888_8888"
(reading from RAM, writing to a temporary buffer in L1 cache) is heavy
on computations, while "unscaled over_8888_8_8888" is memory heavy
(reading the scaled source pixels from L1 cache, but with mask and
destination in RAM).

If we were using the pixman general pipeline for this operation (with a
SIMD optimized bilinear fetcher), we would also have mask expansion
from 8-bit to 32-bit happening in a temporary buffer, which both
adds extra computations and makes the memory access pattern worse. But
I have no performance data to support this claim yet.
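
As an illustration of what such a mask expansion step involves, here is
a minimal sketch. It assumes that each 8-bit mask value ends up in the
alpha byte of a 32-bit pixel; the exact layout and the function names
are assumptions for this example, not a copy of the pixman code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand n a8 mask pixels into 32-bit pixels (alpha in the top byte). */
void expand_a8_to_8888(uint32_t *out, const uint8_t *mask, size_t n)
{
    assert(out != NULL && mask != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = (uint32_t)mask[i] << 24;
}

/* Size of the temporary buffer needed for n expanded mask pixels. */
size_t expanded_mask_bytes(size_t n)
{
    return n * sizeof(uint32_t);
}
```

The expanded mask takes four times the space of the original a8 data
and has to be written and then read back, on top of the temporary
buffer already used for the scaled source pixels.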

That said, caring about these implementation details can only give
us a few extra percent of performance, while hitting a slow path is
always a big performance disaster. Generalizing bilinear scaling
support so that there are no slow paths is more important.

-- 
Best regards,
Siarhei Siamashka