[Pixman] fast-scale branch performance improvements

Wed Mar 17 00:44:37 PDT 2010

On Tuesday 16 March 2010, Alexander Larsson wrote:
> Further simplifying by removing support for unit_x < 0 (i.e. mirrored
> scaling) gives:
>
> == nearest tiled SRC ==
> op=1, src_fmt=20028888, dst_fmt=20028888, speed=655.11 MPix/s (156.19 FPS)
> op=1, src_fmt=20028888, dst_fmt=10020565, speed=136.02 MPix/s (32.43 FPS)
> op=1, src_fmt=10020565, dst_fmt=10020565, speed=619.16 MPix/s (147.62 FPS)

Just to make the use case clear: NORMAL repeat is important for the
browser (zooming tiled backgrounds). The image quality of NEAREST scaling is
quite bad, but if the CPU is slow, it becomes the poor man's choice.

If the code replicates a very small tile, almost no cache misses are expected
when accessing the source image, and the performance should be comparable to
a solid fill if memory bandwidth is the limiting factor. But the CPU is
obviously the bottleneck here for both Intel Core2 and ARM Cortex-A8. Writes
to memory are nonblocking from the CPU's point of view and go through the
write combining buffer. As long as the CPU can't process data fast enough
(and this is the case for the 'nearest tiled SRC' test), memory writes are
mostly transparent and don't affect the performance.

For ordinary scaling, on Intel Core2 (x86-64) I get an amazing result showing
that the MPix/s rating is the same for both nonscaled and scaled blits using
the a8r8g8b8 color format. But for the r5g6b5 format, the nearest scaler falls
a bit behind. Trying 8bpp would increase this gap even more. So the nearest
scaler can be memory bandwidth limited if hardware prefetch works efficiently
and the CPU is much faster than memory. For lower bpp color formats,
optimizing CPU usage still remains important.

For ARM Cortex-A8, scaling is always slower than a simple blit, but it may
need tweaking and adding explicit prefetch. Also, the NEON unit can access
memory much faster, so it would be practically impossible for scaling to
reach the same performance as a simple blit.

> It might be interesting to duplicate the inner loop, once for each
> unit_x sign to get this performance increase always.

Duplicating the inner loop is easy using the trick with an always_inline
function, described here:
http://lists.freedesktop.org/archives/pixman/2010-March/000111.html
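
A minimal sketch of that trick, assuming 16.16 fixed point coordinates and a
32bpp source. The names scale_line_template, scale_line_forward,
scale_line_backward, scale_line and the FORCE_INLINE macro are hypothetical
and are not taken from the pixman sources. The 'forward' argument is a
compile time constant in each wrapper, so the compiler can drop the unused
boundary check from each copy of the loop. The starting vx is assumed to be
already inside [0, src_width << 16).

```c
#include <stdint.h>

#if defined(__GNUC__)
#define FORCE_INLINE static inline __attribute__((always_inline))
#else
#define FORCE_INLINE static inline
#endif

typedef int32_t fixed_t; /* 16.16 fixed point (assumption) */

/* Generic body: 'forward' is constant at every call site below. */
FORCE_INLINE void
scale_line_template(uint32_t *dst, const uint32_t *src, int width,
                    fixed_t vx, fixed_t unit_x, int src_width, int forward)
{
    const fixed_t max_vx = (fixed_t)src_width * 65536;

    while (width-- > 0) {
        *dst++ = src[vx >> 16];
        vx += unit_x;
        if (forward) {
            /* unit_x >= 0: only the upper boundary can be crossed */
            while (vx >= max_vx)
                vx -= max_vx;
        } else {
            /* unit_x < 0: only the lower boundary can be crossed */
            while (vx < 0)
                vx += max_vx;
        }
    }
}

/* Two specialized copies of the loop, one per sign of unit_x. */
static void
scale_line_forward(uint32_t *dst, const uint32_t *src, int width,
                   fixed_t vx, fixed_t unit_x, int src_width)
{
    scale_line_template(dst, src, width, vx, unit_x, src_width, 1);
}

static void
scale_line_backward(uint32_t *dst, const uint32_t *src, int width,
                    fixed_t vx, fixed_t unit_x, int src_width)
{
    scale_line_template(dst, src, width, vx, unit_x, src_width, 0);
}

/* NORMAL repeat, nearest: pick the specialized loop once per scanline. */
void
scale_line(uint32_t *dst, const uint32_t *src, int width,
           fixed_t vx, fixed_t unit_x, int src_width)
{
    if (unit_x >= 0)
        scale_line_forward(dst, src, width, vx, unit_x, src_width);
    else
        scale_line_backward(dst, src, width, vx, unit_x, src_width);
}
```

The sign check is paid once per scanline instead of once per pixel, and the
source code of the loop body still exists in only one place.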

But if fast scaling with REFLECT repeat is wanted (the kind which checks only
the lower or the upper boundary per step, not both), it has to be implemented
with something like a DFA. The function body would be duplicated for the
cases of walking forward (unit_x >= 0) and walking backwards (unit_x < 0).
When a source image boundary is crossed, unit_x can be negated and vx updated,
and then a switch to a different state (the block of code walking in the
opposite direction) can be performed using a goto statement. This makes the
code a bit more complex, but ensures the best performance. Due to walking in
both horizontal and vertical directions, a total of 4 slightly different
blocks of code may be required. Or vertical walking can just always check both
boundaries, as it is less important for performance.
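
A rough sketch of the horizontal part of such a state machine, under several
assumptions of mine: 16.16 fixed point, a 32bpp source, |unit_x| not larger
than the source width (so one reflection per step is enough), a starting vx
inside [0, src_width << 16), and a mirroring convention where the edge pixel
is repeated (0 1 2 3 3 2 1 0 ...). The function name scale_line_reflect is
hypothetical. Vertical walking is not shown.

```c
#include <stdint.h>

typedef int32_t fixed_t; /* 16.16 fixed point (assumption) */

void
scale_line_reflect(uint32_t *dst, const uint32_t *src, int width,
                   fixed_t vx, fixed_t unit_x, int src_width)
{
    const fixed_t max_vx = (fixed_t)src_width * 65536;

    if (width <= 0)
        return;
    if (unit_x < 0)
        goto backward;

forward:
    /* State 1: unit_x >= 0, only the upper boundary is checked. */
    for (;;) {
        *dst++ = src[vx >> 16];
        if (--width == 0)
            return;
        vx += unit_x;
        if (vx >= max_vx) {
            vx = 2 * max_vx - 1 - vx; /* mirror around the upper edge */
            unit_x = -unit_x;
            goto backward;
        }
    }

backward:
    /* State 2: unit_x < 0, only the lower boundary is checked. */
    for (;;) {
        *dst++ = src[vx >> 16];
        if (--width == 0)
            return;
        vx += unit_x;
        if (vx < 0) {
            vx = -1 - vx; /* mirror around the lower edge */
            unit_x = -unit_x;
            goto forward;
        }
    }
}
```

Each state performs a single comparison per pixel, and the more expensive
reflection code only runs when a boundary is actually crossed.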

-- 
Best regards,
Siarhei Siamashka