[Pixman] fast-scale branch performance improvements

Tue Mar 16 07:51:57 PDT 2010

On Tuesday 16 March 2010, Alexander Larsson wrote:
> On Tue, 2010-03-16 at 13:26 +0100, Soeren Sandmann wrote:
> > * This chunk:
> >
> >         if (unit_x >= 0)
> >             while ...
> >         else
> >             while ...
> >
> >   is repeated multiple times.
> >
> >   Would it be possible to simply call the existing 'repeat' inline,
> >   possibly extended with a 'unit' parameter?  If it is, then it seems
> >   like all the repeat modes could be supported fairly easily and the
> >   existing fast_composite_scale_nearest() could go away.
>
> I'm not sure about the performance advantages here, but the if is to
> handle the case that if we increase vx then we can avoid the check for
> vx going < 0. So, the two possible versions are:
>
> if (unit_x >= 0)
>     while (vx >= max_vx) vx -= max_vx;
> else
>     while (vx < 0) vx += max_vx;
>
> vs:
>
> while (vx >= max_vx) vx -= max_vx; else
> while (vx < 0) vx += max_vx;
>
> This later is what repeat() would expand to. It unnecessarily makes the
> vx < 0 comparison, but avoids the unit_x comparison. unit_x is constant
> over the whole function, so it might be cheaper. On the other hand it
> probably doesn't matter much.

As long as the value of 'unit_x' is not known at compile time, it is going to 
cause some performance slowdown (it increases registers pressure, adds a 
branch instruction which is admittedly well predictable).

A standard trick to optimize code like this is to convert function

void f(int unit_x, ...)
{
 if (unit_x >= 0)
     while (vx >= max_vx) vx -= max_vx;
 else
     while (vx < 0) vx += max_vx;
}

to something like this

void always_inline f_internal(int unit_is_nonnegative, int unit_x, ...)
{
 if (unit_is_nonnegative)
     while (vx >= max_vx) vx -= max_vx;
 else
     while (vx < 0) vx += max_vx;
}

void f(int unit_x, ...)
{
 if (unit_x >= 0)
     f_internal(1, unit_x, ...);
 else
     f_internal(0, unit_x, ...);
}

Regarding putting all the repeating constructs in small inline functions. I
think this should not be overused. There are common rules like "don't use
goto", "avoid duplicating parts of code", etc. But I don't think these should
be enforced no matter what, especially when performance is important.

If the goal is to get just a correctly working code, splitting repeatable
parts into separate functions and reusing them is great.

If the goal is to get fast code, then these repeated constructs are even a bit
more readable. The point is that they clearly show what is expensive and what
is not. And how the code is really executed. Additionally, when used in
different contexts, these repeated parts may be converted to some faster
equivalents probably in some varying ways depending on the code that is around
it. Repeated parts are not causing an extra reliability risk as long as they
are all well covered by automated regression tests.

One more possible nitpick about this code are the explicit shift ">> 16"
everywhere. They could be replaced with 'pixman_fixed_to_int', but it also
hides the internal implementation details somewhat (especially with the
regards to overflows safety).

The next step is optimizing these functions in assembly (especially bilinear
scaling). If the macros and inline functions are overused, then this code
becomes a bit difficult to use as a reference implementation and has to
be 'deobfuscated' first :-)

The benefit of having a simple code not spread across multilevel inline
function calls and macros is that it is easy to analyze what is generated
by the compiler, and probably even submit a bug to gcc if the code is
obviously poorly optimized.

In my branch, the macros are definitely underused and a lot of code can be
collapsed easily, but going to another extreme is also not very good :-)

Regarding the alex's branch and performance, I already mentioned that it was
much slower for over_8888_0565 case in my benchmark when compared against my
branch on ARM Cortex-A8 (the other cases of scaling are ok). I'm using the
following test program for benchmarking these optimizations:
http://cgit.freedesktop.org/~siamashka/pixman/commit/?h=test-n-bench&id=93ec60149cb3535f70a9e285de0b359ff444f26e

The test program tries to benchmark scaling of when source and destination
image sizes are approximately the same (and the performance can be more or
less directly compared to the simple nonscaled blit).

The results are (variance is only in the last digit):

op=3, src_fmt=20028888, dst_fmt=10020565, speed=5.06 MPix/s (1.21 FPS)
vs.
op=3, src_fmt=20028888, dst_fmt=10020565, speed=8.72 MPix/s (2.08 FPS)

which is quite a lot.

Well, I have written much more than originally intended. I'm going to step
aside for now.

-- 
Best regards,
Siarhei Siamashka