[cairo] Performance of the refactored Pixman

Siarhei Siamashka siarhei.siamashka at gmail.com
Sun Jul 12 03:26:04 PDT 2009


On Friday 10 July 2009, Soeren Sandmann wrote:
> Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:
[...]
> > Old xorg-server 1.3 (its XRender software implementation, which
> > later became pixman) had dispatch code consisting of a large switch,
> > and it also used structure pointers to pass data around. Now this
> > has changed to a linear search in tables, delegates, and passing
> > data as real function arguments. These changes were probably
> > justified by better aesthetics, maintainability, etc.
> >
> > But there is an interesting question: how much is too much, and
> > what performance loss is acceptable to justify the other benefits?
>
> Part of the justification for those changes was certainly
> maintainability, but it's just not true that those changes were made
> with no regard for performance, as you imply. I'll go over each one:
>
> * Large switch.
>
> The large switch, also known as the "Switch of Doom", started out as
> a small switch that just selected between a small number of common
> fast paths. Over time it grew to contain MMX, SSE and VMX code and
> got littered with #ifdefs and special cases, to the point where no
> one could understand what was going on.
>
> When I replaced it with a linear search in a table, the plan was
> always that the search could be sped up if necessary. But the cairo
> performance test suite showed no changes for any of the benchmarks,
> except for two, which got much *faster* because the linear search
> found a fast path that had fallen through the cracks in the maze of
> switch/case/if clauses.
>
> So getting rid of the Switch of Doom was a performance *improvement*,
> and one that came as a direct result of more maintainable and
> therefore more correct code.

The problem with the cairo-perf benchmark is that the minimal image size
it uses is 16x16. To simulate real operations on font glyphs, it would
make sense to also try something like 8x8 or even smaller. Also, cairo
has its own overhead, so it takes a bit longer to reach the real blitter
code.

So minor regressions that show up in cairo-perf as just a few percent
may actually amount to tens of percent when dealing with real font
glyphs.

That's why this needs a bit more investigation and maybe some more suitable
benchmarks.

> * Structure pointers to pass data around
>
> A structure pointer was used in 1.3 to pass data between
> fbCompositeGeneral() and fbCompositeRect(), and nowhere else.
>
> Since fbCompositeRect() was getting called once per rectangle in the
> composite region, the total effect in the common case of one rectangle
> was to make a gratuitous copy of the arguments on the stack.
>
> Moreover, fbCompositeRect() was only called once, so the compiler
> would inline it, but then fail to undo the argument copying. So in
> *no* case was there any conceivable benefit from this, and there was
> quite significant harm done by preventing register allocation of the
> variables.
>
> This is the kind of brain damage you'll get if you apply
> "optimizations" based on vague, unquantifiable guesswork.

That comment of mine was based only on information from Jonathan. I had
hoped that he would submit some kind of reproducible benchmark so that
we could verify it.

> * Delegates
>
> As I noted here:
>
>     http://lists.cairographics.org/archives/cairo/2009-May/017210.html
>
> I couldn't measure any real difference with the cairo performance test
> suite.
>
> There are more calls now, but at least on x86, gcc turns them into
> tail calls that reuse the arguments on the stack, so the impact is
> low. This may be different on ARM, which is why I said that I'd take
> a patch to add the argument structs if a performance improvement
> could be demonstrated.

On ARM, the first 4 arguments are passed in registers and the rest go
on the stack, so it should behave much like x86-64. And if there is any
performance regression, it should be visible everywhere (unless the
compiler is a bit less clever on some platforms than on others).
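
To illustrate, the delegate pattern is roughly the following (a sketch
with made-up names, loosely modeled on pixman's implementation structs).
Since the forwarding call passes the arguments through essentially
unchanged, gcc at -O2 can usually emit it as a sibling (tail) call, i.e.
a plain jump with no new stack frame, on both x86 and ARM:

    /* Sketch of a delegate chain; all names are hypothetical. */
    typedef struct implementation implementation_t;

    typedef void (*composite_t) (implementation_t *imp, int op,
                                 int src_fmt, int dst_fmt,
                                 int width, int height);

    struct implementation
    {
        implementation_t *delegate;   /* next, more general implementation */
        composite_t       composite;
    };

    /* Hypothetical capability check. */
    static int
    sse2_has_fast_path (int op, int src_fmt, int dst_fmt)
    {
        (void) op; (void) src_fmt; (void) dst_fmt;
        return 0;
    }

    static void
    sse2_composite (implementation_t *imp, int op,
                    int src_fmt, int dst_fmt, int width, int height)
    {
        if (!sse2_has_fast_path (op, src_fmt, dst_fmt))
        {
            /* Forward to the more general implementation.  Since the
             * arguments are reused almost verbatim, gcc normally
             * compiles this as a tail call rather than a nested call. */
            imp->delegate->composite (imp->delegate, op, src_fmt,
                                      dst_fmt, width, height);
            return;
        }

        /* ... SSE2 fast path for the supported cases ... */
    }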

All of these things are actually quite easy to verify with proper
benchmarks. Taking a look at pixman call overhead is still on my TODO
list, though with rather low priority.

> > Does it make sense to add benchmark code which really stresses
> > pixman's internal dispatch logic? Something working with extremely
> > small images (1x1 in the extreme case) would be good for simulating
> > the worst case. Or is it preferable to have benchmarks which try to
> > simulate some problematic, but still real, use cases?
>
> Benchmarks in general are much better than vague guessing about what
> may or may not be fast.
>
> I tend to prefer benchmarks that simulate a real use case, but even
> micro benchmarks can be useful as long as we don't fall into the trap
> of optimizing some corner case that is completely swamped by other things.

Agreed.
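
As a starting point, even something as simple as the following would
stress the dispatch path. This is an untested sketch against the public
pixman API; clock() timing is crude but enough for relative
comparisons. Build with something like
gcc bench.c $(pkg-config --cflags --libs pixman-1).

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>
    #include <pixman.h>

    int
    main (void)
    {
        uint32_t src_bits = 0x80ff0000;   /* one premultiplied ARGB pixel */
        uint32_t dst_bits = 0xff0000ff;
        pixman_image_t *src, *dst;
        const int n = 1000000;
        clock_t t1, t2;
        int i;

        src = pixman_image_create_bits (PIXMAN_a8r8g8b8, 1, 1, &src_bits, 4);
        dst = pixman_image_create_bits (PIXMAN_a8r8g8b8, 1, 1, &dst_bits, 4);

        t1 = clock ();
        for (i = 0; i < n; i++)
        {
            /* With a 1x1 image, almost all time is spent in argument
             * passing and fast path lookup, not in pixel processing. */
            pixman_image_composite (PIXMAN_OP_OVER, src, NULL, dst,
                                    0, 0, 0, 0, 0, 0, 1, 1);
        }
        t2 = clock ();

        printf ("%.0f composites/sec\n",
                (double) n / ((double) (t2 - t1) / CLOCKS_PER_SEC));

        pixman_image_unref (src);
        pixman_image_unref (dst);
        return 0;
    }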

-- 
Best regards,
Siarhei Siamashka

