[Pixman] [PATCH] Move fallback decisions from implementations into pixman-cpu.c.

Wed Jan 26 10:41:54 PST 2011

Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:

> By the way, fallbacks may need to be tweaked a bit to be sure that
> really the fastest code is selected. For example, after the introduction
> of faster fetchers [1], now the performance of 'over_8888_8888'
> operation (translucent case) with the nearest scaling of the source
> image from 2000x2000 to approximately same size looks like this:
> 
> C fetcher + C combiner:     ~76.71 MPix/s
> C fast path:                ~106.47 MPix/s
> C fetcher + SSE2 combiner:  ~182.65 MPix/s
> SSE2 fast path:             ~270.57 MPix/s
> 
> The important part is that "C fetcher + SSE2 combiner" is now faster
> than full "C fast path", which performs everything in a single pass, but
> without SSE2. So when SSE2 combiners are available, it makes sense to
> deactivate nearest scaling C fast paths for OVER operator. Not everything
> is so simple though, because this C fast path has special processing for
> fully transparent or opaque pixels, and may in some cases be actually
> better. But it is clearly slower if the performance of the worst case is
> important.
> 
> Something similar may apply to PPC Altivec optimizations. Because there
> are only Altivec combiners but not full Altivec fast paths, the existing
> C fast paths will be executed for some compositing operations, preventing
> the use of Altivec combiners.

Indeed, this is increasingly a problem. Another variation is if there
is a SIMD fast path that can deal with affine transformations, and a
general fast path specialized for scaling transformations. Which one
do you pick for a scaling operation? It isn't obvious.

Or, suppose there is an MMX fast path for over_8888_8888, but also a
"noop" source iterator for a8r8g8b8. Then the general path would
degenerate to just calling sse2_combine_over_u() in a loop with no
fetching, which is likely faster than mmx_composite_over_8888_8888().
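
To make the degenerate case concrete, here is a minimal sketch of a
general-path loop whose source iterator does no fetching. Everything in
it is hypothetical: iter_t, noop_get_scanline() and
general_composite_over() are illustrative names, not pixman's actual
interfaces, and the plain C combine_over_u() only stands in for a SIMD
combiner such as sse2_combine_over_u(). It assumes premultiplied
a8r8g8b8 pixels and non-negative strides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical source iterator; not pixman's real iterator type. */
typedef struct iter iter_t;
struct iter
{
    const uint32_t *bits;    /* first pixel of the a8r8g8b8 image */
    int             stride;  /* row pitch in uint32_t units */
    int             y;       /* next row to hand out */
    const uint32_t *(*get_scanline) (iter_t *iter);
};

/* "Noop" fetcher: the image is already a8r8g8b8, so return a pointer
 * straight into it instead of copying the row into a scratch buffer. */
static const uint32_t *
noop_get_scanline (iter_t *iter)
{
    const uint32_t *row = iter->bits + (size_t) iter->y * iter->stride;

    iter->y++;
    return row;
}

/* x * a / 255 with rounding, for 8-bit x and a. */
static uint32_t
mul_un8 (uint32_t x, uint32_t a)
{
    uint32_t t = x * a + 0x80;

    return ((t >> 8) + t) >> 8;
}

/* Plain C stand-in for a SIMD OVER combiner:
 * dest = src + dest * (255 - alpha (src)) / 255, per channel. */
static void
combine_over_u (uint32_t *dest, const uint32_t *src, int width)
{
    int i, shift;

    for (i = 0; i < width; i++)
    {
        uint32_t ia = 255 - (src[i] >> 24);
        uint32_t result = 0;

        for (shift = 0; shift < 32; shift += 8)
        {
            uint32_t s = (src[i] >> shift) & 0xff;
            uint32_t d = (dest[i] >> shift) & 0xff;
            uint32_t c = s + mul_un8 (d, ia);

            if (c > 255)
                c = 255;    /* saturate on non-premultiplied input */
            result |= c << shift;
        }
        dest[i] = result;
    }
}

/* General path with a noop source: one combiner call per scanline,
 * no intermediate source buffer. */
static void
general_composite_over (iter_t *src, uint32_t *dest, int dest_stride,
                        int width, int height)
{
    int y;

    assert (width >= 0 && height >= 0);
    for (y = 0; y < height; y++)
    {
        const uint32_t *s = src->get_scanline (src);

        combine_over_u (dest + (size_t) y * dest_stride, s, width);
    }
}
```

The point of the sketch is only the shape of the loop: once the fetch
step costs nothing, the general path is one combiner call per
scanline, so its speed is whatever the combiner's speed is.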

I don't have a good answer, but here are some random ideas:

- Classify compositing routines in terms of performance. I.e., try and
  statically associate each compositing routine and fetcher with a
  number, then pick the one that should be fastest.

- If the fast path cache is empty for the operation in question, try
  all possibilities and store the best one in the cache. This will
  probably be unpleasant and complicated.

- Tweak the fallbacks so that things work in practice. This is ugly
  and fragile.

- JIT compiler. This is a good solution, but a lot of work.
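
As a rough illustration of the first idea, here is a minimal sketch of
a static ranking. It simplifies by scoring whole pipelines rather than
individual routines and fetchers, and it reuses the MPix/s figures
quoted above as the scores. The names pipeline_t, FEATURE_SSE2 and
pick_fastest() are hypothetical and not part of pixman.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU feature bit. */
#define FEATURE_SSE2 (1u << 0)

typedef struct
{
    const char *name;
    unsigned    required;    /* CPU features the pipeline needs */
    double      mpix_per_s;  /* static score; higher is better */
} pipeline_t;

/* Scores are the quoted benchmark figures for over_8888_8888 with
 * nearest scaling; a real table would need one entry per operation. */
static const pipeline_t pipelines[] =
{
    { "C fetcher + C combiner",    0,             76.71 },
    { "C fast path",               0,            106.47 },
    { "C fetcher + SSE2 combiner", FEATURE_SSE2, 182.65 },
    { "SSE2 fast path",            FEATURE_SSE2, 270.57 },
};

/* Return the highest-scoring pipeline the CPU can run, or NULL if
 * none of the candidates is usable. */
static const pipeline_t *
pick_fastest (const pipeline_t *p, size_t n, unsigned cpu_features)
{
    const pipeline_t *best = NULL;
    size_t i;

    assert (p != NULL || n == 0);
    for (i = 0; i < n; i++)
    {
        if ((p[i].required & ~cpu_features) != 0)
            continue;   /* needs a feature this CPU lacks */
        if (best == NULL || p[i].mpix_per_s > best->mpix_per_s)
            best = &p[i];
    }
    return best;
}
```

With FEATURE_SSE2 set and all four entries present this picks the SSE2
fast path. If that entry is left out, it picks "C fetcher + SSE2
combiner" over the C fast path, which is the selection the quoted
numbers argue for. A single static score cannot capture the caveat
that the C fast path may win on mostly opaque or transparent images.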

It isn't a huge problem yet, but it will only get worse.

Soren