[Pixman] [PATCH] Move fallback decisions from implementations into pixman-cpu.c.

Tue Feb 1 07:53:57 PST 2011

On Wednesday 26 January 2011 20:41:54 Soeren Sandmann wrote:
> Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:
> > By the way, fallbacks may need to be tweaked a bit to be sure that
> > really the fastest code is selected. For example, after the introduction
> > of faster fetchers [1], now the performance of 'over_8888_8888'
> > operation (translucent case) with the nearest scaling of the source
> > image from 2000x2000 to approximately same size looks like this:
> > 
> > C fetcher + C combiner:     ~76.71 MPix/s
> > C fast path:                ~106.47 MPix/s
> > C fetcher + SSE2 combiner:  ~182.65 MPix/s
> > SSE2 fast path:             ~270.57 MPix/s
> > 
> > The important part is that "C fetcher + SSE2 combiner" is now faster
> > than full "C fast path", which performs everything in a single pass, but
> > without SSE2. So when SSE2 combiners are available, it makes sense to
> > deactivate nearest scaling C fast paths for OVER operator. Not everything
> > is so simple though, because this C fast path has special processing for
> > fully transparent or opaque pixels, and may in some cases be actually
> > better. But it is clearly slower if the performance of the worst case is
> > important.
> > 
> > Something similar may apply to PPC Altivec optimizations. Because there
> > are only Altivec combiners but not full Altivec fast paths, the existing
> > C fast paths will be executed for some compositing operations, preventing
> > the use of Altivec combiners.
> 
> Indeed, this is increasingly a problem. Another variation is if there
> is a SIMD fast path that can deal with affine transformations, and a
> general fast path specialized for scaling transformations. Which one
> do you pick for a scaling operation? It isn't obvious.
> 
> Or, suppose there is an MMX fast path for over_8888_8888, but also a
> "noop" source iterator for a8r8g8b8. Then the general path would
> degenerate to just calling sse2_combine_over_u() in a loop with no
> fetching, which is likely faster than mmx_composite_over_8888_8888();
> 
> I don't have a good answer, but here are some random ideas:
> 
> - Classify compositing routines in terms of performance. Ie., try and
>   statically associate each compositing routine and fetcher with a
>   number, then pick the one that should be fastest.

This is more or less reasonable, though it may be a bit too complicated
because we have compositing routines (complete fast paths), fetchers and
combiners. So the selection logic would have to compare a single complete
fast path (A) against a fetcher and combiner pair (B + C):
    A or B + C

Additionally, the relative numbers may be different for different hardware.
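
A minimal sketch of such a static ranking, only to illustrate the
"A or B + C" comparison. All type and function names here are hypothetical
and are not part of pixman, the cost values a caller supplies would be
placeholders, and the sketch assumes that the costs of the stages simply
add up, which ignores the extra pass over the intermediate scanline buffer:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical static cost estimate for one routine; lower is faster. */
typedef struct {
    const char *name;
    double      cost_per_pixel;
} routine_cost_t;

/*
 * One way to perform an operation: either a complete fast path alone (A),
 * with 'second' set to NULL, or a fetcher followed by a combiner (B + C).
 */
typedef struct {
    const routine_cost_t *first;
    const routine_cost_t *second;
} candidate_t;

/* Assumption: the costs of the stages simply add up. */
static double
candidate_cost (const candidate_t *c)
{
    double cost = c->first->cost_per_pixel;

    if (c->second)
        cost += c->second->cost_per_pixel;
    return cost;
}

/* Return the index of the cheapest candidate; n must be at least 1. */
static size_t
pick_fastest (const candidate_t *candidates, size_t n)
{
    size_t best = 0;
    size_t i;

    assert (n > 0);
    for (i = 1; i < n; i++)
    {
        if (candidate_cost (&candidates[i]) < candidate_cost (&candidates[best]))
            best = i;
    }
    return best;
}
```

Because the relative numbers differ between hardware, a table like this
would likely need to be filled in per CPU type rather than hardcoded once.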

> - If the fast path cache is empty for the operation in question, try
>   all possibilities and store the best one in the cache. This will
>   probably be unpleasant and complicated.

Running benchmarks at runtime? That does indeed sound a bit unpleasant.

> - Tweak the fallbacks so that things work in practice. This is ugly
>   and fragile.

I don't quite understand this part. Is it about adding a check to each C fast
path function that detects whether a non-C combiner is available and falls
through in that case?
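
A rough sketch of what such a per-function check could look like. The
context structure, the function name and the boolean return value are all
hypothetical and only serve the illustration; they do not reflect the
actual pixman fast path signature:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the state a fast path could inspect. */
typedef struct {
    bool have_simd_combiner;  /* e.g. an SSE2 or Altivec combiner exists */
    int  pixels_composited;   /* bookkeeping for this sketch only */
} composite_ctx_t;

/*
 * Returns true if the C fast path handled the request, false if the caller
 * should fall through to the general fetcher + combiner path.
 */
static bool
c_nearest_over_fast_path (composite_ctx_t *ctx, int n_pixels)
{
    /* A SIMD combiner is expected to beat this single pass C loop. */
    if (ctx->have_simd_combiner)
        return false;

    /* The real single pass C loop would run here. */
    ctx->pixels_composited += n_pixels;
    return true;
}
```

Repeating this check in every affected C fast path is presumably what makes
the approach fragile.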

> - JIT compiler. This is a good solution, but a lot of work.

Or in other words, make sure that the fast paths from the implementation at the
top of the delegate chain (SSE2, NEON, ...) are always a superset of any slower
implementation, so that we never need to fall back except in the cases when the
next implementation down the delegate chain is equally fast for this particular
operation.

> It isn't a huge problem yet, but it will only get worse.

I'm afraid it's already a problem with some of my upcoming patches. I have been
thinking about splitting the 'c_fast_paths' array into parts. For example, all
the C nearest fast paths for the OVER operator could be moved into a separate
fast path array, which would only be used when SIMD optimized combiners are not
available.
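
A minimal sketch of this split, assuming a simplified fast path entry that
only carries a name. The array names, the entry names and the
'select_c_fast_paths' helper are hypothetical, not existing pixman code:

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a fast path table entry. */
typedef struct {
    const char *name;  /* NULL terminates a table */
} fast_path_t;

/* C fast paths that stay enabled regardless of SIMD support. */
static const fast_path_t c_fast_paths_common[] = {
    { "src_8888_8888" },
    { NULL }
};

/* C nearest scaling fast paths for OVER; only useful without SIMD combiners. */
static const fast_path_t c_fast_paths_nearest_over[] = {
    { "nearest_over_8888_8888" },
    { NULL }
};

/*
 * Fill 'tables' with the C fast path arrays that should be searched and
 * return how many were selected (at most 2).
 */
static size_t
select_c_fast_paths (bool have_simd_combiners, const fast_path_t *tables[2])
{
    size_t n = 0;

    tables[n++] = c_fast_paths_common;
    if (!have_simd_combiners)
        tables[n++] = c_fast_paths_nearest_over;
    return n;
}
```

With this layout the decision is made once, when the implementation chain
is set up, instead of inside each fast path function.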

-- 
Best regards,
Siarhei Siamashka