[Pixman] [PATCH] Move fallback decisions from implementations into pixman-cpu.c.
siarhei.siamashka at gmail.com
Tue Feb 1 07:53:57 PST 2011
On Wednesday 26 January 2011 20:41:54 Soeren Sandmann wrote:
> Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:
> > By the way, fallbacks may need to be tweaked a bit to be sure that
> > really the fastest code is selected. For example, after the introduction
> > of faster fetchers, now the performance of 'over_8888_8888'
> > operation (translucent case) with the nearest scaling of the source
> > image from 2000x2000 to approximately same size looks like this:
> > C fetcher + C combiner: ~76.71 MPix/s
> > C fast path: ~106.47 MPix/s
> > C fetcher + SSE2 combiner: ~182.65 MPix/s
> > SSE2 fast path: ~270.57 MPix/s
> > The important part is that "C fetcher + SSE2 combiner" is now faster
> > than full "C fast path", which performs everything in a single pass, but
> > without SSE2. So when SSE2 combiners are available, it makes sense to
> > deactivate nearest scaling C fast paths for OVER operator. Not everything
> > is so simple though, because this C fast path has special processing for
> > fully transparent or opaque pixels, and may in some cases be actually
> > better. But it is clearly slower if the performance of the worst case is
> > important.
> > Something similar may apply to PPC Altivec optimizations. Because there
> > are only Altivec combiners but not full Altivec fast paths, the existing
> > C fast paths will be executed for some compositing operations, preventing
> > the use of Altivec combiners.
> Indeed, this is increasingly a problem. Another variation is if there
> is a SIMD fast path that can deal with affine transformations, and a
> general fast path specialized for scaling transformations. Which one
> do you pick for a scaling operation? It isn't obvious.
> Or, suppose there is an MMX fast path for over_8888_8888, but also a
> "noop" source iterator for a8r8g8b8. Then the general path would
> degenerate to just calling sse2_combine_over_u() in a loop with no
> fetching, which is likely faster than mmx_composite_over_8888_8888().
> I don't have a good answer, but here are some random ideas:
> - Classify compositing routines in terms of performance. I.e., try and
> statically associate each compositing routine and fetcher with a
> number, then pick the one that should be fastest.
This is more or less reasonable, though it may be a bit too complicated
because we have compositing routines (complete fast paths), fetchers and
combiners. So it would have to make decisions about what is better:
A or B + C
Additionally, the relative numbers may be different for different hardware.
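To make the "A or B + C" decision concrete, here is a minimal sketch in C. It is not pixman code: routine_t, choice_t and choose_path() are hypothetical names, and the idea that the cost of a fetcher and a combiner can simply be added is an assumption made for illustration. The cost numbers would also have to be tuned per hardware, as noted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical static cost estimate per routine (lower is faster).
 * Not part of the pixman API; the numbers would be hardware specific. */
typedef struct {
    const char *name;
    unsigned    cost;
} routine_t;

typedef enum {
    USE_FAST_PATH,
    USE_FETCHER_AND_COMBINER
} choice_t;

/* Decide between A (a complete fast path) and B + C (fetcher + combiner).
 * fast_path may be NULL when no single-pass routine exists.
 * Assumption: the two-pass cost is the plain sum of both parts. */
static choice_t
choose_path (const routine_t *fast_path,
             const routine_t *fetcher,
             const routine_t *combiner)
{
    assert (fetcher != NULL && combiner != NULL);

    if (fast_path == NULL)
        return USE_FETCHER_AND_COMBINER;

    return fast_path->cost <= fetcher->cost + combiner->cost
        ? USE_FAST_PATH
        : USE_FETCHER_AND_COMBINER;
}
```

With a C fast path rated slower than "C fetcher + SSE2 combiner", this would pick the two-pass route, which matches the benchmark ordering quoted earlier.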
> - If the fast path cache is empty for the operation in question, try
> all possibilities and store the best one in the cache. This will
> probably be unpleasant and complicated.
Running benchmarks at runtime? Indeed sounds a bit unpleasant.
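For reference, the "try all possibilities and store the best one" idea could look roughly like the sketch below. Everything here is hypothetical (cache_entry_t, lookup_or_benchmark(), the measure callback); the real fast path cache in pixman is organized differently, and demo_measure() just returns fixed placeholder values instead of timing anything.

```c
#include <assert.h>
#include <stddef.h>

/* Callback returning the measured time for one candidate (lower is better). */
typedef double (*measure_fn) (size_t candidate);

/* Hypothetical cache slot for a single operation. */
typedef struct {
    int    valid;
    size_t best;
} cache_entry_t;

/* On a cache miss, measure every candidate once and remember the fastest.
 * On a hit, return the stored winner without measuring again. */
static size_t
lookup_or_benchmark (cache_entry_t *entry,
                     size_t         n_candidates,
                     measure_fn     measure)
{
    assert (n_candidates > 0);

    if (!entry->valid)
    {
        size_t i, best = 0;
        double best_time = measure (0);

        for (i = 1; i < n_candidates; i++)
        {
            double t = measure (i);
            if (t < best_time)
            {
                best_time = t;
                best = i;
            }
        }
        entry->best = best;
        entry->valid = 1;
    }
    return entry->best;
}

/* Stand-in for a real benchmark: fixed placeholder values, not measurements. */
static int demo_measure_calls = 0;

static double
demo_measure (size_t candidate)
{
    static const double fake_times[] = { 3.0, 1.0, 2.0 };

    demo_measure_calls++;
    return fake_times[candidate];
}
```

The unpleasant part is exactly what the sketch hides: a real measure callback would have to run each candidate on representative data, and the first call for every operation pays for all of them.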
> - Tweak the fallbacks so that things work in practice. This is ugly
> and fragile.
I don't quite understand this part. Is it about adding a check in each C fast
path function, detecting whether we have a non-C combiner and falling through
in this case?
> - JIT compiler. This is a good solution, but a lot of work.
Or in other words, make sure that the fast paths from the implementation at the
top of the delegate chain (SSE2, NEON, ...) are always a superset of any slower
implementation and we never need to fall back. Except for the cases when the
next implementation down the delegate chain is equally fast for this particular
> It isn't a huge problem yet, but it will only get worse.
I'm afraid it's already a problem with some of my coming patches. I just
thought about splitting 'c_fast_paths' array into parts. For example, all the C
nearest fast paths for OVER operator could be moved into a separate fast path
array, which would be only used when not having SIMD optimized combiners
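A rough sketch of the split, to show what I mean. The table names, the entry names and select_c_fast_paths() are all made up for illustration, and a real table entry would carry the operator, formats, flags and function pointer rather than just a name.

```c
#include <assert.h>
#include <stddef.h>

#define N_ELEMENTS(a) (sizeof (a) / sizeof ((a)[0]))

/* Simplified, hypothetical table entry. */
typedef struct {
    const char *name;
} fast_path_t;

/* C fast paths that are always worth registering (placeholder entries). */
static const fast_path_t c_fast_paths_common[] = {
    { "add_8888_8888" },
    { "src_8888_8888" },
};

/* C nearest scaling fast paths for OVER: only useful when the combiners
 * are plain C as well (placeholder entries). */
static const fast_path_t c_fast_paths_nearest_over[] = {
    { "over_8888_8888_nearest" },
    { "over_8888_0565_nearest" },
};

typedef struct {
    const fast_path_t *table;
    size_t             count;
} fast_path_list_t;

/* Fill 'out' with the tables to search and return how many were selected. */
static size_t
select_c_fast_paths (int have_simd_combiners, fast_path_list_t out[2])
{
    size_t n = 0;

    out[n].table = c_fast_paths_common;
    out[n].count = N_ELEMENTS (c_fast_paths_common);
    n++;

    if (!have_simd_combiners)
    {
        out[n].table = c_fast_paths_nearest_over;
        out[n].count = N_ELEMENTS (c_fast_paths_nearest_over);
        n++;
    }
    return n;
}
```

This way the C nearest OVER paths drop out whenever SSE2 (or similar) combiners are present, and the general path with the faster combiner gets used instead.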