[Pixman] [PATCH] Move fallback decisions from implementations into pixman-cpu.c.
sandmann at cs.au.dk
Wed Jan 26 10:41:54 PST 2011
Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:
> By the way, fallbacks may need to be tweaked a bit to be sure that
> really the fastest code is selected. For example, after the introduction
> of faster fetchers, now the performance of 'over_8888_8888'
> operation (translucent case) with the nearest scaling of the source
> image from 2000x2000 to approximately same size looks like this:
> C fetcher + C combiner: ~76.71 MPix/s
> C fast path: ~106.47 MPix/s
> C fetcher + SSE2 combiner: ~182.65 MPix/s
> SSE2 fast path: ~270.57 MPix/s
> The important part is that "C fetcher + SSE2 combiner" is now faster
> than full "C fast path", which performs everything in a single pass, but
> without SSE2. So when SSE2 combiners are available, it makes sense to
> deactivate nearest scaling C fast paths for OVER operator. Not everything
> is so simple though, because this C fast path has special processing for
> fully transparent or opaque pixels, and may in some cases be actually
> better. But it is clearly slower if the performance of the worst case is
> Something similar may apply to PPC Altivec optimizations. Because there
> are only Altivec combiners but not full Altivec fast paths, the existing
> C fast paths will be executed for some compositing operations, preventing
> the use of Altivec combiners.
Indeed, this is increasingly a problem. Another variation is if there
is a SIMD fast path that can deal with affine transformations, and a
general fast path specialized for scaling transformations. Which one
do you pick for a scaling operation? It isn't obvious.
Or, suppose there is an MMX fast path for over_8888_8888, but also a
"noop" source iterator for a8r8g8b8. Then the general path would
degenerate to just calling sse2_combine_over_u() in a loop with no
fetching, which is likely faster than mmx_composite_over_8888_8888().
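As a rough illustration of that degenerate case, the sketch below shows a two-pass general path whose source iterator is a "noop": it returns a pointer straight into the image, so each scanline costs only the combiner call. Every name here (sketch_iter_t, noop_get_scanline, sketch_combine_over_u, sketch_general_composite) is hypothetical and is not pixman's internal API, and the combiner is plain C standing in for an SSE2 one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical source iterator: yields one scanline of premultiplied
 * a8r8g8b8 pixels per call. Not pixman's real iterator type. */
typedef struct sketch_iter sketch_iter_t;
struct sketch_iter {
    const uint32_t *bits;    /* source pixels, already a8r8g8b8 */
    size_t          stride;  /* row stride in pixels */
    size_t          y;       /* next row to hand out */
    uint32_t       *scratch; /* scanline buffer, used only when copying */
    const uint32_t *(*get_scanline)(sketch_iter_t *it, size_t width);
};

/* "noop" iterator: the format already matches, so nothing is fetched;
 * the caller gets a pointer directly into the image. */
static const uint32_t *noop_get_scanline(sketch_iter_t *it, size_t width)
{
    (void)width;
    return it->bits + it->stride * it->y++;
}

/* Ordinary iterator: the copy stands in for a real format conversion. */
static const uint32_t *copy_get_scanline(sketch_iter_t *it, size_t width)
{
    assert(it->scratch != NULL);
    memcpy(it->scratch, it->bits + it->stride * it->y++,
           width * sizeof(uint32_t));
    return it->scratch;
}

/* Rounded a * b / 255 for a, b in [0, 255]. */
static uint32_t mul_un8(uint32_t a, uint32_t b)
{
    uint32_t t = a * b + 0x80;
    return ((t >> 8) + t) >> 8;
}

/* OVER on premultiplied pixels: dest = src + dest * (255 - src_alpha) / 255.
 * Plain C here; an SSE2 combiner would do the same work on several pixels
 * at once. */
static void sketch_combine_over_u(uint32_t *dest, const uint32_t *src,
                                  size_t width)
{
    for (size_t i = 0; i < width; i++) {
        uint32_t s = src[i], d = dest[i], ia = 255 - (s >> 24), out = 0;
        for (int shift = 0; shift < 32; shift += 8) {
            uint32_t c = ((s >> shift) & 0xff)
                       + mul_un8((d >> shift) & 0xff, ia);
            out |= (c > 255 ? 255 : c) << shift;
        }
        dest[i] = out;
    }
}

/* General path: fetch a scanline, then combine it into the destination.
 * With the noop iterator this is just the combiner called in a loop. */
static void sketch_general_composite(sketch_iter_t *src, uint32_t *dest,
                                     size_t dest_stride,
                                     size_t width, size_t height)
{
    for (size_t y = 0; y < height; y++)
        sketch_combine_over_u(dest + dest_stride * y,
                              src->get_scanline(src, width), width);
}
```

The only difference between the two iterators is the per-scanline copy, which is the cost that disappears when the source is already a8r8g8b8.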
I don't have a good answer, but here are some random ideas:
- Classify compositing routines in terms of performance. I.e., try and
  statically associate each compositing routine and fetcher with a
  number, then pick the one that should be fastest.
- If the fast path cache is empty for the operation in question, try
  all possibilities and store the best one in the cache. This will
  probably be unpleasant and complicated.
- Tweak the fallbacks so that things work in practice. This is ugly
- JIT compiler. This is a good solution, but a lot of work.
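
The first idea could look roughly like the sketch below: each candidate carries a static score and the CPU features it needs, and selection picks the highest-scoring candidate the CPU can run. The types, flags and function names are hypothetical, not pixman code. The scores are simply the MPix/s figures Siarhei quoted for scaled over_8888_8888, so they only describe that one benchmark on his machine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU feature flags. */
enum { SKETCH_CPU_NONE = 0, SKETCH_CPU_SSE2 = 1 << 0 };

typedef struct {
    const char *name;
    unsigned    required_cpu; /* features this routine needs */
    double      score;        /* static estimate; higher means faster */
} sketch_candidate_t;

/* Scores are the MPix/s numbers quoted above for nearest-scaled
 * over_8888_8888; a real table would need a figure per routine. */
static const sketch_candidate_t sketch_over_8888_8888_nearest[] = {
    { "C fetcher + C combiner",    SKETCH_CPU_NONE,  76.71 },
    { "C fast path",               SKETCH_CPU_NONE, 106.47 },
    { "C fetcher + SSE2 combiner", SKETCH_CPU_SSE2, 182.65 },
    { "SSE2 fast path",            SKETCH_CPU_SSE2, 270.57 },
};

/* Return the runnable candidate with the highest score, or NULL if none
 * of the n candidates can run on a CPU with the given feature mask. */
static const sketch_candidate_t *
sketch_pick_fastest(const sketch_candidate_t *c, size_t n, unsigned cpu)
{
    const sketch_candidate_t *best = NULL;

    assert(c != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if ((c[i].required_cpu & cpu) != c[i].required_cpu)
            continue; /* needs a feature this CPU lacks */
        if (best == NULL || c[i].score > best->score)
            best = &c[i];
    }
    return best;
}
```

With the SSE2 fast path left out of the table, an SSE2 machine gets "C fetcher + SSE2 combiner" rather than "C fast path", which is the ordering Siarhei's numbers argue for. A single number per routine cannot capture the opaque/transparent special-casing he mentions, so this only ranks the worst case.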
It isn't a huge problem yet, but it will only get worse.