[Pixman] [PATCH 0/2] SSSE3 iterator and fast path selection issues

Thu Aug 29 10:52:55 PDT 2013

On Thu, Aug 29, 2013 at 01:02:51PM -0400, Søren Sandmann Pedersen wrote:
> The following patches add a new SSSE3 implementation and an iterator
> for separable bilinear scaling. As expected, the new iterator is
> clearly faster than the C iterator.
> 
> Unfortunately or fortunately, when combined with the SSE2 combiner, it
> is also clearly faster than the existing SSE2 fast paths. This is a
> problem because the fast paths will be chosen ahead of the general
> implementation and therefore of the new iterator.
> 
> This is an instance of a more general problem that we have had for a
> long time and which will only get worse as we add new iterators and
> fast paths. The following assumptions that we are making:
> 
> - That a fast path is always better than iterators + combiners
> - That newer SIMD is always better than older (e.g., SSE2 > MMX)
> 
> are both wrong. The SSSE3 iterator demonstrates that the first is
> wrong, and the second could be wrong for example if we had an MMX fast
> path specialized for scaling and an SSE2 one specialized for general
> bilinear transformation. Both could handle a scaling operation, but
> it's quite likely that the MMX one would be faster.
> 
> It's not the first time this has come up:
> 
>     http://lists.freedesktop.org/archives/pixman/2011-January/000949.html
>     http://lists.freedesktop.org/archives/pixman/2011-January/000950.html
>     http://lists.freedesktop.org/archives/pixman/2011-February/000957.html
> 
> but no clear solution has been found so far.
> 
> I still don't have a good answer; all ideas welcome.

Again, this is not a general solution, but in this case you could install
the SSSE3 combiners first and have the SSE2 implementation check whether
the faster combiners are available and, if so, not install its own fast
scaling paths.
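
A minimal sketch of that install-time check, assuming a simplified
implementation chain. Every name here (impl_chain_t, install_ssse3,
install_sse2, the fast path strings) is hypothetical and chosen for
illustration; none of it is pixman's actual API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an implementation chain; not pixman's types. */
enum { MAX_FAST_PATHS = 8 };

typedef struct {
    bool        has_ssse3_scaling;          /* set once the SSSE3 code is installed */
    size_t      n_fast_paths;
    const char *fast_paths[MAX_FAST_PATHS]; /* names only, for illustration */
} impl_chain_t;

void add_fast_path(impl_chain_t *chain, const char *name)
{
    assert(chain != NULL);
    assert(chain->n_fast_paths < MAX_FAST_PATHS);
    chain->fast_paths[chain->n_fast_paths++] = name;
}

/* Installed first, so later implementations can see what is available. */
void install_ssse3(impl_chain_t *chain, bool cpu_has_ssse3)
{
    assert(chain != NULL);
    if (cpu_has_ssse3)
        chain->has_ssse3_scaling = true;
}

/* The SSE2 implementation skips its own scaling fast path when the
 * faster SSSE3 code is already present. */
void install_sse2(impl_chain_t *chain)
{
    assert(chain != NULL);
    add_fast_path(chain, "sse2_over_fast_path");
    if (!chain->has_ssse3_scaling)
        add_fast_path(chain, "sse2_scaled_bilinear_fast_path");
}
```

The ordering matters: install_ssse3() has to run before install_sse2(),
otherwise the SSE2 side cannot know whether to hold back its scaling path.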

FWIW, I favour classifying each fast path at compile time and picking the
favourite, rather than taking the first match. I would measure the
performance on classes of machines and have each class select its own
priority table (within reason, with magic TBD). For example, Atom has
sufficiently different characteristics from Core that I expect there to be
a difference in ranking.
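
Roughly, as a sketch only: each candidate carries a rank per CPU class and
selection takes the highest-ranked match instead of the first one. The
types, names and rank values below are all made up for illustration (the
ranks are placeholders, not measurements), and nothing here is existing
pixman code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU classes and operations, for illustration only. */
typedef enum { CPU_CLASS_ATOM, CPU_CLASS_CORE, CPU_CLASS_COUNT } cpu_class_t;
typedef enum { OP_SCALED_BILINEAR, OP_OVER } op_t;

typedef struct {
    const char *name;
    op_t        op;
    int         rank[CPU_CLASS_COUNT]; /* higher is preferred */
} path_t;

/* Static table, so the classification is fixed at compile time.
 * Rank values are placeholders, NOT benchmark results. */
const path_t example_paths[] = {
    { "sse2_scaled_fast_path",          OP_SCALED_BILINEAR, { 30, 20 } },
    { "ssse3_iterator_plus_sse2_comb",  OP_SCALED_BILINEAR, { 10, 40 } },
    { "sse2_over_fast_path",            OP_OVER,            { 50, 50 } },
};

/* Return the highest-ranked path that handles `op` on this CPU class,
 * or NULL if none matches. Ties keep the earlier table entry. */
const path_t *pick_path(const path_t *paths, size_t n,
                        op_t op, cpu_class_t cpu)
{
    const path_t *best = NULL;

    assert(paths != NULL || n == 0);
    assert(cpu < CPU_CLASS_COUNT);

    for (size_t i = 0; i < n; i++) {
        if (paths[i].op != op)
            continue;
        if (best == NULL || paths[i].rank[cpu] > best->rank[cpu])
            best = &paths[i];
    }
    return best;
}
```

With a table like this the order of entries no longer decides the winner,
so a new iterator or fast path can be added anywhere and only needs a rank.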

Even basing the selection on the performance of just one machine is
likely to give better results for the majority of machines (of that
architecture).
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre