[Pixman] [PATCH 2/2] sse2, mmx: Remove initial unaligned loops in fetchers

Sun Sep 1 10:32:37 PDT 2013

On Thu, 29 Aug 2013 05:59:26 +0200
sandmann at cs.au.dk (Søren Sandmann) wrote:

> Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:
> 
> >> With this new alignment assumption, such an optimization becomes even more
> >> impossible, 
> >
> > Implementing this optimization does not seem to be too difficult in
> > principle. I tried to hack a bit and here is the result:
> >
> >     http://lists.freedesktop.org/archives/pixman/2013-August/002863.html
> >
> > Basically, we want to be able to update the pointer to the iterator
> > output buffer at any time in the middle of loop instead of setting it
> > in stone at the time of iterator creation.
> >
> > Updating this temporary buffer pointer could be added as a new iterator
> > method call (extending the existing get_scanline/write_back/fini). But
> > then supporting "only compatible with the aligned buffer" iterators
> > would be a little bit more tricky. Either way, the required code
> > modifications seem to be rather small.
> 
> I like this approach -- it's much simpler than my earlier proposal [1]
> where the iterator would have to be told about the destination stride
> and update the pointer itself, causing the iterator API to become much
> more complex. By just storing a pointer to a pointer, this all goes
> away.
> 
> The only thing I'm wondering is whether it would be cleaner to do it as
> a new fast path in pixman-fast-path.c instead of adding special cases to
> the general code? That way, it wouldn't even have to use a destination
> iterator, it could just maintain a destination pointer itself.

I have been experimenting with avoiding the redundant copy combiners
for the last few days. And it looks like we have such redundant copies
in a bit more cases than just fetch to a8r8g8b8 destination. The
pointer to pointer approach does not seem to address all of them
nicely, so now I'm actually leaning to the virtual function based
solution. But it's still more like an RFC (cleaner and/or more
functional proposals are welcome):

    http://lists.freedesktop.org/archives/pixman/2013-September/002879.html

BTW, there is one more hilarious case of redundant copies still to be
addressed. Such as converting some non-a8r8g8b8 format to itself which
still goes through the "fetch -> combine -> writeback" pipeline. There
are fast paths to solve this, like "sse2_composite_copy_area" and
"fast_composite_src_memcpy", but they only support a limited number of
formats and clutter the fast path tables:

    http://cgit.freedesktop.org/pixman/tree/pixman/pixman-sse2.c?id=pixman-0.30.2#n6150

One can always encounter something like "a4r4g4b4 -> a4r4g4b4" copy and
get into performance troubles.

> >> but in any it was always going to require a rather complex internal API
> >> to work because iterators are not required to use iter->buffer at all.
> >
> > Well, iter->buffer (or iter->direct_dest_buffer_ptr) is more like a
> > hint. If the iterator uses it, that's great. If it does not, then we
> > still can compare the returned pointer with the pointer to the current
> > destination image scanline. And fallback to using a combiner if the
> > pointers don't match.
> >
> > The first patch in your series (aligning the temporary buffer) is fine.
> > It's just that the second patch is dropping the handling of unaligned
> > pointers in the fetchers, and this code still could have some use:
> >
> >     http://lists.freedesktop.org/archives/pixman/2013-August/002864.html
> 
> The main reason for these patches is that I'm writing an SSSE3 bilinear
> iterator that really wants an aligned buffer to write to, so I'm fine
> with dropping the changes to the MMX/SSE2 iterators.

With the new iterator virtual function approach, the iterator may
actually check the buffer alignment and do direct fetch to the
destination buffer for some selected scanlines (which happen to have
suitable alignment). For the other scanlines it can fallback to using
the intermediate aligned buffer.

I'm also considering one more possible use. It might be a good idea
to co-align the temporary buffer of the source iterator with the
scanlines in the destination image (for the operators other than SRC).
Because the alignment of the scanlines in the destination image is out
of our control, we can just realign the temporary buffer on each
iteration. As a result, the inner loops of the combiners could do both
aligned vector reads and writes (should be particularly useful for
VMX/Altivec, which seems to be a bit messed up).

-- 
Best regards,
Siarhei Siamashka