[Pixman] [PATCH 2/2] sse2, mmx: Remove initial unaligned loops in fetchers

Wed Sep 4 08:10:16 PDT 2013

Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:

> On Thu, 29 Aug 2013 05:59:26 +0200
> sandmann at cs.au.dk (Søren Sandmann) wrote:
>
>> Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:
>> 
>> >> With this new alignment assumption, such an optimization becomes even more
>> >> impossible, 
>> >
>> > Implementing this optimization does not seem to be too difficult in
>> > principle. I tried to hack a bit and here is the result:
>> >
>> >     http://lists.freedesktop.org/archives/pixman/2013-August/002863.html
>> >
>> > Basically, we want to be able to update the pointer to the iterator
>> > output buffer at any time in the middle of loop instead of setting it
>> > in stone at the time of iterator creation.
>> >
>> > Updating this temporary buffer pointer could be added as a new iterator
>> > method call (extending the existing get_scanline/write_back/fini). But
>> > then supporting "only compatible with the aligned buffer" iterators
>> > would be a little bit more tricky. Either way, the required code
>> > modifications seem to be rather small.
>> 
>> I like this approach -- it's much simpler than my earlier proposal [1]
>> where the iterator would have to be told about the destination stride
>> and update the pointer itself, causing the iterator API to become much
>> more complex. By just storing a pointer to a pointer, this all goes
>> away.
>> 
>> The only thing I'm wondering is whether it would be cleaner to do it as
>> a new fast path in pixman-fast-path.c instead of adding special cases to
>> the general code? That way, it wouldn't even have to use a destination
>> iterator, it could just maintain a destination pointer itself.
>
> I have been experimenting with avoiding the redundant copy combiners
> for the last few days. And it looks like we have such redundant copies
> in a bit more cases than just fetch to a8r8g8b8 destination. The
> pointer to pointer approach does not seem to address all of them
> nicely, so now I'm actually leaning to the virtual function based
> solution. But it's still more like an RFC (cleaner and/or more
> functional proposals are welcome):
>
>     http://lists.freedesktop.org/archives/pixman/2013-September/002879.html

One thing that annoys me a bit about the new patches is that the
destination iterator is never allowed to write anything into the offer
buffer. That means that it can essentially only accept the offer if it
knows that it doesn't need to fetch anything.

Here is another proposal, but I'm not sure it's really better:

- The combiners are made to return a buffer. The returned buffer is
  expected to contain the combined result and may be any of the passed
  src/mask/dest buffers. Almost all combiners will continue to combine
  into the dest buffer, which they will also return. But the SRC
  combiner will simply return the source buffer without any memcpy()ing.

- The write_back() method is extended with an argument "buffer", which
  contains the bits that are supposed to be written back. In many cases
  this will be whatever the dest iterator returned from the
  get_scanline() call, but it can be be different.

  This means the noop dest iter will have to check whether the passed
  buffer is different from iter->buffer and call memcpy() if it is.

  The other dest iters will just have to use the buffer argument rather
  than iter->buffer as the source.

I believe this is sufficient to make cases 2, 3, and 4 from the commit
message work:

   2. bilinear a8r8g8b8 -> r5g6b5

   The bilinear iterator will fetch into iter->buffer; the combiner will
   simply return the source buffer, and the dest iter will then convert
   to r5g6b5 and store.

   3. a8r8g8b8 -> r5g6b5

   The noop src iterator will return pointers to the source image, and
   the combiner will simply return them. The dest iter will convert to
   565 and store.

   4. horizontal linear gradient -> r5g6b5

   Smilar to 3.

That leaves case 1 (bilinear a8r8g8b8 -> a8r8g8b8), which could be
handled with the pointer-to-pointer approach (and could possibly benefit
from being made a fast path (fast_composite_src_any_8888) that keeps
track of the destination pointer itself without using a dest iter).

But as mentioned, I'm not sure that this proposal is actually
better. One thing that seems a little dubious is that the noop dest
iterator will no longer actually be a noop (since in some cases it would
have to do a memcpy()). On the other hand, it avoids special cases in
pixman-general.c, and the amount of new code is fairly limited if we
ignore the trivial "return dest;" that would have to be added to all the
combiners.

> BTW, there is one more hilarious case of redundant copies still to be
> addressed. Such as converting some non-a8r8g8b8 format to itself which
> still goes through the "fetch -> combine -> writeback" pipeline. There
> are fast paths to solve this, like "sse2_composite_copy_area" and
> "fast_composite_src_memcpy", but they only support a limited number of
> formats and clutter the fast path tables:
>
>     http://cgit.freedesktop.org/pixman/tree/pixman/pixman-sse2.c?id=pixman-0.30.2#n6150
>
> One can always encounter something like "a4r4g4b4 -> a4r4g4b4" copy and
> get into performance troubles.

Yeah, the problem is that there is no good way for the fast path tables
to express "source and dest must have the same format", though I suppose
a new fast path flag:

    FAST_SAME_FORMAT_AS_SOURCE

could be added, that would then only be used for destinations.

>> The main reason for these patches is that I'm writing an SSSE3 bilinear
>> iterator that really wants an aligned buffer to write to, so I'm fine
>> with dropping the changes to the MMX/SSE2 iterators.
>
> With the new iterator virtual function approach, the iterator may
> actually check the buffer alignment and do direct fetch to the
> destination buffer for some selected scanlines (which happen to have
> suitable alignment). For the other scanlines it can fallback to using
> the intermediate aligned buffer.

Yes, this would be useful. We may also want to consider making the
stride a multiple of 16 bytes (and aligning the bits pointer) in
pixman_image_create_bits() to make it more likely that images are
aligned.

> I'm also considering one more possible use. It might be a good idea
> to co-align the temporary buffer of the source iterator with the
> scanlines in the destination image (for the operators other than SRC).
> Because the alignment of the scanlines in the destination image is out
> of our control, we can just realign the temporary buffer on each
> iteration. As a result, the inner loops of the combiners could do both
> aligned vector reads and writes (should be particularly useful for
> VMX/Altivec, which seems to be a bit messed up).

Good idea. Maybe I should just drop the alignment patch for now and make
the SSSE3 iterator use movdqu instead of movdqa? (Especially since I
missed the iter initialization in pixman-image.c where result also needs
to be aligned to 16 bytes).

Søren