[Pixman] Resurrecting: Added fast path for "pad" type repeats

Mon Apr 21 18:38:36 PDT 2014

Pekka Paalanen <ppaalanen at gmail.com> writes:

> Hi,
>
> this thread started from
> http://lists.freedesktop.org/archives/pixman/2013-February/002619.html
> and continued in
> http://lists.freedesktop.org/archives/pixman/2013-March/002677.html
>
> I'd like to hear what the thoughts of it are now, more questions below.
>
>
> On Wed, 13 Feb 2013 08:37:06 +0100
> sandmann at cs.au.dk (Søren Sandmann) wrote:
>
>> I pushed the first patch to master. For this one, did you try comparing
>> Chris' patch to your on ARMv6? Also, did we ever find out whether it was
>> a bug in Firefox. I'm still somewhat skeptical that it's intended for a
>> PAD image to be accessed so far out of bounds.
>
> What is the "Chris' patch", has it been merged?
>
> Do you still think there is a bug in Firefox, and the (trimmed) Cairo
> traces are therefore invalid for PAD type repeats?

The "Chris' patch" is this:

    http://lists.freedesktop.org/archives/pixman/2013-January/002572.html

I still believe what I wrote here:

    http://lists.freedesktop.org/archives/pixman/2013-January/002600.html

that an operation where

     - the repeat mode is PAD
     - the transformation is identity
     - the source image is smaller than the destination (but it is not a
       1 x n image)

suggests that something suboptimal is going on at a higher level than
pixman because the visual results will be an image surrounded by weird
horizontal and vertical stripes, which is not normally desirable. And if
I remember correctly, the firefox-chalkboard source image is actually
some kind of noise mask that really looks like it should be used with
repeat=NORMAL, not repeat=PAD.

Still, if this operation is being generated by Firefox (and I'd like to
make sure that this is not just an artefact of cairo-trace or something
like that), pixman should make it fast, so I am not opposed to the
general idea of optimizing it. So a good patch would be able to make it
upstream.

> On Thu, 07 Mar 2013 23:16:04 +0100
> sandmann at cs.au.dk (Søren Sandmann) wrote:
>
>> Hi,
>> 
>> >> - The flags flags of images should be computed with
>> >>   _pixman_image_validate(), not just overwritten.
>> >
>> > You mean the line that clears out the repeat bits and sets
>> > FAST_PATH_SAMPLES_COVER_CLIP_NEAREST? That sounds relatively expensive,
>> > and also wouldn't set FAST_PATH_SAMPLES_COVER_CLIP_NEAREST (that's mainly
>> > done by analyze_extent(), which isn't called by _pixman_image_validate(),
>> > and which is necessary to match the most useful child fast paths).
>> 
>> I didn't actually mean that particular line, I was only talking about
>> validating the newly created images. But it's a good point that
>> COVER_CLIP_NEAREST wouldn't get set.
>> 
>> > One other thing while this reminds me: the way
>> > fast_composite_tiled_repeat() is matched in the fast path list means that
>> > it won't match something like an over_n_8_8888 operation where the mask
>> > is tiled (but the source isn't, of course, being a solid image). Although
>> > this seems a reasonable operation, I wasn't able to find any instances in
>> > the cairo-perf-traces, so didn't bother worrying about it any further.
>> 
>> It would be nice if there were a more general way for a fast path to be
>> recursive, where the recursive calls can take full advantage of the
>> optimizations done by pixman_image_composite32(), including:
>> 
>>     - 'strength reduction' in optimize_operator()
>>     - all flags, including COVER_CLIP
>> 
>> and where the overhead of recursion is not too devastating. (And also
>> avoid forcing recursion on the backends, where they might want to do the
>> entire operation by themselves).
>> 
>> So maybe we should just declare it possible to call
>> pixman_image_composite32, or some internal variant of it, recursively as
>> long as the images involved in the recursive call were different than
>> the ones in the outer image, and then deal with whatever bugs and
>> performance issues surface.
>> 
>> Then composite_tiled() could be broken into two functions:
>> 
>>      composite_tiled_src()
>>      composite_tiled_mask()
>> 
>> where composite_tiled_src() would create a clone of the src image that
>> had REPEAT_NORMAL disabled and then call pixman_image_composite32()
>> recursively, which in turn, if the mask was also repeating, would end up
>> calling composite_tiled_mask().
>> 
>> Similarly, composite_repeat_pad() would create nine cloned images along
>> the lines of the pseudo code I posted and make nine recursive calls to
>> pixman_image_composite32(). There could be additional special cased
>> support for 1xn/NORMAL images that would then automatically be used by
>> both the tiled and pad fast paths.
>> 
>> So two pieces of infrastructure would be needed: The ability to create
>> temporary subimages with different repeat modes, and the ability to call
>> pixman_image_composite32() recursively.
>> 
>> What I like about this scheme is that it makes the code reflect exactly
>> what's going on: composite operations are being broken down into smaller
>> composite operation. It really is recursion; making the code reflect
>> this is a good thing.
>> 
>> There is an obvious danger that the overhead from creating clone images
>> and calling pixman_image_composite32() would be too high, but it is
>> important to note that pixman spends most of its time actually
>> compositing and ones intuition about per-composite overhead can easily
>> be wrong. (Especially now that glyph compositing has explicit support).
>> 
>> However, if overhead turns out to be an issue, there are potential fixes
>> such as caching temporary images and making an internal variant of
>> pixman_image_composite32() that would drop the region computations etc.
>> Some such fixes could benefit the rest of pixman as well.
>
> New infrastructure...
>
>> >> - You added a simple test program (thanks for that, that already puts
>> >>   you ahead of most people), but it only checks one single composite
>> >>   operation. This fast path is complex enough that I think the test
>> >>   should verify many more alignments of source and destination (and
>> >>   mask if support for that is kept), and more formats and operators.
>> >
>> > I think I assumed that the existing fuzzers would already be doing a
>> > reasonable job of that; this test was mainly just something that I put
>> > together so I could quickly check visually that I hadn't made any major
>> > mistakes in implementation (hence the switched out printf statements),
>> > and I added it to the patch series more as an afterthought. I assume you
>> > think we'd benefit from a fuzzer that particularly focuses on repeat
>> > types then?
>> 
>> As far as I can tell, blitters-test doesn't test PAD at all, so if we
>> are going to have special-cased support for it, then it would be
>> beneficial to have testing for it.
>> 
>> >> An issue here is that if the single pixels in the corners are opaque and
>> >> the operator is OVER, we'd ideally want to do the 'strength reduction'
>> >> from optimize_operator() in pixman.c to convert the OVER to SRC instead
>> >> so that the operation becomes a solid fill (or even a noop) rather than
>> >> an over_n_8888().
>> >
>> > In case you didn't notice, I got round that in my subsequent patches my
>> > putting that test into my implementations of over_n_8888 and falling back
>> > to fast_composite_solid_fill() or pixman_composite_src_n_8888_asm_armv6()
>> > respectively.
>> 
>> I did see that, but in principle, there could be a faster version of
>> solid fill available at a higher-grade implementation. E.g., if
>> over_n_8888 is available for mmx, but not sse2, but solid fill is
>> available for both, you'd want to pick the SSE2 solid fill.
>> 
>> The SSE2 variant of over_n_8888() doesn't have the reduction to solid
>> fill btw.
>
> I didn't understand much about the whole discussion in the email
> thread, but I gathered these points:
>
> 1. The first thing would be to write new (fuzzer) tests that hit the PAD
>    repeat type with enough variations.
>
> 2. The PAD repeat case is split into 9 smaller composites, each
>    optimized separately to take advantage of solid color and 1xn images,
>    somehow.
>
> 3. Profit?

Yes, this is basically right, though the order of 1 and 2 could be
reversed step 2 is the one that may turn out to be impossible. And the
question mark in step 3 is real - it could be that this works, but isn't
actually a performance improvement.

> After that is completed, I could start looking at the rest of Ben's
> patches, which rely on this decomposition to show measurable
> performance improvements on RaspberryPi.

Yes, although it might be simpler to extend lowlevel-blt-bench to hit
some of these cases.

> In his original email, Ben showed a speedup of "3.86x" in
> t-firefox-chalkboard trimmed Cairo trace on ARMv6.
>
> However, it didn't come clear to me, whether this "pad" kind of
> optimization would ever be accepted upstream, or what exactly an
> acceptable approach would be. Could you shed some light on that?
>
> Apparently it should go on the "new infrastructure" way instead?

The "new infrastructure" here is probably not as work-intensive as you
think. It basically boils down to 

       1. Add a function that makes a clone of a pixman_image_t struct
          using a different rectangle of the data bits.

       2. Change Ben's patch to call pixman_image_composite32() on
          clones of the input image recursively -- along the lines of
          the pseudo code in

       http://lists.freedesktop.org/archives/pixman/2013-March/002677.html

As I mentioned in one of the mails, I wasn't thrilled about Ben's patch,
but I don't have a good alternate proposal. The above is my best guess
at a way to do it, but there is a real possibility that it won't work
that well.

Sorry if this is an unsatisfactory answer.

Søren