[Pixman] Resurrecting: Added fast path for "pad" type repeats

Mon Apr 7 02:22:15 PDT 2014

Hi,

this thread started from
http://lists.freedesktop.org/archives/pixman/2013-February/002619.html
and continued in
http://lists.freedesktop.org/archives/pixman/2013-March/002677.html

I'd like to hear what the thoughts of it are now, more questions below.

On Wed, 13 Feb 2013 08:37:06 +0100
sandmann at cs.au.dk (Søren Sandmann) wrote:

> I pushed the first patch to master. For this one, did you try comparing
> Chris' patch to your on ARMv6? Also, did we ever find out whether it was
> a bug in Firefox. I'm still somewhat skeptical that it's intended for a
> PAD image to be accessed so far out of bounds.

What is the "Chris' patch", has it been merged?

Do you still think there is a bug in Firefox, and the (trimmed) Cairo
traces are therefore invalid for PAD type repeats?

On Thu, 07 Mar 2013 23:16:04 +0100
sandmann at cs.au.dk (Søren Sandmann) wrote:

> Hi,
> 
> >> - The flags flags of images should be computed with
> >>   _pixman_image_validate(), not just overwritten.
> >
> > You mean the line that clears out the repeat bits and sets
> > FAST_PATH_SAMPLES_COVER_CLIP_NEAREST? That sounds relatively expensive,
> > and also wouldn't set FAST_PATH_SAMPLES_COVER_CLIP_NEAREST (that's mainly
> > done by analyze_extent(), which isn't called by _pixman_image_validate(),
> > and which is necessary to match the most useful child fast paths).
> 
> I didn't actually mean that particular line, I was only talking about
> validating the newly created images. But it's a good point that
> COVER_CLIP_NEAREST wouldn't get set.
> 
> > One other thing while this reminds me: the way
> > fast_composite_tiled_repeat() is matched in the fast path list means that
> > it won't match something like an over_n_8_8888 operation where the mask
> > is tiled (but the source isn't, of course, being a solid image). Although
> > this seems a reasonable operation, I wasn't able to find any instances in
> > the cairo-perf-traces, so didn't bother worrying about it any further.
> 
> It would be nice if there were a more general way for a fast path to be
> recursive, where the recursive calls can take full advantage of the
> optimizations done by pixman_image_composite32(), including:
> 
>     - 'strength reduction' in optimize_operator()
>     - all flags, including COVER_CLIP
> 
> and where the overhead of recursion is not too devastating. (And also
> avoid forcing recursion on the backends, where they might want to do the
> entire operation by themselves).
> 
> So maybe we should just declare it possible to call
> pixman_image_composite32, or some internal variant of it, recursively as
> long as the images involved in the recursive call were different than
> the ones in the outer image, and then deal with whatever bugs and
> performance issues surface.
> 
> Then composite_tiled() could be broken into two functions:
> 
>      composite_tiled_src()
>      composite_tiled_mask()
> 
> where composite_tiled_src() would create a clone of the src image that
> had REPEAT_NORMAL disabled and then call pixman_image_composite32()
> recursively, which in turn, if the mask was also repeating, would end up
> calling composite_tiled_mask().
> 
> Similarly, composite_repeat_pad() would create nine cloned images along
> the lines of the pseudo code I posted and make nine recursive calls to
> pixman_image_composite32(). There could be additional special cased
> support for 1xn/NORMAL images that would then automatically be used by
> both the tiled and pad fast paths.
> 
> So two pieces of infrastructure would be needed: The ability to create
> temporary subimages with different repeat modes, and the ability to call
> pixman_image_composite32() recursively.
> 
> What I like about this scheme is that it makes the code reflect exactly
> what's going on: composite operations are being broken down into smaller
> composite operation. It really is recursion; making the code reflect
> this is a good thing.
> 
> There is an obvious danger that the overhead from creating clone images
> and calling pixman_image_composite32() would be too high, but it is
> important to note that pixman spends most of its time actually
> compositing and ones intuition about per-composite overhead can easily
> be wrong. (Especially now that glyph compositing has explicit support).
> 
> However, if overhead turns out to be an issue, there are potential fixes
> such as caching temporary images and making an internal variant of
> pixman_image_composite32() that would drop the region computations etc.
> Some such fixes could benefit the rest of pixman as well.

New infrastructure...

> >> - You added a simple test program (thanks for that, that already puts
> >>   you ahead of most people), but it only checks one single composite
> >>   operation. This fast path is complex enough that I think the test
> >>   should verify many more alignments of source and destination (and
> >>   mask if support for that is kept), and more formats and operators.
> >
> > I think I assumed that the existing fuzzers would already be doing a
> > reasonable job of that; this test was mainly just something that I put
> > together so I could quickly check visually that I hadn't made any major
> > mistakes in implementation (hence the switched out printf statements),
> > and I added it to the patch series more as an afterthought. I assume you
> > think we'd benefit from a fuzzer that particularly focuses on repeat
> > types then?
> 
> As far as I can tell, blitters-test doesn't test PAD at all, so if we
> are going to have special-cased support for it, then it would be
> beneficial to have testing for it.
> 
> >> An issue here is that if the single pixels in the corners are opaque and
> >> the operator is OVER, we'd ideally want to do the 'strength reduction'
> >> from optimize_operator() in pixman.c to convert the OVER to SRC instead
> >> so that the operation becomes a solid fill (or even a noop) rather than
> >> an over_n_8888().
> >
> > In case you didn't notice, I got round that in my subsequent patches my
> > putting that test into my implementations of over_n_8888 and falling back
> > to fast_composite_solid_fill() or pixman_composite_src_n_8888_asm_armv6()
> > respectively.
> 
> I did see that, but in principle, there could be a faster version of
> solid fill available at a higher-grade implementation. E.g., if
> over_n_8888 is available for mmx, but not sse2, but solid fill is
> available for both, you'd want to pick the SSE2 solid fill.
> 
> The SSE2 variant of over_n_8888() doesn't have the reduction to solid
> fill btw.

I didn't understand much about the whole discussion in the email
thread, but I gathered these points:

1. The first thing would be to write new (fuzzer) tests that hit the PAD
   repeat type with enough variations.

2. The PAD repeat case is split into 9 smaller composites, each
   optimized separately to take advantage of solid color and 1xn images,
   somehow.

3. Profit?

After that is completed, I could start looking at the rest of Ben's
patches, which rely on this decomposition to show measurable
performance improvements on RaspberryPi.

In his original email, Ben showed a speedup of "3.86x" in
t-firefox-chalkboard trimmed Cairo trace on ARMv6.

However, it didn't come clear to me, whether this "pad" kind of
optimization would ever be accepted upstream, or what exactly an
acceptable approach would be. Could you shed some light on that?

Apparently it should go on the "new infrastructure" way instead?

I'm trying to gather some understanding on how much work it would be to
get the "3.86x" improvement upstream, and whether I as a Pixman-novice
would be able to pull it off in reasonable time.

Thanks,
pq