[Pixman] [PATCH] sse2: Add a fast path for add_n_8888

Wed Jan 2 11:57:19 PST 2013

On Wed, 02 Jan 2013 19:12:21 +0000
Chris Wilson <chris at chris-wilson.co.uk> wrote:

> On Wed, 02 Jan 2013 19:40:58 +0100, sandmann at cs.au.dk (=?utf-8?Q?S=C3=B8ren?= Sandmann) wrote:
> > Chris Wilson <chris at chris-wilson.co.uk> writes:
> > 
> > > This path is being exercised by inplace compositing of trapezoids, for
> > > instance as used in the firefox-asteroids cairo-trace.
> > >
> > > core2 @ 2.66GHz,
> > >
> > > reference memcpy speed = 4898.2MB/s (1224.6MP/s for 32bpp fills)
> > >
> > > before: add_n_8888 = L1:   4.36  L2:   4.27  M:  1.61 (  0.13%)  HT:
> > > 1.65  VT:  1.63  R:  1.63  RT:  1.59 (  21Kops/s)
> > >
> > > after:  add_n_8888 = L1:2969.09  L2:3926.11  M:603.30 ( 49.27%)  HT:524.69
> > > VT:401.01  R:407.59  RT:210.34 ( 804Kops/s)
> > 
> > Just two brief comments, and then I'll disappear again (until the 11th
> > or so):
> > 
> > - It looks like this function will work for abgr destinations as well as
> >   argb.
> > 
> > - I'm surprised that the new function is _that_ much better. The current
> >   code should hit an SSE2 combiner and noop iterators for both source
> >   and destination, so while I'd expect a solid improvement from a
> >   dedicated fast path, it is hard to believe that it would be 919 times
> >   faster than the old. If these numbers are real, there has to be
> >   something wrong with either the benchmark or the current code.
> 
> Judging from the perf profile of cairo-traces, the delta is closer to 5x.
> All I did to gather the numbers was to run 
>    ./test/lowlevel-blt-bench -n add_n_8888
> which is dominated by general_composite_rect:
> 
>  if (repeat == PIXMAN_REPEAT_NORMAL)
>  {
> 	while (*c >= size)
> 		*c -= size;
> 	while (*c < 0)
> 		*c += size;
>  }
> 
> special casing size==1 there boosts the L1 results from 4 to 70, but it
> still surprising that we hit that path at all.

The "-n" option benchmarks nearest scaling. In this particular test we
hit general scaled fetch for "1x1 image with normal repeat", which is
equivalent to "solid". It is impressive how you managed to find such a
poorly performing case, and it shows quite a number of issues.

The most interesting is the use of repeat() inline function
    http://cgit.freedesktop.org/pixman/tree/pixman/pixman-inlines.h?id=pixman-0.28.2#n56
which is called from bits_image_fetch_affine_no_alpha() for fetching
individual pixels:
    http://cgit.freedesktop.org/pixman/tree/pixman/pixman-bits-image.c?id=pixman-0.28.2#n459

The problem here is that this repeat() function is optimized for
quickly returning the current pixel position back into image bounds
when sequentially walking over pixels and is trying to avoid the
expensive modulo operator. But it happens to also get used for
fetching individual pixels at arbitrarily large coordinates on
the general path. And the loop may run for a really ridiculous
number of iterations, so 5x or 900x slowdown is not so surprising
here.

> Ah, read the options to lowlevel-blt-bench wrong...
> 
> ./test/lowlevel-blt-bench add_n_8888:
>    add_n_8888 =  L1:1131.58  L2:1112.37  M:530.11 ( 43.24%)  HT:108.01  VT:
> 99.03  R: 90.03  RT: 25.11 ( 306Kops/s)

This surely looks more reasonable. And maybe adding a shortcut to the
combiner for skipping the mask would make it somewhat closer to the
performance of the new sse2_add_n_8888 fast path

> > > after:  add_n_8888 = L1:2969.09  L2:3926.11  M:603.30 ( 49.27%)  HT:524.69
> > > VT:401.01  R:407.59  RT:210.34 ( 804Kops/s)

-- 
Best regards,
Siarhei Siamashka