[Pixman] [PATCH] sse2: Add a fast path for add_n_8888
Siarhei Siamashka
siarhei.siamashka at gmail.com
Wed Jan 2 10:54:34 PST 2013
On Wed, 02 Jan 2013 19:40:58 +0100
sandmann at cs.au.dk (Søren Sandmann) wrote:
> Chris Wilson <chris at chris-wilson.co.uk> writes:
>
> > This path is being exercised by inplace compositing of trapezoids, for
> > instance as used in the firefox-asteroids cairo-trace.
> >
> > core2 @ 2.66GHz,
> >
> > reference memcpy speed = 4898.2MB/s (1224.6MP/s for 32bpp fills)
> >
> > before: add_n_8888 = L1:   4.36  L2:   4.27  M:  1.61 (  0.13%)  HT:  1.65  VT:  1.63  R:  1.63  RT:  1.59 (  21Kops/s)
> >
> > after:  add_n_8888 = L1:2969.09  L2:3926.11  M:603.30 ( 49.27%)  HT:524.69  VT:401.01  R:407.59  RT:210.34 ( 804Kops/s)
>
> Just two brief comments, and then I'll disappear again (until the 11th
> or so):
>
> - It looks like this function will work for abgr destinations as well as
> argb.
>
> - I'm surprised that the new function is _that_ much better. The current
> code should hit an SSE2 combiner and noop iterators for both source
> and destination, so while I'd expect a solid improvement from a
> dedicated fast path, it is hard to believe that it would be 919 times
> faster than the old. If these numbers are real, there has to be
> something wrong with either the benchmark or the current code.
The "sse2_combine_add_u" combiner does not have a special path for a
zero mask, and this could be improved. But indeed, the difference is
still unexpectedly large.
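
For reference, here is a rough sketch of what an add_n_8888 operation
boils down to: a per-channel saturating add of one solid color to every
32bpp destination pixel. This is not the code from the patch; the
function names below are hypothetical and only illustrate the idea.
Because each byte is handled independently, the same loop should work
for abgr as well as argb destinations.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Scalar reference (hypothetical name): add the solid color 'src' to each
 * pixel, clamping every 8-bit channel at 0xff. */
static void add_n_8888_scalar(uint32_t *dst, uint32_t src, size_t w)
{
    for (size_t i = 0; i < w; i++) {
        uint32_t d = dst[i], r = 0;
        for (int shift = 0; shift < 32; shift += 8) {
            uint32_t sum = ((d >> shift) & 0xff) + ((src >> shift) & 0xff);
            if (sum > 0xff)
                sum = 0xff; /* saturate instead of wrapping around */
            r |= sum << shift;
        }
        dst[i] = r;
    }
}

/* SSE2 sketch (hypothetical name): four pixels per iteration using the
 * unsigned saturating byte add; leftover pixels use the scalar loop.
 * Falls back to the scalar loop entirely when SSE2 is unavailable. */
static void add_n_8888_sse2_sketch(uint32_t *dst, uint32_t src, size_t w)
{
    size_t i = 0;
#if defined(__SSE2__)
    __m128i xsrc = _mm_set1_epi32((int)src); /* replicate the solid color */
    for (; i + 4 <= w; i += 4) {
        __m128i d = _mm_loadu_si128((const __m128i *)(dst + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(d, xsrc));
    }
#endif
    add_n_8888_scalar(dst + i, src, w - i);
}
```

Unaligned loads and stores keep the sketch short; a real fast path
would more likely align the destination first and unroll the loop.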
--
Best regards,
Siarhei Siamashka