[Pixman] [PATCH] sse2: Add a fast path for add_n_8888

Wed Jan 2 10:40:58 PST 2013

Chris Wilson <chris at chris-wilson.co.uk> writes:

> This path is being exercised by inplace compositing of trapezoids, for
> instance as used in the firefox-asteroids cairo-trace.
>
> core2 @ 2.66GHz,
>
> reference memcpy speed = 4898.2MB/s (1224.6MP/s for 32bpp fills)
>
> before: add_n_8888 = L1:   4.36  L2:   4.27  M:  1.61 (  0.13%)  HT:  1.65  VT:  1.63  R:  1.63  RT:  1.59 (  21Kops/s)
>
> after:  add_n_8888 = L1:2969.09  L2:3926.11  M:603.30 ( 49.27%)  HT:524.69  VT:401.01  R:407.59  RT:210.34 ( 804Kops/s)

Just two brief comments, and then I'll disappear again (until the 11th
or so):

- It looks like this function will work for abgr destinations as well as
  argb.

- I'm surprised that the new function is _that_ much better. The current
  code should hit an SSE2 combiner and noop iterators for both source
  and destination, so while I'd expect a solid improvement from a
  dedicated fast path, it is hard to believe that it would be 919 times
  faster than the old code (L2 goes from 4.27 to 3926.11). If these
  numbers are real, there has to be something wrong with either the
  benchmark or the current code.
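
To illustrate the first point: ADD with a solid source is a saturating
add applied to each 8-bit channel independently, so the result does not
depend on which byte lane holds red and which holds blue. The following
is a minimal scalar sketch, not the SSE2 code from the patch. The helper
names (add_sat_u8, add_pixel_8888, swap_rb) are made up for
illustration, and it assumes the solid colour reaches the fast path
already in the destination's channel order.

```c
#include <stdint.h>

/* Saturating add of two 8-bit channel values: the sum clamps at 255. */
static uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Hypothetical helper: ADD one solid 32-bit pixel onto a destination
 * pixel, one byte lane at a time. No lane is treated specially. */
static uint32_t add_pixel_8888(uint32_t src, uint32_t dst)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint8_t s = (uint8_t)(src >> shift);
        uint8_t d = (uint8_t)(dst >> shift);
        out |= (uint32_t)add_sat_u8(s, d) << shift;
    }
    return out;
}

/* Hypothetical helper: exchange the red and blue lanes, i.e. convert
 * between the a8r8g8b8 and a8b8g8r8 layouts. */
static uint32_t swap_rb(uint32_t p)
{
    return (p & 0xff00ff00u) | ((p >> 16) & 0xffu) | ((p & 0xffu) << 16);
}
```

Since every lane goes through the same operation, adding two pixels and
then swapping red and blue gives the same value as swapping both inputs
first and then adding, which is why one routine could serve both
destination formats under the assumption above.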

Soren