[cairo] [PATCH/RFC][pixman] More ARM NEON performance updates

Thu Dec 10 13:56:50 PST 2009

Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:

> 2. Some fetch/store functions (r5g6b5 format is the most interesting) benefit
> from SIMD optimizations a lot, at least for ARM NEON:
> 
> http://cgit.freedesktop.org/~siamashka/pixman/log/?h=fetch-r5g6b5-arm-neon
> 
> This is a little bit inconsistent with the other SIMD optimizations which are
> handled via pixman_implementation_t. So I'm all open to any suggestions about
> how to do it in a right way.

First, I think architecture specific fetchers are a very good
idea. There are a couple of bugs in bugzilla with SSE2 fetchers for
some formats, and both gradients and bilinear scaling could become
much faster with architecture specific code.

The way I have been thinking about it is to have implementations involved
when the images are created. During the creation they could then plug
in their own fetchers. So something along these lines:

- The pixman_image struct will be renamed to something like
  pixman_image_common, and it will contain the set of properties that
  describe the image completely. E.g., it will contain the
  transformation and the filter since these are inherent in what the
  image *is*. It will not contain any of the fetcher functions etc.,
  because those are essentially just caches - they could be recomputed
  from the generic struct if necessary.

- A pixman_image will then be something that the implementation can
  create, and it will contain

        - a pointer to the pixman_image_common.
        - fetch/store scanline functions
        - a property changed function
        - a pointer to a fallback pixman_image
        - whatever other information the implementations want to cache
          about the image.

- The fetch and store functions can then either do the fetching if
  they know how to, or they can fall back to the fetch/store in the
  fallback image.
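
To make the split above concrete, here is a rough sketch of what the two
structs and a delegating fetcher might look like. None of these names
exist in pixman today; image_common_t, image_t, delegate_fetch() and the
field layout are all made up for illustration, and the row stride is
simply assumed to equal the width in pixels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names here are hypothetical; only the overall shape follows the
 * proposal in the text. */

/* Properties that describe the image completely. No cached function
 * pointers live here. */
typedef struct image_common {
    int             width, height;
    int             format;  /* format code; unused in this sketch */
    const uint16_t *bits;    /* r5g6b5 pixels, stride == width (assumption) */
} image_common_t;

typedef struct image image_t;
typedef void (*fetch_scanline_t) (image_t *img, int x, int y,
                                  int width, uint32_t *buffer);

/* Per-implementation view of the image: caches plus a fallback link. */
struct image {
    image_common_t  *common;
    fetch_scanline_t fetch_scanline;
    void           (*property_changed) (image_t *img);
    image_t         *fallback;
    /* implementation-private cached data would go here */
};

/* Plain C r5g6b5 -> a8r8g8b8 fetcher, standing in for the generic one. */
static void
generic_fetch_r5g6b5 (image_t *img, int x, int y, int width, uint32_t *buffer)
{
    const uint16_t *src =
        img->common->bits + (size_t) y * img->common->width + x;

    for (int i = 0; i < width; i++)
    {
        uint32_t p = src[i];
        uint32_t r = (p >> 11) & 0x1f;
        uint32_t g = (p >> 5) & 0x3f;
        uint32_t b = p & 0x1f;

        /* Expand to 8 bits by replicating the top bits into the low bits. */
        r = (r << 3) | (r >> 2);
        g = (g << 2) | (g >> 4);
        b = (b << 3) | (b >> 2);

        buffer[i] = 0xff000000u | (r << 16) | (g << 8) | b;
    }
}

/* Used by an implementation that has no fetcher of its own for this
 * image: hand the request to the fallback image. */
static void
delegate_fetch (image_t *img, int x, int y, int width, uint32_t *buffer)
{
    assert (img->fallback != NULL);
    img->fallback->fetch_scanline (img->fallback, x, y, width, buffer);
}
```

An image whose fetch_scanline slot holds delegate_fetch() behaves
exactly like its fallback, so callers never need to know which level of
the chain actually did the work.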

So, pixman_image_create_bits() would create the common struct, then
call the implementation's create_bits_image(). That function would
fill in the property_changed() function.

The property_changed() function would fill in the fetch_scanline slot
with either an architecture specific fetcher or a delegate call that
would call fetch_scanline() for the next image in the fallback chain.
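
The creation flow could be sketched roughly as below. Again, every name
is hypothetical (implementation_t, the "neon" and "generic"
property_changed functions, the FMT_ and FETCH_ codes). The fetchers
only return an id saying which one ran instead of writing pixels, and
the sketch assumes that create_bits_image() builds the fallback image
recursively and calls property_changed() once right away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical codes, for illustration only. */
enum { FMT_A8R8G8B8, FMT_R5G6B5 };
enum { FETCH_GENERIC = 1, FETCH_NEON = 2 };

typedef struct image_common { int format; } image_common_t;
typedef struct image image_t;
typedef struct implementation implementation_t;

/* A real fetcher would fill a pixel buffer; this one just reports
 * which fetcher handled the request. */
typedef int (*fetch_scanline_t) (image_t *img);

struct image {
    image_common_t  *common;
    fetch_scanline_t fetch_scanline;
    void           (*property_changed) (image_t *img);
    image_t         *fallback;
};

struct implementation {
    implementation_t *delegate;   /* next implementation in the chain */
    void            (*property_changed) (image_t *img);
};

static int fetch_generic (image_t *img)     { (void) img; return FETCH_GENERIC; }
static int fetch_neon_r5g6b5 (image_t *img) { (void) img; return FETCH_NEON; }

static int
fetch_delegate (image_t *img)
{
    assert (img->fallback != NULL);
    return img->fallback->fetch_scanline (img->fallback);
}

/* The generic level can always fetch. */
static void
generic_property_changed (image_t *img)
{
    img->fetch_scanline = fetch_generic;
}

/* The "neon" level only knows r5g6b5; anything else is delegated. */
static void
neon_property_changed (image_t *img)
{
    img->fetch_scanline = (img->common->format == FMT_R5G6B5)
        ? fetch_neon_r5g6b5 : fetch_delegate;
}

/* Create this implementation's image plus the whole fallback chain. */
static image_t *
create_bits_image (implementation_t *imp, image_common_t *common)
{
    image_t *img = calloc (1, sizeof *img);

    if (!img)
        return NULL;

    img->common = common;
    img->property_changed = imp->property_changed;

    if (imp->delegate)
    {
        img->fallback = create_bits_image (imp->delegate, common);
        if (!img->fallback)
        {
            free (img);
            return NULL;
        }
    }

    img->property_changed (img);
    return img;
}

/* After the common struct changes, let every level re-pick its fetcher. */
static void
image_property_changed (image_t *img)
{
    for (; img; img = img->fallback)
        img->property_changed (img);
}

static void
image_free (image_t *img)
{
    while (img)
    {
        image_t *next = img->fallback;
        free (img);
        img = next;
    }
}
```

With a neon -> generic chain, an r5g6b5 image would get the NEON fetcher
directly, and changing the format followed by image_property_changed()
would switch the top-level slot to the delegate, which ends up in the
generic fetcher.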

As with the implementation delegates, if you can find a simpler setup,
I wouldn't be opposed to it, as long as it can do these things:

        - Allows fallbacks from SSE2->MMX->fast->generic

        - Doesn't rule out fetchers for gradients

Thanks,
Soren