[Pixman] [PATCH/RFC] Use OpenMP for bilinear scaled fast paths

Mon Jun 25 11:29:04 PDT 2012

Chris Wilson <chris at chris-wilson.co.uk> writes:

> On Mon, 25 Jun 2012 02:00:27 +0300, Siarhei Siamashka <siarhei.siamashka at gmail.com> wrote:
>> Does it actually make sense? I remember somebody was strongly opposing
>> the idea of spawning threads in pixman in the past, but can't find
>> this e-mail right now.

You may be remembering an IRC discussion about it, where Joonas was
opposed to libraries spawning threads:

    http://people.freedesktop.org/~sandmann/joonas-threads

> The only caveat from my point of view is that pixman_image_composite()
> must be atomic as the current cairo_image_surface_t is meant to be
> synchronous. Or at least API added so that I can serialise the

The main concern from me is making sure that it doesn't cause issues in
the X server, which is known to do wacky things with signals and
possibly threads. But the answer to that is to just put it in and get it
tested.

> operations within cairo_image_surface_t. In the past, I believe we've
> suggested grander schemes that would require us to expose the
> asynchronous nature to the user. However, simply using OpenMP to
> parallelise the kernels should not leak across the interface and so it is
> acceptable. So it just boils down to whether this makes maintenance
> harder and interferes with future plans...

At some point, I think grander schemes will be useful, where grander
scheme might mean rolling our own thread pool and/or adding an
asynchronous API to pixman.

One case is radial gradients. These are generated through iterators, and
I am not sure that OpenMP is up to the task of parallelizing those. That
is, it doesn't seem likely that OpenMP can deal with code like this:

       iter_init (&src_iter, height);
       iter_init (&dest_iter, height);
       for (i = 0; i < height; ++i)
       {
           iter_fetch (&src_iter);
           iter_fetch (&dest_iter);
           combine ();
           iter_write (&dest_iter);
       }

But that doesn't mean that OpenMP can't be used for the things that it
will deal with.
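
For contrast, here is a rough sketch of the kind of loop OpenMP does
handle well. It assumes each scanline can be addressed directly from the
row index, so no iteration depends on state left behind by the previous
one (which is what the iterator loop above appears to rely on). The
names combine_scanline and composite_rows, and the saturating-add
operation, are made up for illustration and are not pixman functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-scanline kernel: saturating add of src onto dest. */
static void
combine_scanline (uint8_t *dest, const uint8_t *src, int width)
{
    for (int x = 0; x < width; ++x)
    {
        unsigned sum = (unsigned) dest[x] + src[x];
        dest[x] = (uint8_t) (sum > 255 ? 255 : sum);
    }
}

/* Each row is located from y alone, so the iterations are independent
 * and OpenMP may hand different rows to different threads. Without
 * OpenMP enabled the pragma is ignored and the loop runs serially. */
void
composite_rows (uint8_t *dest, const uint8_t *src,
                int width, int height,
                int dest_stride, int src_stride)
{
    int y;

#pragma omp parallel for
    for (y = 0; y < height; ++y)
    {
        combine_scanline (dest + (size_t) y * dest_stride,
                          src + (size_t) y * src_stride,
                          width);
    }
}
```

The iterator-based loop would need something similar, for example
iterators that can be positioned at an arbitrary row, before it could be
split up this way.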

> Is there a way to hint to OpenMP how many threads to use? As we know the
> memory characteristics for most of the routines, do we not want to hint to
> OMP not to use more threads than required to saturate memory bw?

We know the memory characteristics, but the arithmetic characteristics
are less predictable. If some operation is doing a lot of arithmetic, we
want more threads for it.
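
On the question of hinting: OpenMP does have a num_threads clause (as
well as omp_set_num_threads() and the OMP_NUM_THREADS environment
variable), so a per-operation thread count is possible in principle. The
sketch below is only an illustration of that idea. The choose_threads
heuristic and its "one extra thread per 8 arithmetic operations" ratio
are invented, not measured, and scale_rows is a hypothetical kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical heuristic: memory-bound operations (few arithmetic
 * operations per pixel) get one thread, heavier ones get more, capped
 * at max_threads. The divisor 8 is a placeholder, not a tuned value. */
int
choose_threads (int ops_per_pixel, int max_threads)
{
    int n = 1 + ops_per_pixel / 8;

    if (n > max_threads)
        n = max_threads;
    if (n < 1)
        n = 1;
    return n;
}

/* Hypothetical kernel: multiply every pixel by gain, using at most
 * n_threads threads for the row loop. */
void
scale_rows (uint16_t *bits, int width, int height, int stride,
            unsigned gain, int n_threads)
{
    int y;

    if (n_threads < 1)
        n_threads = 1;   /* num_threads requires a positive value */

#pragma omp parallel for num_threads(n_threads)
    for (y = 0; y < height; ++y)
    {
        uint16_t *row = bits + (size_t) y * stride;

        for (int x = 0; x < width; ++x)
            row[x] = (uint16_t) (row[x] * gain);
    }
}
```

A caller would do something like
scale_rows (bits, w, h, stride, gain, choose_threads (ops, 4)), where
the ops estimate would have to come from knowledge of each fast path.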

What would be the performance impact of just parallelizing as much as
possible? I suppose if one thread can saturate the memory bandwidth,
having more threads would just pointlessly occupy more cores that could
be used for other purposes. I don't know how much of a concern that
actually is though.

I suppose a JIT compiler might be able to make an estimate of the number
of cycles per cache line accessed for the code it generated.

> Otherwise it's a big win for such a tiny patch! Just need to cross-check
> that we don't introduce regression on the older single-core no-cache
> chips. :(

Even if it is a small performance regression on single-core chips, I
still think it's worth it. Single-core chips are quickly becoming a
thing of the past, and we could offer a --disable-omp configure argument
for embedded systems where the CPU is known to be single-core ahead of
time.
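
One way such a switch could work, assuming configure simply leaves out
the compiler's OpenMP flag (for example -fopenmp with GCC) when
--disable-omp is given: the pragmas are then ignored and the same source
compiles to serial code, while the standard _OPENMP macro guards
anything that needs the OpenMP runtime. The function names below are
hypothetical.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Upper bound on worker threads: asks the OpenMP runtime when it is
 * compiled in, and reports 1 for a build without OpenMP. */
int
worker_thread_limit (void)
{
#ifdef _OPENMP
    return omp_get_max_threads ();
#else
    return 1;
#endif
}

/* Same source for both builds: with OpenMP the loop is split across
 * threads and the partial sums are combined by the reduction clause;
 * without it the pragma is ignored and the loop runs serially. */
long
sum_values (const int *values, int n)
{
    long total = 0;
    int i;

#pragma omp parallel for reduction(+:total)
    for (i = 0; i < n; ++i)
        total += values[i];

    return total;
}
```

Either build gives the same result from sum_values; only the number of
threads used differs.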

Soren