[Pixman] Image scaling with bilinear interpolation performance

Thu Feb 17 08:40:47 PST 2011

김태균 <podain77 at gmail.com> writes:

> I'm using cairo and pixman to composite images.
> I tested image scaling performance with bilinear filtering.
> But it's a little bit slower than I expected.

Bilinear filtering is indeed one of the remaining places where pixman
could use some performance improvement, so thanks for looking into it.

> My understanding of the image scaling procedure is as follows:
> 
> 1. Allocate (or acquire a statically allocated) horizontal source
> scanline buffer to store interpolated pixels.
> 2. Fetch a scanline from the source image into the source scanline buffer
> with bilinear interpolation.
> 3. Composite the source scanline and destination scanline with operator SRC.
> 4. Go to the next scanline.
> 5. Free (or release) the source scanline buffer.

Yes, this is how it works currently, though note that there have been
some changes since 0.21.2: we now use iterators to access the
images. The overall algorithm is still the same though.
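
To make the five steps concrete, here is a minimal sketch of that
loop. It is not pixman's actual code: image_t, clamp_index,
bilinear_pixel, fetch_scanline_bilinear and scale_bilinear_src are
hypothetical names, the format is assumed to be a8r8g8b8, edges are
clamped, and sampling starts at the top-left corner with no
pixel-centre offset.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical image type for illustration; not pixman's bits_image_t. */
typedef struct
{
    uint32_t *pixels;   /* a8r8g8b8 */
    int       width, height;
    int       stride;   /* in pixels, not bytes */
} image_t;

static int
clamp_index (int v, int max)
{
    return v > max ? max : v;
}

/* Bilinear blend of four pixels, channel by channel; weights are 0..255. */
static uint32_t
bilinear_pixel (uint32_t tl, uint32_t tr, uint32_t bl, uint32_t br,
                int distx, int disty)
{
    uint32_t result = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8)
    {
        uint32_t v = ((tl >> shift) & 0xff) * (256 - distx) * (256 - disty)
                   + ((tr >> shift) & 0xff) * distx * (256 - disty)
                   + ((bl >> shift) & 0xff) * (256 - distx) * disty
                   + ((br >> shift) & 0xff) * distx * disty;
        result |= ((v >> 16) & 0xff) << shift;
    }
    return result;
}

/* Step 2: fetch one interpolated scanline; x, y, unit_x are 16.16 fixed. */
static void
fetch_scanline_bilinear (const image_t *src, int32_t x, int32_t y,
                         int32_t unit_x, uint32_t *buffer, int width)
{
    int y1 = clamp_index (y >> 16, src->height - 1);
    int y2 = clamp_index ((y >> 16) + 1, src->height - 1);
    int disty = (y >> 8) & 0xff;
    const uint32_t *top = src->pixels + y1 * src->stride;
    const uint32_t *bottom = src->pixels + y2 * src->stride;
    int i;

    for (i = 0; i < width; i++)
    {
        int x1 = clamp_index (x >> 16, src->width - 1);
        int x2 = clamp_index ((x >> 16) + 1, src->width - 1);
        int distx = (x >> 8) & 0xff;

        buffer[i] = bilinear_pixel (top[x1], top[x2], bottom[x1], bottom[x2],
                                    distx, disty);
        x += unit_x;
    }
}

/* Scale src to fill dst with operator SRC. Returns 0 on success. */
static int
scale_bilinear_src (const image_t *src, image_t *dst)
{
    int32_t unit_x, unit_y, y = 0;
    uint32_t *scanline;
    int j;

    assert (src->width > 0 && src->height > 0);
    assert (dst->width > 0 && dst->height > 0);

    unit_x = (int32_t) (((int64_t) src->width << 16) / dst->width);
    unit_y = (int32_t) (((int64_t) src->height << 16) / dst->height);

    scanline = malloc (dst->width * sizeof (uint32_t));          /* step 1 */
    if (!scanline)
        return -1;

    for (j = 0; j < dst->height; j++)
    {
        fetch_scanline_bilinear (src, 0, y, unit_x,
                                 scanline, dst->width);          /* step 2 */
        memcpy (dst->pixels + j * dst->stride, scanline,
                dst->width * sizeof (uint32_t));                 /* step 3 */
        y += unit_y;                                             /* step 4 */
    }

    free (scanline);                                             /* step 5 */
    return 0;
}
```

With operator SRC the composite step is just a copy, which is why the
intermediate buffer looks redundant in this case.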

> My optimization points are:
> 
> 1. It seems possible to avoid fetching by directly combining bilinear
> interpolated result with destination.
> 2. Bilinear interpolation can be optimized using double-blend trick with a
> little bit of precision loss.
> 
> Optimization 1 can reduce one memory write and one memory read operation.
> I think it will not affect the performance significantly.
> However, it can be helpful on some machines with limited cache.

Indeed, it would likely be helpful to do this, even on machines
where the temporary scanline fits in cache. There is some discussion
in this thread

    http://lists.freedesktop.org/archives/pixman/2011-January/000934.html

about one way to implement this optimization.
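
For the SRC case, skipping the temporary buffer could look roughly
like the sketch below, which interpolates one pair of source rows
straight into the destination row. This is only an illustration and
not necessarily what that thread proposes: lerp_channel and
scale_row_direct are hypothetical names, the format is assumed to be
a8r8g8b8, and x coordinates are assumed non-negative with the right
edge clamped.

```c
#include <assert.h>
#include <stdint.h>

/* Interpolate the 8-bit channel at bit offset `shift`; weights are 0..255. */
static uint32_t
lerp_channel (uint32_t tl, uint32_t tr, uint32_t bl, uint32_t br,
              int shift, int distx, int disty)
{
    uint32_t v = ((tl >> shift) & 0xff) * (256 - distx) * (256 - disty)
               + ((tr >> shift) & 0xff) * distx * (256 - disty)
               + ((bl >> shift) & 0xff) * (256 - distx) * disty
               + ((br >> shift) & 0xff) * distx * disty;

    return ((v >> 16) & 0xff) << shift;
}

/* Hypothetical helper: write interpolated pixels directly into dst_row
 * (operator SRC), with no intermediate scanline buffer.
 * x and unit_x are 16.16 fixed point; disty is the vertical weight. */
static void
scale_row_direct (const uint32_t *top, const uint32_t *bottom, int src_width,
                  int disty, int32_t x, int32_t unit_x,
                  uint32_t *dst_row, int dst_width)
{
    int i;

    assert (src_width > 0 && x >= 0 && unit_x >= 0);

    for (i = 0; i < dst_width; i++)
    {
        int x1 = x >> 16;
        int x2 = x1 + 1;
        int distx = (x >> 8) & 0xff;

        /* Clamp both sample columns to the last source pixel. */
        if (x1 > src_width - 1)
            x1 = src_width - 1;
        if (x2 > src_width - 1)
            x2 = src_width - 1;

        dst_row[i] =
            lerp_channel (top[x1], top[x2], bottom[x1], bottom[x2], 0, distx, disty) |
            lerp_channel (top[x1], top[x2], bottom[x1], bottom[x2], 8, distx, disty) |
            lerp_channel (top[x1], top[x2], bottom[x1], bottom[x2], 16, distx, disty) |
            lerp_channel (top[x1], top[x2], bottom[x1], bottom[x2], 24, distx, disty);

        x += unit_x;
    }
}
```

For operators other than SRC, the store to dst_row[i] would become a
read-modify-write with the destination pixel, which is where the
saved memory traffic would come from.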

> Optimization 2 is more critical for image scaling with bilinear filtering.
> I wrote some code for bilinear_interpolation in pixman-bits-image.c
> 
> // Optimization 2
> static force_inline uint32_t
> bilinear_interpolation (uint32_t tl, uint32_t tr, uint32_t bl, uint32_t br,
>                         int distx, int disty)
> {
>     int distxy, distxiy, distixy, distixiy;
>     uint32_t rb, ga;
> 
>     distxy = distx * disty;
>     distxiy = (disty << 8) - distxy;
>     distixy = (distx << 8) - distxy;
>     distixiy = 256*256 - (disty << 8) - (distx << 8) + distxy;
> 
>     distxy = distxy >> 8;
>     distxiy = distxiy >> 8;
>     distixy = distixy >> 8;
>     distixiy = distixiy >> 8;
> 
>     /* Red and Blue */
>     rb = (0x00FF00FF & tl)*distixiy + (0x00FF00FF & tr)*distixy +
>          (0x00FF00FF & bl)*distxiy + (0x00FF00FF & br)*distxy;
>     rb = (rb >> 8) & 0x00FF00FF;
> 
>     /* Green and Alpha */
>     ga = (0x00FF00FF & (tl >> 8))*distixiy + (0x00FF00FF & (tr >> 8))*distixy +
>          (0x00FF00FF & (bl >> 8))*distxiy + (0x00FF00FF & (br >> 8))*distxy;
>     ga = ga & 0xFF00FF00;
> 
>     return rb | ga;
> }
> 
> Optimization 2 more than doubled the performance.
> But it does not always produce the same result as the original code, which
> is also an approximation of bilinear interpolation.
> So it can cause pixel correctness issues on some test cases.

> I've done some numerical analysis and it appears that each RGBA value
> would differ from the original code's result by at most 1.
> 
> Let me know if you need more information / explanation.

I'd like to see the analysis that shows that the difference is at most
1. If the difference really is that small, then your code above looks
like a big performance win. (Although it looks to me like tr and bl
may have been swapped in the code).
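
Short of a written analysis, a brute-force comparison along the
following lines might be a useful starting point. It is only a
sketch: bilinear_approx is transcribed from the quoted code, while
bilinear_reference is my assumption of what a full 16-bit-weight
version computes (it is not copied from pixman), and
max_channel_diff and worst_case_diff are hypothetical helpers. It
sweeps all 256x256 weight pairs, but only over the sample pixels it
is given, so it is not exhaustive.

```c
#include <assert.h>
#include <stdint.h>

/* The proposed approximation, transcribed from the quoted message. */
static uint32_t
bilinear_approx (uint32_t tl, uint32_t tr, uint32_t bl, uint32_t br,
                 int distx, int disty)
{
    int distxy, distxiy, distixy, distixiy;
    uint32_t rb, ga;

    distxy = distx * disty;
    distxiy = (disty << 8) - distxy;
    distixy = (distx << 8) - distxy;
    distixiy = 256 * 256 - (disty << 8) - (distx << 8) + distxy;

    /* Truncate the 16-bit weights to 8 bits (this is the precision loss). */
    distxy >>= 8;
    distxiy >>= 8;
    distixy >>= 8;
    distixiy >>= 8;

    rb = (0x00FF00FF & tl) * distixiy + (0x00FF00FF & tr) * distixy +
         (0x00FF00FF & bl) * distxiy + (0x00FF00FF & br) * distxy;
    rb = (rb >> 8) & 0x00FF00FF;

    ga = (0x00FF00FF & (tl >> 8)) * distixiy + (0x00FF00FF & (tr >> 8)) * distixy +
         (0x00FF00FF & (bl >> 8)) * distxiy + (0x00FF00FF & (br >> 8)) * distxy;
    ga = ga & 0xFF00FF00;

    return rb | ga;
}

/* Assumed reference: per-channel blend with untruncated 16-bit weights. */
static uint32_t
bilinear_reference (uint32_t tl, uint32_t tr, uint32_t bl, uint32_t br,
                    int distx, int disty)
{
    uint32_t result = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8)
    {
        uint32_t v = ((tl >> shift) & 0xff) * (256 - distx) * (256 - disty)
                   + ((tr >> shift) & 0xff) * distx * (256 - disty)
                   + ((bl >> shift) & 0xff) * (256 - distx) * disty
                   + ((br >> shift) & 0xff) * distx * disty;
        result |= ((v >> 16) & 0xff) << shift;
    }
    return result;
}

/* Largest absolute per-channel difference between two a8r8g8b8 pixels. */
static int
max_channel_diff (uint32_t a, uint32_t b)
{
    int worst = 0, shift;

    for (shift = 0; shift < 32; shift += 8)
    {
        int d = (int) ((a >> shift) & 0xff) - (int) ((b >> shift) & 0xff);
        if (d < 0)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}

/* Sweep every weight pair over every 4-tuple drawn from `samples`. */
static int
worst_case_diff (const uint32_t *samples, int n)
{
    int worst = 0;
    int a, b, c, d, distx, disty;

    assert (n > 0);

    for (a = 0; a < n; a++)
    for (b = 0; b < n; b++)
    for (c = 0; c < n; c++)
    for (d = 0; d < n; d++)
    for (distx = 0; distx < 256; distx++)
    for (disty = 0; disty < 256; disty++)
    {
        int diff = max_channel_diff (
            bilinear_approx (samples[a], samples[b], samples[c], samples[d],
                             distx, disty),
            bilinear_reference (samples[a], samples[b], samples[c], samples[d],
                                distx, disty));
        if (diff > worst)
            worst = diff;
    }
    return worst;
}
```

One data point that can be checked by hand: for four 0xffffffff
pixels with distx = disty = 1, the truncated weights come out as 0,
0, 0 and 254, so the approximation returns 0xfdfdfdfd where this
reference returns 0xffffffff, a per-channel difference of 2.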

Thanks,
Soren