[Pixman] Image scaling with bilinear interpolation performance

Mon Feb 21 03:07:31 PST 2011

Hi,
Thank you for the reply.

> Regarding performance, improving it twice is still a little bit too slow
on the
> hardware which has SIMD. On x86, support for SSE2 is pretty much common,
so it
> is quite natural to use it if it proves to be beneficial. But for the low
end
> embedded machines with primitive processors without SIMD it may be indeed
very
> good to have any kind of performance improvements.

Yes, right.
I will fully utilize SIMD as possible as I can. (NEON is available on some
of our target machines)
But I have to consider not only high end machines but also low ends which do
not support SIMD.
That's why I'm trying to optimize non-SIMD general code path.

So let's back to the accuracy.

I was wrong about that error would be at most difference of 1.
The upper bound of error is 2.

Following is my analysis on error between original code and optimized code.
(bilinear interpolated result is a weighted sum of 4 values, let's
substitute tl, tr, bl, br with a, b, c, d for simplification)

original code : r = a*t0 + b*t1 + c*t2 + d*t3 (in 24 bits precision)
optimized code : r' = a*(t0 >> 8) + b*(t1 >> 8) + c*(t2 >> 8) + d*(t3 >> 8)
(in 16 bits precision)
where t0 + t1 + t2 + t3 = 0x10000

Now we split "t" into two terms u, v where u is upper 8 bits of t and v is
lower 8 bits of t. (note that t0 = u0*256 + v0, t0 >> 8 = u0)

So,

r' = a*u0 + b*u1 + c*u2 + d*u3

r = a*(u0*256 + v0) + b*(u1*256 + v1) + c*(u2*256 + v2) + d*(u3*256 + v3)
  = 256*(a*u0 + b*u1 + c*u2 + d*u3) + a*v0 + b*v1 + c*v2 + d*v3
  = 256*r' + a*v0 + b*v1 + c*v2 + d*v3

Error would be
e = (r - (r' << 8)) >> 16 = (r - 256*r') >> 16 = (a*v0 + b*v1 + c*v2 + d*v3)
>> 16

Each value a, b, c and d can be 0xff at most, So

max(e) = (0xff*(v0 + v1 + v2 + v3)) >> 16 = (0xff*max(v0 + v1 + v2 + v3)) >>
16

max(v0 + v1 + v2 + v3) = 0x300 (because lower 8 bits of t0 + t1 + t2 + t3
should be 0x00)

So max(e) = (0xff*0x300) >> 16 = 2

But this does not satisfy rule 5 as you mentioned

> Wouldn't it be preferred to have (distxy + distxiy + distixy + distixiy)
== 256
> here? My guess is that it may be not always the case based on looking at
your
> code. Which will be a violation of rule "5. Resampling a solid color
should
> give a solid color" from
http://www.virtualdub.org/blog/pivot/entry.php?id=86

I slightly modified the code to satisfy rule 5.
I reduced precision of distx and disty from 8 bits to 4 bits.

static force_inline uint32_t bilinear_interpolation (uint32_t tl,
uint32_t tr,
uint32_t bl, uint32_t br, int distx, int disty)
{
    int distixiy, distxiy, distixy, distxy;
    uint32_t rb, ga;

    distx = distx >> 4;
    disty = disty >> 4;

    distxy = distx * disty;
    distixy = (disty << 4) - distxy;
    distxiy = (distx << 4) - distxy;
    distixiy = 256 - (disty << 4) - (distx << 4) + distxy;

    rb = (0x00FF00FF & tl)*distixiy + (0x00FF00FF & tr)*distxiy +
(0x00FF00FF & bl)*distixy + (0x00FF00FF & br)*distxy;
    rb = (rb >> 8) & 0x00FF00FF;

    ga = (0x00FF00FF & (tl >> 8))*distixiy + (0x00FF00FF & (tr >>
8))*distxiy + (0x00FF00FF & (bl >> 8))*distixy + (0x00FF00FF & (br >>
8))*distxy;
    ga = ga & 0xFF00FF00;

    return rb | ga;
}

Now we have distxy + distixy + distxiy + distixiy == 256.

> Now regarding accuracy. I have added some comments above regarding the
> potential solid color issue, but this should be relatively easy to
address. I'm
> also a bit worried about one more thing (in the original pixman code too,
but
> let's cover this too while we are discussing accuracy in general).
Wouldn't it
> be a good idea to do shift with rounding for the final value instead of
> dropping the fractional part? And the 'distx'/'disty' variables are also
> obtained by right shifting 'ux' by 8 and dropping fractional part, maybe
> rounding would be more appropriate. Not doing rounding might cause slight
image
> drift to the left (and top) on repeated rescaling, and also slight
reduction of
> average brightness.

I agree with that rounding is more appropriate.
I think supplying distx and disty as properly rounded 4 bits values to
interpolation function is the best choice we have.

Analysis on error is some what complicated in this case.
Error may be bigger than previous code, at least 15 (I've done some brute
force jobs)

> I have only one concern about testing. Supposedly when we get both C and
SSE2
> implementations, it would be much easier for testing if they produce
identical
> results. Otherwise tests need to be improved to somehow be able to take
slight
> differences into account.

I think the requirement of producing same results for both C & SIMD(maybe
sse2, NEON, mmx) is relatively easy.
But SIMD can produce much better result with less time spent, which can be
horribly slow with general C implementation.
I think it is much desirable to keep both C and SIMD code optimized in spite
of producing slightly different results.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.freedesktop.org/archives/pixman/attachments/20110221/b8461566/attachment.htm>