[Pixman] Image scaling with bilinear interpolation performance

Tue Feb 22 08:29:29 PST 2011

김태균 <podain77 at gmail.com> writes:

> original code : r = a*t0 + b*t1 + c*t2 + d*t3 (in 24 bits precision)
> optimized code : r' = a*(t0 >> 8) + b*(t1 >> 8) + c*(t2 >> 8) + d*(t3 >> 8)
> (in 16 bits precision)
> where t0 + t1 + t2 + t3 = 0x10000
> 
> Now we split "t" into two terms u, v where u is upper 8 bits of t and v is
> lower 8 bits of t. (note that t0 = u0*256 + v0, t0 >> 8 = u0)
> 
> So,
> 
> r' = a*u0 + b*u1 + c*u2 + d*u3
> 
> r = a*(u0*256 + v0) + b*(u1*256 + v1) + c*(u2*256 + v2) + d*(u3*256 + v3)
>   = 256*(a*u0 + b*u1 + c*u2 + d*u3) + a*v0 + b*v1 + c*v2 + d*v3
>   = 256*r' + a*v0 + b*v1 + c*v2 + d*v3
> 
> Error would be
> e = (r - (r' << 8)) >> 16 = (r - 256*r') >> 16 = (a*v0 + b*v1 + c*v2 + d*v3)
> >> 16
> 
> Each value a, b, c and d can be 0xff at most, So
> 
> max(e) = (0xff*(v0 + v1 + v2 + v3)) >> 16 = (0xff*max(v0 + v1 + v2 + v3)) >>
> 16
> 
> max(v0 + v1 + v2 + v3) = 0x300 (because lower 8 bits of t0 + t1 + t2 + t3
> should be 0x00)
> 
> So max(e) = (0xff*0x300) >> 16 = 2
> 
> But this does not satisfy rule 5 as you mentioned

Thanks for doing this analysis. A difference of just 2 would be fine
in my opinion, and as you mention the original code was an
approximation as well.

It would be possible to satisfy rule 5 using a kind of error
diffusion, as as demonstrated by this program:

    static void
    compute_weights (uint8_t distx, uint8_t disty)
    {
        uint32_t distxy, distxiy, distixy, distixiy;
        int e, t;

        distxy = distx * disty;
        distxiy = (distx << 8) - distxy;
        distixy = (disty << 8) - distxy;
        distixiy = 256 * 256 - (disty << 8) - (distx << 8) + distxy;

        t = distxy + 0x80;
        e = (t & 0xffffff00) - distxy;
        distxy = t >> 8;

        distxiy -= e;
        t = distxiy + 0x80;
        e = (t & 0xffffff00) - distxiy;
        distxiy = t >> 8;

        distixy -= e;
        t = distixy + 0x80;
        e = (t & 0xffffff00) - distixy;
        distixy = t >> 8;

        distixiy -= e;
        t = distixiy + 0x80;
        e = (t & 0xffffff00) - distixiy;
        distixiy = t >> 8;

        assert (distxy + distxiy + distixy + distixiy == 256);
    }

    int
    main ()
    {
        int i, j;

        for (i = 0; i < 256; ++i)
        {
            for (j = 0; j < 256; ++j)
                compute_weights (i, j);
        }
    }

although that does do a bit more arithmetic than your code.

> > Now regarding accuracy. I have added some comments above regarding the
> > potential solid color issue, but this should be relatively easy to
> address. I'm
> > also a bit worried about one more thing (in the original pixman code too,
> but
> > let's cover this too while we are discussing accuracy in general).
> Wouldn't it
> > be a good idea to do shift with rounding for the final value instead of
> > dropping the fractional part? And the 'distx'/'disty' variables are also
> > obtained by right shifting 'ux' by 8 and dropping fractional part, maybe
> > rounding would be more appropriate. Not doing rounding might cause slight
> image
> > drift to the left (and top) on repeated rescaling, and also slight
> reduction of
> > average brightness.
> 
> I agree with that rounding is more appropriate.
> I think supplying distx and disty as properly rounded 4 bits values to
> interpolation function is the best choice we have.
> 
> Analysis on error is some what complicated in this case.
> Error may be bigger than previous code, at least 15 (I've done some brute
> force jobs)

Rounding to four bits is going to be a quite visible drop in quality
though, especially if you zoom more than 16x. With four bits of
precision, there will be only 16 different colors in the gradients
generated by the filter, which will show up as banding.

But maybe it's good enough - 16x scaling is not going to look great
with bilinear filtering no matter what.

> > I have only one concern about testing. Supposedly when we get both C and
> SSE2
> > implementations, it would be much easier for testing if they produce
> identical
> > results. Otherwise tests need to be improved to somehow be able to take
> slight
> > differences into account.
> 
> I think the requirement of producing same results for both C & SIMD(maybe
> sse2, NEON, mmx) is relatively easy.
> But SIMD can produce much better result with less time spent, which can be
> horribly slow with general C implementation.
> I think it is much desirable to keep both C and SIMD code optimized in spite
> of producing slightly different results.

Having the C and SIMD code produce different results is not a problem
in itself, but as Siarhei says, but we would need to make sure the
test suite reflects that decision.

If we decide to move away from bit-exact testing, we would need to
decide on an acceptable deviation from ideal, and then update the
tests to verify that both the C and SIMD implementations are within
that deviation.

For example there could be a reference implementation that computes
the bilinear filtering in double precision floating point, adds +/-
x%, then converts that range into the target bitdepth and finally
verifies that the output is within that range.

Soren