<div>Hi,</div>Thank you for the reply.<div><br></div><div><span class="Apple-style-span" style="border-collapse: collapse; font-family: arial, sans-serif; font-size: 13px; "><font class="Apple-style-span" color="#6600CC">> Regarding performance, improving it twice is still a little bit too slow on the<br>
> hardware which has SIMD. On x86, support for SSE2 is pretty much common, so it<br>> is quite natural to use it if it proves to be beneficial. But for the low end<br>> embedded machines with primitive processors without SIMD it may be indeed very<br>
> good to have any kind of performance improvements.</font></span></div><div><font class="Apple-style-span" face="arial, sans-serif"><span class="Apple-style-span" style="border-collapse: collapse;"><br></span></font></div>
<div><font class="Apple-style-span" face="arial, sans-serif"><span class="Apple-style-span" style="border-collapse: collapse;">Yes, right.</span></font></div><div><font class="Apple-style-span" face="arial, sans-serif"><span class="Apple-style-span" style="border-collapse: collapse;">I will use SIMD as fully as I can. (NEON is available on some of our target machines.)</span></font></div>
<div><font class="Apple-style-span" face="arial, sans-serif"><span class="Apple-style-span" style="border-collapse: collapse;">But I have to consider not only high-end machines but also low-end ones that do not support SIMD.</span></font></div>
<div><font class="Apple-style-span" face="arial, sans-serif"><span class="Apple-style-span" style="border-collapse: collapse;">That's why I'm trying to optimize the general non-SIMD code path.</span></font></div><div>
<div><br></div><div>So let's get back to accuracy.</div><div><br></div><div>I was wrong to say that the error would be at most 1.</div><div>The upper bound of the error is 2.</div><div><br></div><div>The following is my analysis of the error between the original code and the optimized code.</div>
<div>(The bilinearly interpolated result is a weighted sum of 4 values; to simplify, let's write a, b, c, d for tl, tr, bl, br.)</div><div><br></div><div>original code : r = a*t0 + b*t1 + c*t2 + d*t3 (24-bit precision)</div>
<div>optimized code : r' = a*(t0 >> 8) + b*(t1 >> 8) + c*(t2 >> 8) + d*(t3 >> 8) (16-bit precision)</div><div>where t0 + t1 + t2 + t3 = 0x10000</div><div><br></div><div>Now we split each t into two terms u and v, where u is the upper 8 bits of t and v is the lower 8 bits of t. (Note that t0 = u0*256 + v0 and t0 >> 8 = u0.)</div>
<div><br></div><div>So,</div><div><br></div><div>r' = a*u0 + b*u1 + c*u2 + d*u3</div><div><br></div><div>r = a*(u0*256 + v0) + b*(u1*256 + v1) + c*(u2*256 + v2) + d*(u3*256 + v3)</div>
<div> = 256*(a*u0 + b*u1 + c*u2 + d*u3) + a*v0 + b*v1 + c*v2 + d*v3</div><div> = 256*r' + a*v0 + b*v1 + c*v2 + d*v3</div><div><br></div><div>The error is then</div>
<div>e = (r - (r' << 8)) >> 16 = (r - 256*r') >> 16 = (a*v0 + b*v1 + c*v2 + d*v3) >> 16</div><div><br></div><div>Each of a, b, c and d is at most 0xff, so</div><div><br></div><div>max(e) = (0xff*max(v0 + v1 + v2 + v3)) >> 16</div>
<div><br></div><div>max(v0 + v1 + v2 + v3) <= 0x300 (each v is at most 0xff, so the sum is at most 0x3fc, and it must be a multiple of 0x100 because the lower 8 bits of t0 + t1 + t2 + t3 are 0x00)</div><div><br></div><div>So max(e) <= (0xff*0x300) >> 16 = 2</div><div><br></div><div>But this code does not satisfy rule 5, as you mentioned:</div>
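<div><br></div><div>Both points can be checked by brute force. The following is only a sketch, not the actual pixman code: it assumes the 16-bit weights are built from 8-bit distx and disty as in the original code, it measures the error as the difference of the final 8-bit outputs, (r >> 16) - (r' >> 8), and it only tries corner values of 0x00 and 0xff. The names scan_result_t and scan_truncated_weights are made up for this example.</div>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for the scan below. */
typedef struct
{
    int max_error;      /* largest (r >> 16) - (r' >> 8) that was seen */
    int min_weight_sum; /* smallest u0 + u1 + u2 + u3 that was seen */
} scan_result_t;

static scan_result_t
scan_truncated_weights (void)
{
    scan_result_t res = { 0, 256 };
    int distx, disty, corners, i;

    for (disty = 0; disty < 256; disty++)
    {
        for (distx = 0; distx < 256; distx++)
        {
            uint32_t t[4], usum = 0;
            uint32_t distxy = (uint32_t) (distx * disty);

            /* 16-bit weights for a (tl), b (tr), c (bl), d (br). */
            t[0] = 0x10000 - ((uint32_t) disty << 8)
                           - ((uint32_t) distx << 8) + distxy;
            t[1] = ((uint32_t) distx << 8) - distxy;
            t[2] = ((uint32_t) disty << 8) - distxy;
            t[3] = distxy;
            assert (t[0] + t[1] + t[2] + t[3] == 0x10000);

            /* Sum of the truncated weights u = t >> 8. */
            for (i = 0; i < 4; i++)
                usum += t[i] >> 8;
            if ((int) usum < res.min_weight_sum)
                res.min_weight_sum = (int) usum;

            /* Try every combination of 0x00 and 0xff corner values. */
            for (corners = 0; corners < 16; corners++)
            {
                uint32_t r = 0, r2 = 0;
                int err;

                for (i = 0; i < 4; i++)
                {
                    uint32_t v = ((corners >> i) & 1) ? 0xff : 0x00;

                    r += v * t[i];         /* original: 24-bit result */
                    r2 += v * (t[i] >> 8); /* optimized: 16-bit result */
                }
                err = (int) (r >> 16) - (int) (r2 >> 8);
                if (err > res.max_error)
                    res.max_error = err;
            }
        }
    }
    return res;
}
```

<div>Under these assumptions the scan reports a maximum error of 2 and a smallest weight sum of 254, so a solid 0xff channel can come out as 0xfd, which is exactly the rule 5 problem.</div>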
<div><br></div><div><span class="Apple-style-span" style="border-collapse: collapse; font-family: arial, sans-serif; font-size: 13px; "><font class="Apple-style-span" color="#6600CC"><div class="im">
> Wouldn't it be preferred to have (distxy + distxiy + distixy + distixiy) == 256</div>> here? My guess is that it may be not always the case based on looking at your<br>> code. Which will be a violation of rule "5. Resampling a solid color should<br>
> give a solid color" from <a href="http://www.virtualdub.org/blog/pivot/entry.php?id=86" target="_blank">http://www.virtualdub.org/blog/pivot/entry.php?id=86</a></font></span></div><div><br></div><div>I slightly modified the code to satisfy rule 5.</div>
</div><div>I reduced the precision of distx and disty from 8 bits to 4 bits.</div><div><br></div><div><div>static force_inline uint32_t bilinear_interpolation (uint32_t tl, uint32_t tr, </div><div>uint32_t bl, uint32_t br, int distx, int disty)</div>
</div><div>{</div><div><div> int distixiy, distxiy, distixy, distxy;</div><div> uint32_t rb, ga;</div><div><br></div><div> /* Keep only the upper 4 bits of each 8-bit fraction. */</div><div> distx = distx >> 4;</div><div> disty = disty >> 4;</div><div><br></div>
<div> /* Weights for br, bl, tr and tl; they always sum to 256. */</div><div> distxy = distx * disty;</div><div> distixy = (disty << 4) - distxy;</div><div> distxiy = (distx << 4) - distxy;</div><div> distixiy = 256 - (disty << 4) - (distx << 4) + distxy;</div>
<div><br></div><div> /* Red and blue: two 8-bit channels, each in its own 16-bit lane. */</div><div> rb = (0x00FF00FF & tl)*distixiy + (0x00FF00FF & tr)*distxiy + (0x00FF00FF & bl)*distixy + (0x00FF00FF & br)*distxy;</div><div> rb = (rb >> 8) & 0x00FF00FF;</div><div><br></div>
<div> /* Alpha and green: the result is already in the upper byte of each lane. */</div><div> ga = (0x00FF00FF & (tl >> 8))*distixiy + (0x00FF00FF & (tr >> 8))*distxiy + (0x00FF00FF & (bl >> 8))*distixy + (0x00FF00FF & (br >> 8))*distxy;</div><div> ga = ga & 0xFF00FF00;</div>
<div><br></div><div> return rb | ga;</div></div><div>}</div><div><br></div><div>Now we have distxy + distixy + distxiy + distixiy == 256.</div><div><br></div><div><span class="Apple-style-span" style="border-collapse: collapse; font-family: arial, sans-serif; font-size: 13px; "><font class="Apple-style-span" color="#6600CC">> Now regarding accuracy. I have added some comments above regarding the<br>
> potential solid color issue, but this should be relatively easy to address. I'm<br>> also a bit worried about one more thing (in the original pixman code too, but<br>> let's cover this too while we are discussing accuracy in general). Wouldn't it<br>
> be a good idea to do shift with rounding for the final value instead of<br>> dropping the fractional part? And the 'distx'/'disty' variables are also<br>> obtained by right shifting 'ux' by 8 and dropping fractional part, maybe<br>
> rounding would be more appropriate. Not doing rounding might cause slight image<br>> drift to the left (and top) on repeated rescaling, and also slight reduction of<br>> average brightness.</font></span></div><div>
<font class="Apple-style-span" color="#6600CC" face="arial, sans-serif"><span class="Apple-style-span" style="border-collapse: collapse;"><br></span></font></div><div>I agree that rounding is more appropriate.</div><div>
I think supplying distx and disty to the interpolation function as properly rounded 4-bit values is the best choice we have.</div><div><br></div><div>The error analysis is somewhat more complicated in this case.</div><div>The error can be larger than with the previous code: brute-force testing shows a maximum error of at least 15.</div>
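<div><br></div><div>A brute-force comparison of this kind could look like the sketch below. It is not the test I actually ran: it works on a single 8-bit channel, it only tries corner values of 0x00 and 0xff, and it derives the 4-bit fractions from the 8-bit ones (with rounding as an option, which gives values in 0..16), whereas real code would round from the higher-precision coordinate. The names interp_8bit, interp_4bit and max_error_4bit are made up for this example.</div>

```c
#include <assert.h>
#include <stdlib.h>

/* One 8-bit channel, original path: 16-bit weights, c = {tl, tr, bl, br}. */
static int
interp_8bit (const int c[4], int distx, int disty)
{
    int distxy = distx * disty;
    int distxiy = (distx << 8) - distxy;
    int distixy = (disty << 8) - distxy;
    int distixiy = 0x10000 - (disty << 8) - (distx << 8) + distxy;

    return (c[0] * distixiy + c[1] * distxiy +
            c[2] * distixy + c[3] * distxy) >> 16;
}

/* One 8-bit channel, reduced path: distx4 and disty4 are in 0..16. */
static int
interp_4bit (const int c[4], int distx4, int disty4)
{
    int distxy = distx4 * disty4;
    int distxiy = (distx4 << 4) - distxy;
    int distixy = (disty4 << 4) - distxy;
    int distixiy = 256 - (disty4 << 4) - (distx4 << 4) + distxy;

    /* Rule 5: the weights must sum to exactly 256. */
    assert (distxy + distxiy + distixy + distixiy == 256);

    return (c[0] * distixiy + c[1] * distxiy +
            c[2] * distixy + c[3] * distxy) >> 8;
}

/* Largest difference between the two paths over all 8-bit fractions. */
static int
max_error_4bit (int use_rounding)
{
    int bias = use_rounding ? 8 : 0; /* 8 rounds to nearest, 0 truncates */
    int max_err = 0;
    int corners, distx, disty, i;

    for (corners = 0; corners < 16; corners++)
    {
        int c[4];

        for (i = 0; i < 4; i++)
            c[i] = ((corners >> i) & 1) ? 0xff : 0x00;

        for (disty = 0; disty < 256; disty++)
        {
            for (distx = 0; distx < 256; distx++)
            {
                int err = abs (interp_8bit (c, distx, disty) -
                               interp_4bit (c, (distx + bias) >> 4,
                                            (disty + bias) >> 4));
                if (err > max_err)
                    max_err = err;
            }
        }
    }
    return max_err;
}
```

<div>Calling max_error_4bit (0) measures the truncating code above, and max_error_4bit (1) measures the rounded variant; in this sketch the rounded variant has the smaller maximum error.</div>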
<div><br></div><div><br></div><div><font class="Apple-style-span" color="#6600CC">> <span class="Apple-style-span" style="border-collapse: collapse; font-family: arial, sans-serif; font-size: 13px; ">I have only one concern about testing. Supposedly when we get both C and SSE2</span></font></div>
<span class="Apple-style-span" style="border-collapse: collapse; font-family: arial, sans-serif; font-size: 13px; "><font class="Apple-style-span" color="#6600CC">> implementations, it would be much easier for testing if they produce identical<br>
> results. Otherwise tests need to be improved to somehow be able to take slight<br></font></span><div><font class="Apple-style-span" color="#6600CC"><span class="Apple-style-span" style="border-collapse: collapse; font-family: arial, sans-serif; font-size: 13px; ">> differences into account.</span> </font></div>
<div><br></div><div>I think the requirement of producing the same results from both the C and SIMD (maybe SSE2, NEON, MMX) implementations is relatively easy to meet.</div><div>But SIMD can produce much more accurate results in less time, using computations that would be horribly slow in a general C implementation.</div>
<div>I think it is much more desirable to keep both the C and SIMD code optimized, even if they produce slightly different results.</div>
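<div><br></div><div>If the results are allowed to differ, the tests could compare pixels with a small per-channel tolerance instead of requiring exact equality. The helper below is only a sketch of that idea; pixels_match is a made-up name and not an existing pixman test function, and it assumes 32-bit pixels with four 8-bit channels.</div>

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical test helper: returns 1 when every 8-bit channel of the two
 * 32-bit pixels differs by at most `tolerance`, and 0 otherwise.
 */
static int
pixels_match (uint32_t p, uint32_t q, int tolerance)
{
    int shift;

    for (shift = 0; shift < 32; shift += 8)
    {
        int cp = (int) ((p >> shift) & 0xff);
        int cq = (int) ((q >> shift) & 0xff);

        if (abs (cp - cq) > tolerance)
            return 0;
    }
    return 1;
}
```

<div>A tolerance of 0 gives the exact comparison again, so the same helper could serve code paths that are required to match bit for bit.</div>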