[Pixman] [PATCH 3/4] sse2: affine bilinear fetcher

Fri Feb 1 04:23:15 PST 2013

On Tue, Jan 29, 2013 at 11:21 AM, Siarhei Siamashka
<siarhei.siamashka at gmail.com> wrote:
> > +    if (BILINEAR_INTERPOLATION_BITS < 8)
> > +    {
> > +     const __m128i xmm_xorc7 = _mm_set_epi16 (0, BMSK, 0, BMSK, 0, BMSK, 0, BMSK);
> > +     const __m128i xmm_addc7 = _mm_set_epi16 (0, 1, 0, 1, 0, 1, 0, 1);
> > +     const __m128i xmm_x = _mm_set_epi16 (dx, dx, dx, dx, dx, dx, dx, dx);
> > +
> > +     /* calculate horizontal weights */
> > +     xmm_wh = _mm_add_epi16 (xmm_addc7, _mm_xor_si128 (xmm_xorc7, xmm_x));
>
> A minor improvement is possible here, which avoids extra calculations:
>
>     const int32_t wh_pair = (BILINEAR_INTERPOLATION_RANGE - dx) | (dx << 16);
>     xmm_wh = _mm_set_epi32 (wh_pair, wh_pair, wh_pair, wh_pair);
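
For reference, the two ways of building the horizontal weights quoted
above can be compared directly. This is only a minimal sketch: the
macro values (7-bit interpolation, BMSK as the range minus one) and
the helper names weights_xor_add, weights_scalar_pair and
weights_match are assumptions made for illustration, not code from
the patch.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Assumed values for illustration: 7-bit interpolation weights. */
#define BILINEAR_INTERPOLATION_BITS 7
#define BILINEAR_INTERPOLATION_RANGE (1 << BILINEAR_INTERPOLATION_BITS)
#define BMSK (BILINEAR_INTERPOLATION_RANGE - 1)

/* Weights computed with SSE2 xor + add, as in the quoted patch.
 * Even 16-bit lanes end up holding (RANGE - dx), odd lanes hold dx. */
static __m128i
weights_xor_add (int16_t dx)
{
    const __m128i xmm_xorc7 = _mm_set_epi16 (0, BMSK, 0, BMSK, 0, BMSK, 0, BMSK);
    const __m128i xmm_addc7 = _mm_set_epi16 (0, 1, 0, 1, 0, 1, 0, 1);
    const __m128i xmm_x = _mm_set1_epi16 (dx); /* dx in all 8 lanes */

    assert (dx >= 0 && dx < BILINEAR_INTERPOLATION_RANGE);
    return _mm_add_epi16 (xmm_addc7, _mm_xor_si128 (xmm_xorc7, xmm_x));
}

/* Weights packed in a scalar register first, then broadcast. */
static __m128i
weights_scalar_pair (int16_t dx)
{
    const int32_t wh_pair = (BILINEAR_INTERPOLATION_RANGE - dx) | (dx << 16);

    assert (dx >= 0 && dx < BILINEAR_INTERPOLATION_RANGE);
    return _mm_set1_epi32 (wh_pair); /* wh_pair in all 4 32-bit lanes */
}

/* Returns 1 when both variants produce a bit-identical register. */
static int
weights_match (int16_t dx)
{
    __m128i eq = _mm_cmpeq_epi8 (weights_xor_add (dx), weights_scalar_pair (dx));

    return _mm_movemask_epi8 (eq) == 0xFFFF;
}
```

Calling weights_match (dx) for every dx from 0 up to (but not
including) BILINEAR_INTERPOLATION_RANGE is a cheap way to confirm that
the two variants are interchangeable before comparing their speed.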

I have to take this back. I expected that reducing the number of SSE2
instructions (which should be the bottleneck) would improve
performance, with the scalar instructions running "for free". However,
the benchmarks are showing strange results, and the compiler-generated
code does not look very good either (I can see unjustified spills to
the stack with gcc 4.7).

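Also, because dx is smaller than BILINEAR_INTERPOLATION_RANGE, the two
operands have no set bits in common, so the OR can be replaced with an
addition: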

wh_pair = (BILINEAR_INTERPOLATION_RANGE - dx) | (dx << 16) =
(BILINEAR_INTERPOLATION_RANGE - dx) + (dx * 65536) =
BILINEAR_INTERPOLATION_RANGE + dx * 65535

The latter variant needs only two scalar instructions (imul + add),
but high multiplication latency may cause performance problems if the
instructions are not scheduled right.
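
The identity above is easy to check exhaustively in scalar code. The
following is a minimal sketch, assuming 7-bit interpolation; the
function names wh_pair_shift_or, wh_pair_mul_add and
wh_pair_forms_agree are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value for illustration: 7-bit interpolation weights. */
#define BILINEAR_INTERPOLATION_BITS 7
#define BILINEAR_INTERPOLATION_RANGE (1 << BILINEAR_INTERPOLATION_BITS)

/* Original form: subtract, shift, or. */
static uint32_t
wh_pair_shift_or (uint32_t dx)
{
    assert (dx < BILINEAR_INTERPOLATION_RANGE);
    return (BILINEAR_INTERPOLATION_RANGE - dx) | (dx << 16);
}

/* Rewritten form: one multiplication and one addition. */
static uint32_t
wh_pair_mul_add (uint32_t dx)
{
    assert (dx < BILINEAR_INTERPOLATION_RANGE);
    return BILINEAR_INTERPOLATION_RANGE + dx * 65535u;
}

/* Returns 1 when both forms agree for every dx in [0, RANGE). */
static int
wh_pair_forms_agree (void)
{
    uint32_t dx;

    for (dx = 0; dx < BILINEAR_INTERPOLATION_RANGE; dx++)
    {
        if (wh_pair_shift_or (dx) != wh_pair_mul_add (dx))
            return 0;
    }
    return 1;
}
```

Whether a compiler actually emits imul + add for the second form, and
whether that is faster, depends on the compiler and the target, so
this only demonstrates that the results are equal.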

Anyway, I'm going to try a complete assembly implementation of
bilinear scaling on Monday, optimized at least for Intel Atom.

--
Best regards,
Siarhei Siamashka