[Pixman] [PATCH 3/7] C variant of bilinear scaled 'src_8888_8888' fast path

Sat Feb 26 10:22:11 PST 2011

On Saturday 26 February 2011 14:33:02 Soeren Sandmann wrote:
> Hi,

Hi,

First of all, I would say that this C fast path was primarily added in order to
benchmark the effect of doing single pass scaling without intermediate buffer
on different processors. With the intention to see whether this single pass
processing has any advantage over just putting efforts into SIMD optimizing
bilinear fetchers and be done with that. I think it proved to be useful,
especially on ARM and MIPS.

But I'm actually in favor of dropping all these bilinear scaling C fast paths
for now, especially considering that the discussion is still ongoing about
how to improve bilinear scaling performance when SIMD extensions are not
available.

Another reason to have bilinear C fast paths in pixman is that they could be
used as a reference C code if/when we get contributors willing to also optimize
bilinear scaling for other architectures  (powerpc VMX, MIPS DSP ASE, ...). So
when we get a question like "Could you let me know the bilinear scaling
interfaces in pixman and where the SIMD optimization will be applied?"
again, we would have some place in the code to show. And even better, we
should proactively prepare some documentation with some basic introductory
information about adding new CPU specific optimizations to pixman, so that
the contributors will be able to save their time and avoid common pitfalls.

> Most of this series looks good to me, except that this:
> >  static force_inline uint32_t
> > 
> > +bilinear_interpolation (uint32_t tl, uint32_t tr,
> > +			uint32_t bl, uint32_t br,
> > +			int distx, int wt, int wb)
> > +{
> 
> is a duplication of the code from pixman-bits-image, except that disty
> has been split into wt and wb so that it can deal with the top and
> bottom edges of the image.
> 
> The bits_image_fetch_bilinear_no_repeat_8888() function handles the same
> problem by simply passing 0 for tl/tr or bl/br. Could it be done the
> same way here?

It could be done in a somewhat similar way by passing some extra information to
the scanline processing function so that it could identify that it is a special
case and that one of the scanlines is zero, and even gain a bit more speed in
such a case because just a single scanline would need to be processed. But that
would make scanline processing function more complex. And the primary goal was
to move all the complexity to the main loop whenever possible.

> Alternatively, if there is a good reason for the split,
> presumably that same reason applies to the code in pixman-bits-image.c.

Yes, in the end it has the same complexity and performance as the code from 
'pixman-bits-image.c' after changing 

+ distxy = distx * wb;
+ distxiy = distx * wt;
+ distixy = wb * (256 - distx);
+ distixiy = (256 - distx) * wt;

to

  distxy = distx * wb;
  distxiy = distx * wt;
  distixy = (wb << 8) - distxy;
  distixiy = (wt << 8) - distxiy;

What do you think would be the best header file to move this
'bilinear_interpolation' inline function for common reuse?

But in any case, applying additional optimizations to bilinear scaling C fast
path may make the use of this 'bilinear_interpolation' function unnecessary
here. For example, using the following code for scaling provides some good
speedup at least on 32-bit systems:

static void
bilinear_interpolate_line (uint32_t *       buffer,
                           const uint32_t * top_row,
                           const uint32_t * bottom_row,
                           int              wt,
                           int              wb,
                           pixman_fixed_t   x,
                           pixman_fixed_t   ux,
                           int              width)
{
#ifdef WORDS_BIGENDIAN
    #define OFF_B 3
    #define OFF_G 2
    #define OFF_R 1
    #define OFF_A 0
#else
    #define OFF_B 0
    #define OFF_G 1
    #define OFF_R 2
    #define OFF_A 3
#endif
    while (--width >= 0)
    {
        int distx;
        int distxy, distxiy, distixy, distixiy;
        uint32_t r;
        const uint8_t *t, *b;

        t = (uint8_t *) &top_row [pixman_fixed_to_int (x)];
        b = (uint8_t *) &bottom_row [pixman_fixed_to_int (x)];

        distx = (x >> 8) & 0xff;
        distxy = distx * wb;
        distxiy = distx * wt;
        distixy = (wb << 8) - distxy;
        distixiy = (wt << 8) - distxiy;

        /* Blue */
        r = t[OFF_B] * distixiy + t[OFF_B + 4] * distxiy +
            b[OFF_B] * distixy  + b[OFF_B + 4] * distxy;

        /* Green */
        r |= ((t[OFF_G] * distixiy + t[OFF_G + 4] * distxiy +
               b[OFF_G] * distixy  + b[OFF_G + 4] * distxy) << 8) & 0xff000000;

        r >>= 16;

        /* Red */
        r |= (t[OFF_R] * distixiy + t[OFF_R + 4] * distxiy +
              b[OFF_R] * distixy  + b[OFF_R + 4] * distxy) & 0x00ff0000;

        /* Alpha */
        r |= ((t[OFF_A] * distixiy + t[OFF_A + 4] * distxiy +
               b[OFF_A] * distixy  + b[OFF_A + 4] * distxy) << 8) & 0xff000000;

        *buffer++ = r;
        x += ux;
    }

    #undef OFF_B
    #undef OFF_G
    #undef OFF_R
    #undef OFF_A
}

By directly reading color components as individual bytes from memory and
avoiding masking/extracting them from uint32_t variable, part of the work is
moved from ALU to LSU, which improves performance for many processors.
On Intel Core i7, it becomes almost as fast as using 64-bit variant of
'bilinear_interpolation' function, but still slightly falls behind:
    32-bit 'bilinear_interpolation' - 73.85 MPix/s
    64-bit 'bilinear_interpolation' - 82.41 MPix/s
    new 'bilinear_interpolate_line' - 81.02 MPix/s

- 
Best regards,
Siarhei Siamashka