[Pixman] Image scaling with bilinear interpolation performance

Mon Feb 28 01:22:08 PST 2011

To. Siarhei Siamashka

> Great. Surely contributions in this area would be definitely useful. But
you
> may have started this work a bit too late ;) I have been looking into
improving
> bilinear scaling performance for the last couple of weeks already and have
just
> submitted some initial SSE2 and ARM NEON optimizations for it (btw,
testing is
> very much welcome). And there is still lots of work to do before all the
> bilinear scaling related performance bottlenecks are eliminated.

I'm very glad to hear that you are already doing some work on it.

> Well, in your original e-mail, you mentioned that you are interested in
getting
> good performance on intel quad core. That's why without having any other
> information, I suggested SSE2 as a solution for this problem :)

> I also tried to benchmark your change to the bilinear code and got
something
> like 23% better scaling performance overall on Intel Core i7. I guess you
have
> benchmarked 2x performance improvement for that function alone but not for
a
> full rendering pipeline, right?. It's a good improvement, but not even
close to
> the performance effect of using SSE2 or NEON (or maybe even armv5te). So I
> would consider looking at the supported instruction set on your target
hardware
> first.

I'm sorry about that I did not give you enough information on my test.
I also tuned some codes of the function
bits_image_fetch_bilinear_no_repeat_8888().
I made some fast path for the case where mask == NULL and ux_bottom ==
ux_top.
(But still using temporary source fetch buffer)

Anyway, majority of our target machines support NEON instructions.
Thus, your recent NEON patches are exactly what we are looking for.
I really appreciate what you've done for this :-)

At this point, my concern is about which cases have been (or will be)
optimized using NEON and which are not.
There can be so many fast paths for combination of mask, rotation(including
special angle), scale factors, pixel formats, operators and so on.
As you mentioned it is boring job to make fast paths for all these cases.
So we should focus at the cases most commonly used.
I think most of the common operations are already optimized for both NEON
and SSE2.
What do you think is the remaining parts?

To. Soeren Sandmann

> If we decide to move away from bit-exact testing, we would need to
> decide on an acceptable deviation from ideal, and then update the
> tests to verify that both the C and SIMD implementations are within
> that deviation.
>
> For example there could be a reference implementation that computes
> the bilinear filtering in double precision floating point, adds +/-
> x%, then converts that range into the target bitdepth and finally
> verifies that the output is within that range.

Here, we need reference implementation.
In my opinion, it is overdoing that reference implementation should support
arbitrary affine transformation, various pixel formats, operators and so on.
Because pixman is based on scanline approach, verifying bilinear filtered
fetching from source with only scale transformation would be sufficient.

Here's my simple test code. (bilinear_filter_test.c)

#include <cairo.h>
#include <pixman.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>
#include <assert.h>
#include <math.h>

//#define USE_DOUBLE_PRECISION
//#define USE_RANDOM_SOURCE

#define RED(x)    (((x)&0x00FF0000) >> 16)
#define GREEN(x)    (((x)&0x0000FF00) >> 8)
#define BLUE(x)    ((x)&0x000000FF)
#define ALPHA(x)    (((x)&0xFF000000) >> 24)

#define RED_DOUBLE(x) (((double)RED(x)) / 255.0)
#define GREEN_DOUBLE(x) (((double)GREEN(x)) / 255.0)
#define BLUE_DOUBLE(x) (((double)BLUE(x)) / 255.0)
#define ALPHA_DOUBLE(x) (((double)ALPHA(x)) / 255.0)

#define ROUND_TO_INT(x) ((int)round(x))
#define FLOOR_TO_INT(x) ((int)floor(x))

static int inline
absolute(int val)
{
if( val > 0 )
 return val;
 else
return -val;
}

static uint32_t inline
bilinear_interpolation(uint32_t tl, uint32_t tr, uint32_t bl, uint32_t br,
double distx, double disty)
{
#ifdef USE_DOUBLE_PRECISION
 double r, g, b, a;
 double distxy, distixy, distxiy, distixiy;
 uint32_t result;

distxy = distx*disty;
 distixy = disty - distxy;
 distxiy = distx - distxy;
 distixiy = 1 - distx - disty + distxy;

r = RED_DOUBLE(tl)*distixiy + RED_DOUBLE(tr)*distxiy +
RED_DOUBLE(bl)*distixy + RED_DOUBLE(br)*distxy;
 g = GREEN_DOUBLE(tl)*distixiy + GREEN_DOUBLE(tr)*distxiy +
GREEN_DOUBLE(bl)*distixy + GREEN_DOUBLE(br)*distxy;
 b = BLUE_DOUBLE(tl)*distixiy + BLUE_DOUBLE(tr)*distxiy +
BLUE_DOUBLE(bl)*distixy + BLUE_DOUBLE(br)*distxy;
 a = ALPHA_DOUBLE(tl)*distixiy + ALPHA_DOUBLE(tr)*distxiy +
ALPHA_DOUBLE(bl)*distixy + ALPHA_DOUBLE(br)*distxy;

result = (ROUND_TO_INT(a*255.0) << 24) | (ROUND_TO_INT(r*255.0) << 16) |
(ROUND_TO_INT(g*255.0) << 8) | ROUND_TO_INT(b*255.0);
 return result;
#else // Emulation of pixman_fixed_t
int distxy, distxiy, distixy, distixiy;
 uint32_t f, r;
 int dx, dy;

dx = ((ROUND_TO_INT(distx*65536)) >> 8) & 0xFF;
 dy = ((ROUND_TO_INT(disty*65536)) >> 8) & 0xFF;

distxy = dx * dy;
 distxiy = (dx << 8) - distxy;    /* distx * (256 - disty) */
 distixy = (dy << 8) - distxy;    /* disty * (256 - distx) */
 distixiy =
 256 * 256 - (dy << 8) -
 (dx << 8) + distxy;      /* (256 - distx) * (256 - disty) */

/* Blue */
 r = (tl & 0x000000ff) * distixiy + (tr & 0x000000ff) * distxiy
 + (bl & 0x000000ff) * distixy  + (br & 0x000000ff) * distxy;

/* Green */
 f = (tl & 0x0000ff00) * distixiy + (tr & 0x0000ff00) * distxiy
 + (bl & 0x0000ff00) * distixy  + (br & 0x0000ff00) * distxy;
 r |= f & 0xff000000;

tl >>= 16;
 tr >>= 16;
 bl >>= 16;
 br >>= 16;
 r >>= 16;

/* Red */
 f = (tl & 0x000000ff) * distixiy + (tr & 0x000000ff) * distxiy
 + (bl & 0x000000ff) * distixy  + (br & 0x000000ff) * distxy;
 r |= f & 0x00ff0000;

/* Alpha */
 f = (tl & 0x0000ff00) * distixiy + (tr & 0x0000ff00) * distxiy
 + (bl & 0x0000ff00) * distixy  + (br & 0x0000ff00) * distxy;
 r |= f & 0xff000000;

return r;

#endif
}

static void
generate_reference_image_a8r8g8b8(void *src_buffer, int src_width, int
src_height, int src_stride,
 void *dst_buffer, int dst_width, int dst_height, int dst_stride)
{
int dx, dy, isx, isy;
 double sx, sy;
 double tl, tr, bl, br;
 double distx, disty;
 double xincr;
 uint32_t *dst_row;
 uint32_t *src_top_row;
 uint32_t *src_bottom_row;

#ifdef USE_DOUBLE_PRECISION
 xincr = (double)src_width/(double)dst_width;
#else
xincr = ((double)((src_width << 16) / dst_width)) / 65536.0;
#endif

 for( dy=0; dy<dst_height; ++dy )
 {
#ifdef USE_DOUBLE_PRECISION
 sx = 0.5*(double)src_width/(double)dst_width - 0.5;
 sy = ((double)dy + 0.5)*(double)src_height/(double)dst_height - 0.5;
#else
sx = (0.5*((src_width << 16)/dst_width))/65536.0 - 0.5;
 sy = (((double)dy + 0.5)*((src_height << 16)/dst_height))/65536.0 - 0.5;
#endif
dst_row = (uint32_t*)((unsigned char*)dst_buffer + dst_stride*dy);

assert( sx >= -1.0 );
 assert( sy >= -1.0 && sy < src_height + 1 );

if( sy < 0.0 ) // No source top row
 {
disty = sy - (double)FLOOR_TO_INT(sy);
 src_top_row = NULL;
 src_bottom_row = (uint32_t*)src_buffer;

dx = 0;

while( dx < dst_width && sx < -1.0 )
 {
*dst_row++ = 0x00000000;
 sx += xincr;
 dx++;
 }

 // Left edge
 while( dx < dst_width && sx < 0.0 )
 {
isx = FLOOR_TO_INT(sx);
 distx = sx - (double)isx;

*dst_row++ = bilinear_interpolation(0x00000000, 0x00000000, 0x00000000,
src_bottom_row[isx + 1], distx, disty);

sx += xincr;
 dx++;
 }

 // Main part
 while( dx < dst_width && sx < src_width - 1.0 )
 {
isx = FLOOR_TO_INT(sx);
 distx = sx - (double)isx;

*dst_row++ = bilinear_interpolation(0x00000000, 0x00000000,
src_bottom_row[isx], src_bottom_row[isx + 1], distx, disty);

sx += xincr;
 dx++;
 }

 // Right edge
 while( dx < dst_width && sx < src_width )
 {
isx = FLOOR_TO_INT(sx);
 distx = sx - (double)isx;

*dst_row++ = bilinear_interpolation(0x00000000, 0x00000000,
src_bottom_row[isx], 0x00000000, distx, disty);

sx += xincr;
 dx++;
 }

 while( dx < dst_width )
 {
*dst_row++ = 0x00000000;
 dx++;
 }
}
 else if( sy > src_height ) // No source bottom row
 {
disty = sy - (double)FLOOR_TO_INT(sy);
 src_top_row = (uint32_t*)((unsigned char*)src_buffer +
src_stride*(src_height - 1));
 src_bottom_row = NULL;

dx = 0;

while( dx < dst_width && sx < -1.0 )
 {
*dst_row++ = 0x00000000;
 sx += xincr;
 dx++;
 }

 // Left edge
 while( dx < dst_width && sx < 0.0 )
 {
isx = FLOOR_TO_INT(sx);
 distx = sx - (double)isx;

*dst_row++ = bilinear_interpolation(0x00000000, src_top_row[isx + 1],
0x00000000, 0x00000000, distx, disty);

sx += xincr;
 dx++;
 }

 // Main part
 while( dx < dst_width && sx < src_width - 1.0 )
 {
isx = FLOOR_TO_INT(sx);
 distx = sx - (double)isx;

*dst_row++ = bilinear_interpolation(src_top_row[isx], src_top_row[isx + 1],
0x00000000, 0x00000000, distx, disty);

sx += xincr;
 dx++;
 }

 // Right edge
 while( dx < dst_width && sx < src_width )
 {
isx = FLOOR_TO_INT(sx);
 distx = sx - (double)isx;

*dst_row++ = bilinear_interpolation(src_top_row[isx], 0x00000000,
0x00000000, 0x00000000, distx, disty);

sx += xincr;
 dx++;
 }

 while( dx < dst_width )
 {
*dst_row++ = 0x00000000;
 dx++;
 }
}
 else
{
 isy = FLOOR_TO_INT(sy);
 disty = sy - (double)isy;

src_top_row = (uint32_t*)((unsigned char*)src_buffer + src_stride*isy);
 src_bottom_row = (uint32_t*)((unsigned char*)src_buffer + src_stride*(isy +
1));

dx = 0;

while( dx < dst_width && sx < -1.0 )
 {
*dst_row++ = 0x00000000;
 sx += xincr;
 dx++;
 }

 // Left edge
 while( dx < dst_width && sx < 0.0 )
 {
isx = FLOOR_TO_INT(sx);
 distx = sx - (double)isx;

*dst_row++ = bilinear_interpolation(0x00000000, src_top_row[isx + 1],
0x00000000, src_bottom_row[isx + 1], distx, disty);

sx += xincr;
 dx++;
 }

 // Main part
 while( dx < dst_width && sx < src_width - 1.0 )
 {
isx = FLOOR_TO_INT(sx);
 distx = sx - (double)isx;

*dst_row++ = bilinear_interpolation(src_top_row[isx], src_top_row[isx + 1],
src_bottom_row[isx], src_bottom_row[isx + 1], distx, disty);

sx += xincr;
 dx++;
 }

 // Right edge
 while( dx < dst_width && sx < src_width )
 {
isx = FLOOR_TO_INT(sx);
 distx = sx - (double)isx;

*dst_row++ = bilinear_interpolation(src_top_row[isx], 0x00000000,
src_bottom_row[isx], 0x00000000, distx, disty);

sx += xincr;
 dx++;
 }

 while( dx < dst_width )
 {
*dst_row++ = 0x00000000;
 dx++;
 }
}
 }
}

static void
generate_test_image_a8r8g8b8(void *src_buffer, int src_width, int
src_height, int src_stride,
 void *dst_buffer, int dst_width, int dst_height, int dst_stride)
{
pixman_image_t *    src_img;
 pixman_image_t * dst_img;
 pixman_transform_t  transform;

src_img = pixman_image_create_bits(PIXMAN_a8r8g8b8, src_width, src_height,
src_buffer, src_stride);
 dst_img = pixman_image_create_bits(PIXMAN_a8r8g8b8, dst_width, dst_height,
dst_buffer, dst_stride);

pixman_transform_init_scale(&transform, (src_width << 16)/dst_width,
(src_height << 16)/dst_height);
 pixman_image_set_transform(src_img, &transform);
 pixman_image_set_filter(src_img, PIXMAN_FILTER_BILINEAR, NULL, 0);

pixman_image_composite(PIXMAN_OP_SRC, src_img, NULL, dst_img, 0, 0, 0, 0, 0,
0, dst_width, dst_height);

pixman_image_unref(src_img);
 pixman_image_unref(dst_img);
}

static int
compare_image_a8r8g8b8(void *buffer0, void *buffer1, int width, int height,
int stride, int tolerance)
{
int x, y;
 uint32_t *row0;
 uint32_t *row1;
 int   val0, val1;

for( y=0; y<height; ++y )
 {
row0 = (uint32_t*)((unsigned char*)buffer0 + stride*y);
 row1 = (uint32_t*)((unsigned char*)buffer1 + stride*y);

for( x=0; x<width; ++x )
 {
val0 = RED(*row0);
 val1 = RED(*row1);

if( absolute(val0 - val1) > tolerance )
 {
printf("RED : x=%d y=%d\n", x, y);
 return absolute(val0 - val1);
 }

 val0 = GREEN(*row0);
 val1 = GREEN(*row1);

if( absolute(val0 - val1) > tolerance )
 {
printf("GREEN : x=%d y=%d\n", x, y);
 return absolute(val0 - val1);
 }

 val0 = BLUE(*row0);
 val1 = BLUE(*row1);

if( absolute(val0 - val1) > tolerance )
 {
printf("BLUE : x=%d y=%d\n", x, y);
 return absolute(val0 - val1);
 }

 val0 = ALPHA(*row0);
 val1 = ALPHA(*row1);

if( absolute(val0 - val1) > tolerance )
 {
printf("ALPHA : x=%d y=%d\n", x, y);
 return absolute(val0 - val1);
 }

 row0++;
 row1++;
 }
}

return 0;
}

static int
bilinear_filter_test_a8r8g8b8(void *src_buffer, int src_width, int
src_height, int src_stride,
 unsigned int scale_x, unsigned int scale_y, int tolerance)
{
int dst_width;
 int dst_height;
 int dst_stride;
 void *ref_buffer;
 void *test_buffer;
 cairo_surface_t *dst0, *dst1, *src;
 int result;

dst_width = (src_width * scale_x) >> 8;
 dst_height = (src_height *scale_y) >> 8;

printf("dw = %d dh = %d\n", dst_width, dst_height);
 dst_stride = dst_width * 4;

ref_buffer = malloc(dst_stride*dst_height);
 test_buffer = malloc(dst_stride*dst_height);

memset(ref_buffer, 0x00, dst_stride*dst_height);
 memset(test_buffer, 0x00, dst_stride*dst_height);

generate_reference_image_a8r8g8b8(src_buffer, src_width, src_height,
src_stride,
 ref_buffer, dst_width, dst_height, dst_stride);

generate_test_image_a8r8g8b8(src_buffer, src_width, src_height, src_stride,
 test_buffer, dst_width, dst_height, dst_stride);

src = cairo_image_surface_create_for_data((unsigned char*)src_buffer,
CAIRO_FORMAT_ARGB32, src_width, src_height, src_stride);
 dst0 = cairo_image_surface_create_for_data((unsigned char*)ref_buffer,
CAIRO_FORMAT_ARGB32, dst_width, dst_height, dst_stride);
 dst1 = cairo_image_surface_create_for_data((unsigned char*)test_buffer,
CAIRO_FORMAT_ARGB32, dst_width, dst_height, dst_stride);

cairo_surface_write_to_png(src, "src.png");
 cairo_surface_write_to_png(dst0, "ref.png");
 cairo_surface_write_to_png(dst1, "test.png");

cairo_surface_destroy(src);
 cairo_surface_destroy(dst0);
 cairo_surface_destroy(dst1);

result = compare_image_a8r8g8b8(ref_buffer, test_buffer, dst_width,
dst_height, dst_stride, tolerance);

free(ref_buffer);
 free(test_buffer);

return result;
}

#define MAX_WIDTH   1024
#define MAX_HEIGHT  1024

#define MAX_SCALE_X 0x200
#define MAX_SCALE_Y 0x200

int
main(int argc, const char *argv[])
{
int seed;
 void *src_buffer;
 int src_width, src_height;
 int src_stride;
 unsigned int scale_x, scale_y;
 int tolerance;
 int i, j;
 int result;

 if( argc != 3 )
 {
printf("Usage : bilinear_test seed tolerance\n");
 return 0;
 }

 seed = atoi(argv[1]);
 tolerance = atoi(argv[2]);
 srand(seed);

src_width = rand()%MAX_WIDTH + 1;
 src_height = rand()%MAX_HEIGHT + 1;
 scale_x = rand()%MAX_SCALE_X;
 scale_y = rand()%MAX_SCALE_Y;
 src_stride = src_width * 4;

src_buffer = malloc(src_stride*src_height);

#ifdef USE_RANDOM_SOURCE  // Random source image
 for( i=0; i<src_stride*src_height; ++i )
 ((unsigned char*)src_buffer)[i] = rand()%256;
#else // Check Board source image
for( i=0; i<src_height; ++i )
 {
for( j=0; j<src_width; ++j )
 {
if( (i%16 >= 0 && i%16 < 8))
 {
if( (j%16 >= 0 && j%16 < 8) )
 ((uint32_t*)src_buffer)[i*src_width + j] = 0xFF000000;
 else
((uint32_t*)src_buffer)[i*src_width + j] = 0xFFFFFFFF;
 }
else
 {
if( (j%16 >= 0 && j%16 < 8) )
 ((uint32_t*)src_buffer)[i*src_width + j] = 0xFFFFFFFF;
 else
((uint32_t*)src_buffer)[i*src_width + j] = 0xFF000000;

}
 }
}
#endif

 printf("w=%d h=%d scale_x=%d scale_y=%d\n", src_width, src_height, scale_x,
scale_y);

if( (result = bilinear_filter_test_a8r8g8b8(src_buffer, src_width,
src_height, src_stride, scale_x, scale_y, tolerance)) != 0 )
 printf("Test Failed!!! Difference %d exceeds tolerance %d\n", result,
tolerance);
 else
printf("Test Succeeded!!\n");

free(src_buffer);
 return result;
}

The reference implementation is based on
bits_image_fetch_bilinear_no_repeat_888() function in pixman-bits-image.c.
I think this code can be a start point.

-- 
Best Regards,
Taekyun Kim
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.freedesktop.org/archives/pixman/attachments/20110228/e3a24d7f/attachment.html>