[Pixman] fast-scale branch performance improvements

Tue Mar 16 13:54:13 PDT 2010

On Tue, 2010-03-16 at 19:22 +0100, Alexander Larsson wrote:
> On Tue, 2010-03-16 at 20:17 +0200, Siarhei Siamashka wrote:
> > On Tuesday 16 March 2010, Siarhei Siamashka wrote:
> > > On Tuesday 16 March 2010, Alexander Larsson wrote:
> > > > On Tue, 2010-03-16 at 16:51 +0200, Siarhei Siamashka wrote:
> > > > > Regarding alex's branch and performance, I already mentioned that it
> > > > > was much slower for the over_8888_0565 case in my benchmark when
> > > > > compared against my branch on ARM Cortex-A8 (the other cases of
> > > > > scaling are ok). I'm using the following test program for
> > > > > benchmarking these optimizations:
> > > > >
> > > > > http://cgit.freedesktop.org/~siamashka/pixman/commit/?h=test-n-bench&id=93ec60149cb3535f70a9e285de0b359ff444f26e
> > > > >
> > > > > The test program tries to benchmark scaling when the source and
> > > > > destination image sizes are approximately the same (so the
> > > > > performance can be more or less directly compared to a simple
> > > > > nonscaled blit).
> > > > >
> > > > > The results are (variance is only in the last digit):
> > > > >
> > > > > op=3, src_fmt=20028888, dst_fmt=10020565, speed=5.06 MPix/s (1.21 FPS)
> > > > > vs.
> > > > > op=3, src_fmt=20028888, dst_fmt=10020565, speed=8.72 MPix/s (2.08 FPS)
> > > > >
> > > > > which is quite a lot.
> > > >
> > > > Can you retry with my new branch:
> > > > http://cgit.freedesktop.org/~alexl/pixman/log/?h=alex-scaler2
> > >
> > > Now it is:
> > > op=3, src_fmt=20028888, dst_fmt=10020565, speed=5.16 MPix/s (1.23 FPS)
> > >
> > > A little bit better, but still not good.
> > 
> > Found the problem, it's here:
> > > + SIMPLE_NEAREST_FAST_PATH (OVER, a8b8g8r8, r5g6b5, 8888_565),
> > This should have a8r8g8b8 instead of a8b8g8r8. So this fast path just
> > was not run at all. Once fixed, it shows the expected performance.
> > 
> > 
> > Also, the 'alex-scaler2' branch is substantially slower than
> > 'alex-scaler' for normal repeat:
> > 
> > == nearest tiled SRC (alex-scaler) ==
> > op=1, src_fmt=20028888, dst_fmt=20028888, speed=90.91 MPix/s (21.67 FPS)
> > op=1, src_fmt=20028888, dst_fmt=10020565, speed=63.82 MPix/s (15.22 FPS)
> > op=1, src_fmt=10020565, dst_fmt=10020565, speed=92.16 MPix/s (21.97 FPS)
> > 
> > == nearest tiled SRC (alex-scaler2) ==
> > op=1, src_fmt=20028888, dst_fmt=20028888, speed=76.54 MPix/s (18.25 FPS)
> > op=1, src_fmt=20028888, dst_fmt=10020565, speed=50.44 MPix/s (12.03 FPS)
> > op=1, src_fmt=10020565, dst_fmt=10020565, speed=67.14 MPix/s (16.01 FPS)
> 
> This may well be the change from open-coding the repeat to using the
> repeat() inline function.

It is. The following patch, which reverts the use of repeat(), changes the results from:

== nearest tiled SRC ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=349.27 MPix/s (83.27 FPS)
op=1, src_fmt=20028888, dst_fmt=10020565, speed=137.86 MPix/s (32.87 FPS)
op=1, src_fmt=10020565, dst_fmt=10020565, speed=364.15 MPix/s (86.82 FPS)

to:

== nearest tiled SRC ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=556.82 MPix/s (132.76 FPS)
op=1, src_fmt=20028888, dst_fmt=10020565, speed=136.19 MPix/s (32.47 FPS)
op=1, src_fmt=10020565, dst_fmt=10020565, speed=471.76 MPix/s (112.48 FPS)

Which is nothing to sneeze at.

diff --git a/pixman/pixman-fast-path.c b/pixman/pixman-fast-path.c
index 6607a47..b6c8f7c 100644
--- a/pixman/pixman-fast-path.c
+++ b/pixman/pixman-fast-path.c
@@ -1474,7 +1474,12 @@ fast_composite_scaled_nearest_ ## scale_func_name ## _ ## OP (pixman_implementat
 	y = vy >> 16;										\
 	vy += unit_y;										\
 	if (do_repeat)										\
-	    repeat (PIXMAN_REPEAT_NORMAL, &vy, max_vy);						\
+	{											\
+	    if (unit_y >= 0)									\
+		while (vy >= max_vy) vy -= max_vy;						\
+	    else										\
+		while (vy < 0) vy += max_vy;							\
+	}											\
 												\
 	src = src_first_line + src_stride * y;							\
 												\
@@ -1485,13 +1490,23 @@ fast_composite_scaled_nearest_ ## scale_func_name ## _ ## OP (pixman_implementat
 	    x1 = vx >> 16;									\
 	    vx += unit_x;									\
 	    if (do_repeat)									\
-		repeat (PIXMAN_REPEAT_NORMAL, &vx, max_vx);					\
+	    {											\
+		if (unit_x >= 0)								\
+		    while (vx >= max_vx) vx -= max_vx;						\
+		else										\
+		    while (vx < 0) vx += max_vx;						\
+	    }											\
 	    s1 = src[x1];									\
 												\
 	    x2 = vx >> 16;									\
 	    vx += unit_x;									\
 	    if (do_repeat)									\
-		repeat (PIXMAN_REPEAT_NORMAL, &vx, max_vx);					\
+	    {											\
+		if (unit_x >= 0)								\
+		    while (vx >= max_vx) vx -= max_vx;						\
+		else										\
+		    while (vx < 0) vx += max_vx;						\
+	    }											\
 	    s2 = src[x2];									\
 												\
 	    if (PIXMAN_OP_ ## OP == PIXMAN_OP_OVER)						\
@@ -1537,7 +1552,12 @@ fast_composite_scaled_nearest_ ## scale_func_name ## _ ## OP (pixman_implementat
 	    x1 = vx >> 16;									\
 	    vx += unit_x;									\
 	    if (do_repeat)									\
-		repeat (PIXMAN_REPEAT_NORMAL, &vx, max_vx);					\
+	    {											\
+		if (unit_x >= 0)								\
+		    while (vx >= max_vx) vx -= max_vx;						\
+		else										\
+		    while (vx < 0) vx += max_vx;						\
+	    }											\
 	    s1 = src[x1];									\
 												\
 	    if (PIXMAN_OP_ ## OP == PIXMAN_OP_OVER)						\


Further simplifying by removing support for unit_x < 0 (i.e. mirrored
scaling) gives:

== nearest tiled SRC ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=655.11 MPix/s (156.19 FPS)
op=1, src_fmt=20028888, dst_fmt=10020565, speed=136.02 MPix/s (32.43 FPS)
op=1, src_fmt=10020565, dst_fmt=10020565, speed=619.16 MPix/s (147.62 FPS)

It might be interesting to duplicate the inner loop, once for each sign of
unit_x, so that this performance increase always applies.


-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                            Red Hat, Inc 
       alexl at redhat.com            alexander.larsson at gmail.com 