[Pixman] fast-scale branch performance improvements
Alexander Larsson
alexl at redhat.com
Tue Mar 16 13:54:13 PDT 2010
On Tue, 2010-03-16 at 19:22 +0100, Alexander Larsson wrote:
> On Tue, 2010-03-16 at 20:17 +0200, Siarhei Siamashka wrote:
> > On Tuesday 16 March 2010, Siarhei Siamashka wrote:
> > > On Tuesday 16 March 2010, Alexander Larsson wrote:
> > > > On Tue, 2010-03-16 at 16:51 +0200, Siarhei Siamashka wrote:
> > > > > Regarding the alex's branch and performance, I already mentioned that
> > > > > it was much slower for over_8888_0565 case in my benchmark when compared
> > > > > against my branch on ARM Cortex-A8 (the other cases of scaling are ok).
> > > > > I'm using the following test program for benchmarking these optimizations:
> > > > >
> > > > > http://cgit.freedesktop.org/~siamashka/pixman/commit/?h=test-n-bench&id=93ec60149cb3535f70a9e285de0b359ff444f26e
> > > > >
> > > > > The test program tries to benchmark scaling of when source and destination
> > > > > image sizes are approximately the same (and the performance can be more or
> > > > > less directly compared to the simple nonscaled blit).
> > > > >
> > > > > The results are (variance is only in the last digit):
> > > > >
> > > > > op=3, src_fmt=20028888, dst_fmt=10020565, speed=5.06 MPix/s (1.21 FPS)
> > > > > vs.
> > > > > op=3, src_fmt=20028888, dst_fmt=10020565, speed=8.72 MPix/s (2.08 FPS)
> > > > >
> > > > > which is quite a lot.
> > > >
> > > > Can you retry with my new branch:
> > > > http://cgit.freedesktop.org/~alexl/pixman/log/?h=alex-scaler2
> > >
> > > Now it is:
> > > op=3, src_fmt=20028888, dst_fmt=10020565, speed=5.16 MPix/s (1.23 FPS)
> > >
> > > A little bit better, but still not good.
> >
> > Found the problem, it's here:
> > > + SIMPLE_NEAREST_FAST_PATH (OVER, a8b8g8r8, r5g6b5, 8888_565),
> > This should have a8r8g8b8 instead of a8b8g8r8. So this fast path just was not
> > run at all. Once fixed, it shows the expected performance.
> >
> >
> > Also 'alex-scaler2' branch is substantially slower than 'alex-scaler' for
> > normal repeat:
> >
> > == nearest tiled SRC (alex-scaler) ==
> > op=1, src_fmt=20028888, dst_fmt=20028888, speed=90.91 MPix/s (21.67 FPS)
> > op=1, src_fmt=20028888, dst_fmt=10020565, speed=63.82 MPix/s (15.22 FPS)
> > op=1, src_fmt=10020565, dst_fmt=10020565, speed=92.16 MPix/s (21.97 FPS)
> >
> > == nearest tiled SRC (alex-scaler2) ==
> > op=1, src_fmt=20028888, dst_fmt=20028888, speed=76.54 MPix/s (18.25 FPS)
> > op=1, src_fmt=20028888, dst_fmt=10020565, speed=50.44 MPix/s (12.03 FPS)
> > op=1, src_fmt=10020565, dst_fmt=10020565, speed=67.14 MPix/s (16.01 FPS)
>
> This may well be the change from open-coding the repeat to using the
> repeat() inline function.
It is. The following patch, which reverts the use of repeat() and goes back to
open-coding the wrap, changes the results from:
== nearest tiled SRC ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=349.27 MPix/s (83.27 FPS)
op=1, src_fmt=20028888, dst_fmt=10020565, speed=137.86 MPix/s (32.87 FPS)
op=1, src_fmt=10020565, dst_fmt=10020565, speed=364.15 MPix/s (86.82 FPS)
to:
== nearest tiled SRC ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=556.82 MPix/s (132.76 FPS)
op=1, src_fmt=20028888, dst_fmt=10020565, speed=136.19 MPix/s (32.47 FPS)
op=1, src_fmt=10020565, dst_fmt=10020565, speed=471.76 MPix/s (112.48 FPS)
That is nothing to sneeze at. The patch:
diff --git a/pixman/pixman-fast-path.c b/pixman/pixman-fast-path.c
index 6607a47..b6c8f7c 100644
--- a/pixman/pixman-fast-path.c
+++ b/pixman/pixman-fast-path.c
@@ -1474,7 +1474,12 @@ fast_composite_scaled_nearest_ ## scale_func_name ## _ ## OP (pixman_implementat
y = vy >> 16; \
vy += unit_y; \
if (do_repeat) \
- repeat (PIXMAN_REPEAT_NORMAL, &vy, max_vy); \
+ { \
+ if (unit_y >= 0) \
+ while (vy >= max_vy) vy -= max_vy; \
+ else \
+ while (vy < 0) vy += max_vy; \
+ } \
\
src = src_first_line + src_stride * y; \
\
@@ -1485,13 +1490,23 @@ fast_composite_scaled_nearest_ ## scale_func_name ## _ ## OP (pixman_implementat
x1 = vx >> 16; \
vx += unit_x; \
if (do_repeat) \
- repeat (PIXMAN_REPEAT_NORMAL, &vx, max_vx); \
+ { \
+ if (unit_x >= 0) \
+ while (vx >= max_vx) vx -= max_vx; \
+ else \
+ while (vx < 0) vx += max_vx; \
+ } \
s1 = src[x1]; \
\
x2 = vx >> 16; \
vx += unit_x; \
if (do_repeat) \
- repeat (PIXMAN_REPEAT_NORMAL, &vx, max_vx); \
+ { \
+ if (unit_x >= 0) \
+ while (vx >= max_vx) vx -= max_vx; \
+ else \
+ while (vx < 0) vx += max_vx; \
+ } \
s2 = src[x2]; \
\
if (PIXMAN_OP_ ## OP == PIXMAN_OP_OVER) \
@@ -1537,7 +1552,12 @@ fast_composite_scaled_nearest_ ## scale_func_name ## _ ## OP (pixman_implementat
x1 = vx >> 16; \
vx += unit_x; \
if (do_repeat) \
- repeat (PIXMAN_REPEAT_NORMAL, &vx, max_vx); \
+ { \
+ if (unit_x >= 0) \
+ while (vx >= max_vx) vx -= max_vx; \
+ else \
+ while (vx < 0) vx += max_vx; \
+ } \
s1 = src[x1]; \
\
if (PIXMAN_OP_ ## OP == PIXMAN_OP_OVER) \
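
To make the invariant behind the open-coded version explicit, here is a
minimal standalone sketch (not pixman code). It assumes the coordinate was
inside [0, max_v) before the step and that the step is smaller than max_v in
magnitude, so a non-negative step can only run past the upper bound and a
negative step can only run below zero. The names wrap_generic and
wrap_open_coded are made up for illustration; wrap_generic is a plain modulus
wrap and is not meant to reproduce pixman's repeat().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical general-purpose wrap into [0, max_v) using modulus.
 * This is a stand-in for "some generic wrap routine", not pixman's repeat(). */
static int32_t
wrap_generic (int32_t v, int32_t max_v)
{
    int32_t r;

    assert (max_v > 0);
    r = v % max_v;              /* C's % keeps the sign of the dividend */
    return r < 0 ? r + max_v : r;
}

/* Open-coded wrap in the style of the patch above.
 * Assumption: v was in [0, max_v) before 'unit' was added to it, so only
 * one bound can be violated, and which one depends on the sign of 'unit'. */
static int32_t
wrap_open_coded (int32_t v, int32_t unit, int32_t max_v)
{
    assert (max_v > 0);
    if (unit >= 0)
    {
        while (v >= max_v)
            v -= max_v;
    }
    else
    {
        while (v < 0)
            v += max_v;
    }
    return v;
}
```

Under those assumptions both functions return the same value for every step;
the open-coded one only has to test a single bound each time.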
Further simplifying by removing support for unit_x < 0 (i.e. mirrored
scaling) gives:
== nearest tiled SRC ==
op=1, src_fmt=20028888, dst_fmt=20028888, speed=655.11 MPix/s (156.19 FPS)
op=1, src_fmt=20028888, dst_fmt=10020565, speed=136.02 MPix/s (32.43 FPS)
op=1, src_fmt=10020565, dst_fmt=10020565, speed=619.16 MPix/s (147.62 FPS)
It might be worth duplicating the inner loop, once for each sign of unit_x,
so that this speedup applies to both normal and mirrored scaling.
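
A rough sketch of what splitting the loop by sign could look like, written as
a standalone function rather than as part of the pixman macro. Everything
here is an assumption for illustration: the function name
scale_row_nearest_tiled and its parameters are hypothetical, it handles a
single row of 32-bit pixels with a plain copy (no OVER), and coordinates are
16.16 fixed point as in the patch. The sign of unit_x is tested once outside
the loop, so each loop body carries only the one wrap check it needs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-row nearest scaler with normal (tiled) repeat.
 * vx and unit_x are 16.16 fixed point. Assumes vx starts in
 * [0, src_width << 16) and that |unit_x| is small enough that
 * vx + unit_x does not overflow int32_t. */
static void
scale_row_nearest_tiled (uint32_t       *dst,
                         const uint32_t *src,
                         int32_t         src_width,
                         int             width,
                         int32_t         vx,
                         int32_t         unit_x)
{
    int32_t max_vx;
    int i;

    assert (src_width > 0 && src_width < 32768);
    max_vx = src_width << 16;
    assert (vx >= 0 && vx < max_vx);

    if (unit_x >= 0)
    {
        /* Forward stepping: only the upper bound can be crossed. */
        for (i = 0; i < width; i++)
        {
            dst[i] = src[vx >> 16];
            vx += unit_x;
            while (vx >= max_vx)
                vx -= max_vx;
        }
    }
    else
    {
        /* Mirrored stepping: only the lower bound can be crossed. */
        for (i = 0; i < width; i++)
        {
            dst[i] = src[vx >> 16];
            vx += unit_x;
            while (vx < 0)
                vx += max_vx;
        }
    }
}
```

Whether this recovers the full gain of the unit_x >= 0 only variant would
need to be measured; the sketch only shows the shape of the change.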
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Alexander Larsson Red Hat, Inc
alexl at redhat.com alexander.larsson at gmail.com
He's a bookish moralistic cyborg plagued by the memory of his family's brutal
murder. She's a sarcastic punk stripper with her own daytime radio talk show.
They fight crime!