[Pixman] [test PATCH] Use _mm_maddubs_epi16 in BILINEAR_INTERPOLATE_ONE_PIXEL

Sat Sep 29 01:22:28 PDT 2012

On Sat, 29 Sep 2012 00:12:16 -0700
Matt Turner <mattst88 at gmail.com> wrote:

> Siarhei, can you measure any performance improvement with this? I
> can't... :(

I guess that's because you patched the code for the 8-bit
interpolation precision, and pixman is now using 7 bits by default.

But PMADDUBSW can be only used for the first step of interpolation
(vertical) and not the second one (horizontal). Because the first step
does 8-bit * 7-bit -> 15-bit multiplication. And the second step does
a wider 15-bit * 7-bit -> 22-bit multiplication.

The needed changes may look like this:

diff --git a/pixman/pixman-sse2.c b/pixman/pixman-sse2.c
index efed310..b260c95 100644
--- a/pixman/pixman-sse2.c
+++ b/pixman/pixman-sse2.c
@@ -32,6 +32,7 @@
 
 #include <xmmintrin.h> /* for _mm_shuffle_pi16 and _MM_SHUFFLE */
 #include <emmintrin.h> /* for SSE2 intrinsics */
+#include <tmmintrin.h> /* for SSSE3 intrinsics */
 #include "pixman-private.h"
 #include "pixman-combine32.h"
 #include "pixman-inlines.h"
@@ -5401,15 +5402,14 @@ FAST_NEAREST_MAINLOOP_COMMON (sse2_8888_n_8888_normal_OVER,
 #define BMSK ((1 << BILINEAR_INTERPOLATION_BITS) - 1)
 
 #define BILINEAR_DECLARE_VARIABLES						\
-    const __m128i xmm_wt = _mm_set_epi16 (wt, wt, wt, wt, wt, wt, wt, wt);	\
-    const __m128i xmm_wb = _mm_set_epi16 (wb, wb, wb, wb, wb, wb, wb, wb);	\
+    const __m128i xmm_wtb = _mm_set_epi8 (wt, wb, wt, wb, wt, wb, wt, wb,	\
+                                          wt, wb, wt, wb, wt, wb, wt, wb);	\
     const __m128i xmm_xorc8 = _mm_set_epi16 (0, 0, 0, 0, BMSK, BMSK, BMSK, BMSK);\
     const __m128i xmm_addc8 = _mm_set_epi16 (0, 0, 0, 0, 1, 1, 1, 1);		\
     const __m128i xmm_xorc7 = _mm_set_epi16 (0, BMSK, 0, BMSK, 0, BMSK, 0, BMSK);\
     const __m128i xmm_addc7 = _mm_set_epi16 (0, 1, 0, 1, 0, 1, 0, 1);		\
     const __m128i xmm_ux = _mm_set_epi16 (unit_x, unit_x, unit_x, unit_x,	\
 					  unit_x, unit_x, unit_x, unit_x);	\
-    const __m128i xmm_zero = _mm_setzero_si128 ();				\
     __m128i xmm_x = _mm_set_epi16 (vx, vx, vx, vx, vx, vx, vx, vx)
 
 #define BILINEAR_INTERPOLATE_ONE_PIXEL(pix)					\
@@ -5422,10 +5422,7 @@ do {										\
 			    (__m128i *)&src_bottom[pixman_fixed_to_int (vx)]);	\
     vx += unit_x;								\
     /* vertical interpolation */						\
-    a = _mm_add_epi16 (_mm_mullo_epi16 (_mm_unpacklo_epi8 (tltr, xmm_zero),	\
-					xmm_wt),				\
-		       _mm_mullo_epi16 (_mm_unpacklo_epi8 (blbr, xmm_zero),	\
-					xmm_wb));				\
+    a = _mm_maddubs_epi16 (_mm_unpacklo_epi8 (blbr, tltr), xmm_wtb);		\
     if (BILINEAR_INTERPOLATION_BITS < 8)					\
     {										\
 	/* calculate horizontal weights */					\


And I'm getting the following performance improvement on Core i7 860
when running "lowlevel-blt-bench -b src_8888_8888":

before:
           src_8888_8888 =  L1: 318.11  L2: 314.48  M:311.16
after:
           src_8888_8888 =  L1: 356.75  L2: 352.18  M:348.76

That's just ~12% faster. The next step would be to try taking the
compiler out of the way and ensuring that no CPU cycles are wasted :)

-- 
Best regards,
Siarhei Siamashka