[Pixman] [PATCH 1/1 v2] vmx: workarounds to fix powerpc little endian particularities

Siarhei Siamashka siarhei.siamashka at gmail.com
Tue Jun 2 20:42:35 PDT 2015


On Mon,  1 Jun 2015 16:46:00 -0400
Fernando Seiti Furusato <ferseiti at linux.vnet.ibm.com> wrote:

> I have made some changes to the file pixman-vmx.c, which uses vmx (aka altivec)
> to optimize pixman. Basically, what I did:
>  Changed vec_perm to perform an xor operation over the positions so it will
> work regardless of endianness.
>  Replaced usage of vec_mergeh, vec_mergel and vec_mladd with vec_mulo and
> vec_mule plus vec_add and vec_perm. The result is the same and not affected
> by endianness differences.
>  Replaced usage of vec_lvsl with a direct unaligned assignment (=), because,
> according to the Power ABI Specification, the usage of lvsl is deprecated on
> ppc64le.
>  Changed the COMPUTE_SHIFT_{MASK,MASKS,MASKC} macros into no-ops for powerpc
> little endian, since unaligned access is supported on ppc64le.
> After those changes, all tests passed on ppc64, ppc64le and powerpc VMs.
> 
> Signed-off-by: Fernando Seiti Furusato <ferseiti at linux.vnet.ibm.com>
> ---

Hello,

First, here is one trick related to patch handling with git. If you
are sending just a single patch, there is no need to accompany it
with an extra cover letter e-mail. The point of the cover letter is to
provide a general overview of a whole patch set and explain how its
multiple patches fit together.

If you want to provide an extra comment not intended to be a part of
the commit message, it can be added somewhere between the '---'
separator and the "diff --git ..." line in the patch. This part of
the patch is ignored when applying it.
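
For example, a single patch file could be laid out roughly like this
(the names below are just placeholders); the note between '---' and
the diffstat is thrown away when the patch is applied with 'git am':

    Subject: [PATCH] vmx: fix powerpc little endian particularities

    Commit message text goes here.

    Signed-off-by: Author Name <author at example.org>
    ---
    Extra notes for the reviewers go here.

     pixman/pixman-vmx.c | 106 ++++++++++++++++++++++----------------
     1 file changed, 67 insertions(+), 39 deletions(-)

    diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
    ...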

> Hello.
> In v2 of the patch, I tried to address some concerns raised by Siarhei
> Siamashka.
> One that I was not sure how to treat was the vec_perm within the splat_alpha
> function. It does not look all that pretty, but it will work regardless of
> the endianness.

Yes, indeed.

> Although it was suggested that I create a function that would perform
> the unpacking by itself, I thought that performing the operation while
> unpacking could save some lines of code and function calls. Thus I just
> replaced that part, as you will see in the patch.

Thanks. Adjusting the algorithm is also a good solution if it works at
least as fast as the old code.

> I also ran lowlevel-blt-bench with over_8888_8888 and the results (comparing
> unpatched and patched) were quite similar.

How similar are they? Is the new code faster than the old one? Is it 1%
slower? Is it 5% slower?

The 'over_8888_8_8888' operation is also worth trying, because it
exercises the 'pix_multiply' code more.

In general, it would be best to see the benchmark results included as
part of the commit message.
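
For reference, assuming the test programs end up under the 'test'
directory of the build tree, both operations can be benchmarked with
something like:

    ./test/lowlevel-blt-bench over_8888_8888
    ./test/lowlevel-blt-bench over_8888_8_8888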

> Please let me know if you have any comments.

As there are more significant changes than just endian-dependent
ifdefs, I tried to have a look at the disassembly dumps of the
generated code. Basically, I added the following code to
pixman-vmx.c:

    vector unsigned int
    vmx_splat_alpha (vector unsigned int pix)
    {
        return splat_alpha(pix);
    }

    vector unsigned int
    vmx_pix_multiply (vector unsigned int p, vector unsigned int a)
    {
        return pix_multiply(p, a);
    }

Then configured and built pixman for powerpc64 (with GCC 4.8.4):
"./configure --host=powerpc64-unknown-linux-gnu && make"

And finally got the disassembly dump:
"powerpc64-unknown-linux-gnu-objdump -d pixman/.libs/libpixman-vmx.a"

>  pixman/pixman-vmx.c | 106 +++++++++++++++++++++++++++++++++-------------------
>  1 file changed, 67 insertions(+), 39 deletions(-)
> 
> diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
> index c33631c..57918c1 100644
> --- a/pixman/pixman-vmx.c
> +++ b/pixman/pixman-vmx.c
> @@ -37,46 +37,49 @@
>  static force_inline vector unsigned int
>  splat_alpha (vector unsigned int pix)
>  {
> -    return vec_perm (pix, pix,
> -		     (vector unsigned char)AVV (
> -			 0x00, 0x00, 0x00, 0x00, 0x04, 0x04, 0x04, 0x04,
> -			 0x08, 0x08, 0x08, 0x08, 0x0C, 0x0C, 0x0C, 0x0C));
> +    union {
> +	unsigned short s;
> +	unsigned char c[2];
> +    } endian_xor = {0x0300};
> +
> +    /* endian_xor.c[1] will be 3 if little endian and 0 if big endian */
> +    vector unsigned char perm = vec_splat((vector unsigned char)

A coding style issue here - a space is needed between 'vec_splat'
and '('. There are also a few other occurrences of this in your patch.

> +					  AVV (endian_xor.c[1]),0);
> +    perm = vec_xor (perm,(vector unsigned char) AVV (
> +			  0x00, 0x00, 0x00, 0x00, 0x04, 0x04, 0x04, 0x04,
> +			  0x08, 0x08, 0x08, 0x08, 0x0C, 0x0C, 0x0C, 0x0C));
> +    return vec_perm (pix, pix, perm);
>  }

For this part, both the original and the patched code resulted in
identical instruction sequences:

0000000000000000 <.vmx_splat_alpha>:
   0:	3d 22 00 00 	addis   r9,r2,0
   4:	39 29 00 00 	addi    r9,r9,0
   8:	7c 00 48 ce 	lvx     v0,0,r9
   c:	10 42 10 2b 	vperm   v2,v2,v2,v0
  10:	4e 80 00 20 	blr

This is actually good. I was afraid that the compiler might screw it
up a bit and do something stupid like adding an extra VXOR instruction
here (for the 'vec_xor' intrinsic).

However, replacing 4 lines of simple code with 12 lines of something
more obfuscated just for the sake of avoiding an ifdef does not look
like a great idea to me. In my opinion, this is a perfect case where
an ifdef is justified.
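
Just to illustrate what I mean, an ifdef variant could look roughly
like this (the little endian permutation constant is simply the big
endian one with each selector index xor'ed with 3, i.e. exactly what
your code computes at run time, so please double check it against the
actual byte layout):

    static force_inline vector unsigned int
    splat_alpha (vector unsigned int pix)
    {
    #ifdef WORDS_BIGENDIAN
        return vec_perm (pix, pix,
                         (vector unsigned char)AVV (
                             0x00, 0x00, 0x00, 0x00, 0x04, 0x04, 0x04, 0x04,
                             0x08, 0x08, 0x08, 0x08, 0x0C, 0x0C, 0x0C, 0x0C));
    #else
        /* the same selectors, each xor'ed with 3 */
        return vec_perm (pix, pix,
                         (vector unsigned char)AVV (
                             0x03, 0x03, 0x03, 0x03, 0x07, 0x07, 0x07, 0x07,
                             0x0B, 0x0B, 0x0B, 0x0B, 0x0F, 0x0F, 0x0F, 0x0F));
    #endif
    }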

Regarding the ADDIS/ADDI/LVX instructions used to load a constant:
avoiding this overhead is exactly the reason why the 'splat_alpha'
function is 'static force_inline'. This instructs the compiler to
always inline it no matter what, and we are guaranteed not to have
this overhead in the inner loop. The artificially added
'vmx_splat_alpha' wrapper just allows us to see it in the disassembly
dump as a separate function.
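
As a stand-alone illustration of this point (not pixman code, just a
sketch using the plain GCC always_inline attribute and a made-up
'splat_alpha_demo' helper), the compiler is free to hoist the
permutation constant load in front of the caller's loop once the
helper is fully inlined:

    #include <altivec.h>

    static inline __attribute__((always_inline)) vector unsigned int
    splat_alpha_demo (vector unsigned int pix)
    {
        /* constant permutation vector, typically loaded via ADDIS/ADDI/LVX */
        return vec_perm (pix, pix,
                         (vector unsigned char){ 0, 0, 0, 0, 4,  4,  4,  4,
                                                 8, 8, 8, 8, 12, 12, 12, 12 });
    }

    void
    splat_alpha_loop (vector unsigned int *dst,
                      const vector unsigned int *src, int n)
    {
        int i;

        /* splat_alpha_demo is always inlined, so the constant load can be
         * moved out of this loop instead of being repeated per iteration */
        for (i = 0; i < n; i++)
            dst[i] = splat_alpha_demo (src[i]);
    }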

>  static force_inline vector unsigned int
>  pix_multiply (vector unsigned int p, vector unsigned int a)
>  {
> -    vector unsigned short hi, lo, mod;
> -
> -    /* unpack to short */
> -    hi = (vector unsigned short)
> -	vec_mergeh ((vector unsigned char)AVV (0),
> -		    (vector unsigned char)p);
> -
> -    mod = (vector unsigned short)
> -	vec_mergeh ((vector unsigned char)AVV (0),
> -		    (vector unsigned char)a);
> -
> -    hi = vec_mladd (hi, mod, (vector unsigned short)
> -                    AVV (0x0080, 0x0080, 0x0080, 0x0080,
> -                         0x0080, 0x0080, 0x0080, 0x0080));
> +    vector unsigned short hi, lo, even, odd;
> +
> +    /* unpack to short while multiplying p and a even positions */
> +    even = vec_mule((vector unsigned char)p, (vector unsigned char)a);
> +    even = vec_add(even, (vector unsigned short)AVV
> +			 (0x0080, 0x0080, 0x0080, 0x0080,
> +			  0x0080, 0x0080, 0x0080, 0x0080));
> +
> +    /* unpack to short while multiplying p and a odd positions */
> +    odd = vec_mulo ((vector unsigned char)p, (vector unsigned char)a);
> +    odd = vec_add (odd, (vector unsigned short)AVV 
> +			(0x0080, 0x0080, 0x0080, 0x0080,
> +			 0x0080, 0x0080, 0x0080, 0x0080));
> +
> +    /* change split from even and odd positions to high and low ends */
> +    hi = vec_perm (even, odd, (vector unsigned char)AVV
> +			     (0x00, 0x01, 0x10, 0x11, 0x02, 0x03, 0x12, 0x13,
> +			      0x04, 0x05, 0x14, 0x15, 0x06, 0x07, 0x16, 0x17));
> +    lo = vec_perm (even, odd, (vector unsigned char)AVV
> +			     (0x08, 0x09, 0x18, 0x19, 0x0A, 0x0B, 0x1A, 0x1B,
> +			      0x0C, 0x0D, 0x1C, 0x1D, 0x0E, 0x0F, 0x1E, 0x1F));
>  
>      hi = vec_adds (hi, vec_sr (hi, vec_splat_u16 (8)));
>  
>      hi = vec_sr (hi, vec_splat_u16 (8));
>  
> -    /* unpack to short */
> -    lo = (vector unsigned short)
> -	vec_mergel ((vector unsigned char)AVV (0),
> -		    (vector unsigned char)p);
> -    mod = (vector unsigned short)
> -	vec_mergel ((vector unsigned char)AVV (0),
> -		    (vector unsigned char)a);
> -
> -    lo = vec_mladd (lo, mod, (vector unsigned short)
> -                    AVV (0x0080, 0x0080, 0x0080, 0x0080,
> -                         0x0080, 0x0080, 0x0080, 0x0080));
> -
>      lo = vec_adds (lo, vec_sr (lo, vec_splat_u16 (8)));
>  
>      lo = vec_sr (lo, vec_splat_u16 (8));

For this part, there are differences in the code before/after patching.

=== Before ===

0000000000000020 <.vmx_pix_multiply>:
  20:	10 21 0c c4 	vxor    v1,v1,v1
  24:	3d 22 00 00 	addis   r9,r2,0
  28:	39 29 00 00 	addi    r9,r9,0
  2c:	11 a1 18 0c 	vmrghb  v13,v1,v3
  30:	7d 80 48 ce 	lvx     v12,0,r9
  34:	10 01 10 0c 	vmrghb  v0,v1,v2
  38:	10 41 11 0c 	vmrglb  v2,v1,v2
  3c:	10 21 19 0c 	vmrglb  v1,v1,v3
  40:	10 00 6b 22 	vmladduhm v0,v0,v13,v12
  44:	11 a8 03 4c 	vspltish v13,8
  48:	10 62 0b 22 	vmladduhm v3,v2,v1,v12
  4c:	10 40 6a 44 	vsrh    v2,v0,v13
  50:	10 23 6a 44 	vsrh    v1,v3,v13
  54:	10 00 12 40 	vadduhs v0,v0,v2
  58:	10 63 0a 40 	vadduhs v3,v3,v1
  5c:	10 40 6a 44 	vsrh    v2,v0,v13
  60:	10 63 6a 44 	vsrh    v3,v3,v13
  64:	10 42 18 8e 	vpkuhus v2,v2,v3
  68:	4e 80 00 20 	blr

=== After ===

0000000000000020 <.vmx_pix_multiply>:
  20:	10 02 1a 08 	vmuleub v0,v2,v3
  24:	3d 22 00 00 	addis   r9,r2,0
  28:	3d 42 00 00 	addis   r10,r2,0
  2c:	11 a8 03 4c 	vspltish v13,8
  30:	10 42 18 08 	vmuloub v2,v2,v3
  34:	39 29 00 00 	addi    r9,r9,0
  38:	39 4a 00 00 	addi    r10,r10,0
  3c:	7c 60 48 ce 	lvx     v3,0,r9
  40:	3d 22 00 00 	addis   r9,r2,0
  44:	7c 20 50 ce 	lvx     v1,0,r10
  48:	39 29 00 00 	addi    r9,r9,0
  4c:	7d 80 48 ce 	lvx     v12,0,r9
  50:	10 00 18 40 	vadduhm v0,v0,v3
  54:	10 42 18 40 	vadduhm v2,v2,v3
  58:	10 60 13 2b 	vperm   v3,v0,v2,v12
  5c:	10 20 10 6b 	vperm   v1,v0,v2,v1
  60:	10 03 6a 44 	vsrh    v0,v3,v13
  64:	10 41 6a 44 	vsrh    v2,v1,v13
  68:	10 21 12 40 	vadduhs v1,v1,v2
  6c:	10 63 02 40 	vadduhs v3,v3,v0
  70:	10 41 6a 44 	vsrh    v2,v1,v13
  74:	10 63 6a 44 	vsrh    v3,v3,v13
  78:	10 42 18 8e 	vpkuhus v2,v2,v3
  7c:	4e 80 00 20 	blr

The new code has more instructions, which does not look great. We get
three LVX instructions instead of just one for loading constants, and
the extra constants also need more registers. Because 'pix_multiply'
is a 'static force_inline' function too, the constant loads are
normally moved by the compiler out of the loop, so this should not
cause major performance problems.

But just based on looking at the disassembly dump, the old variant
subjectively seems to be preferable. I think that this change needs
to be carefully benchmarked.

I have a Playstation3 with OtherOS/Linux and would also run these
benchmarks myself, but I'm currently away from home and will not be
able to access it for a few more weeks.

> @@ -129,29 +132,26 @@ over (vector unsigned int src,
>      over (pix_multiply (src, mask),					\
>            pix_multiply (srca, mask), dest)
>  
> +#ifdef WORDS_BIGENDIAN
>  
> -#define COMPUTE_SHIFT_MASK(source)					\
> +# define COMPUTE_SHIFT_MASK(source)					\
>      source ## _mask = vec_lvsl (0, source);
>  
> -#define COMPUTE_SHIFT_MASKS(dest, source)				\
> +# define COMPUTE_SHIFT_MASKS(dest, source)				\
>      source ## _mask = vec_lvsl (0, source);
>  
> -#define COMPUTE_SHIFT_MASKC(dest, source, mask)				\
> +# define COMPUTE_SHIFT_MASKC(dest, source, mask)			\
>      mask ## _mask = vec_lvsl (0, mask);					\
>      source ## _mask = vec_lvsl (0, source);
>  
> -/* notice you have to declare temp vars...
> - * Note: tmp3 and tmp4 must remain untouched!
> - */
> -
> -#define LOAD_VECTORS(dest, source)			  \
> +# define LOAD_VECTORS(dest, source)			  \
>      tmp1 = (typeof(tmp1))vec_ld (0, source);		  \
>      tmp2 = (typeof(tmp2))vec_ld (15, source);		  \
>      v ## source = (typeof(v ## source))			  \
>  	vec_perm (tmp1, tmp2, source ## _mask);		  \
>      v ## dest = (typeof(v ## dest))vec_ld (0, dest);
>  
> -#define LOAD_VECTORSC(dest, source, mask)		  \
> +# define LOAD_VECTORSC(dest, source, mask)		  \
>      tmp1 = (typeof(tmp1))vec_ld (0, source);		  \
>      tmp2 = (typeof(tmp2))vec_ld (15, source);		  \
>      v ## source = (typeof(v ## source))			  \
> @@ -162,6 +162,34 @@ over (vector unsigned int src,
>      v ## mask = (typeof(v ## mask))			  \
>  	vec_perm (tmp1, tmp2, mask ## _mask);
>  
> +#else //WORDS_BIGENDIAN
> +
> +/* Now the COMPUTE_SHIFT_{MASK, MASKS, MASKC} below are just no-op.
> + * They are defined that way because little endian altivec can do unaligned
> + * reads natively and have no need for constructing the permutation pattern
> + * variables.
> + */
> +# define COMPUTE_SHIFT_MASK(source)
> +
> +# define COMPUTE_SHIFT_MASKS(dest, source)
> +
> +# define COMPUTE_SHIFT_MASKC(dest, source, mask)
> +
> +# define LOAD_VECTORS(dest, source)                        \
> +    v ## source = *((typeof(v ## source)*)source);        \
> +    v ## dest = *((typeof(v ## dest)*)dest);
> +
> +# define LOAD_VECTORSC(dest, source, mask)                 \
> +    v ## source = *((typeof(v ## source)*)source);        \
> +    v ## dest = *((typeof(v ## dest)*)dest);              \
> +    v ## mask = *((typeof(v ## mask)*)mask);
> +
> +#endif //WORDS_BIGENDIAN
> +
> +/* notice you have to declare temp vars...
> + * Note: tmp3 and tmp4 must remain untouched!
> + */
> +
>  #define LOAD_VECTORSM(dest, source, mask)				\
>      LOAD_VECTORSC (dest, source, mask)					\
>      v ## source = pix_multiply (v ## source,				\

To sum it up:

0. Fix the coding style issues (the missing spaces before '(').
1. Please just use an ifdef for 'splat_alpha'.
2. Double check the benchmark results for the 'pix_multiply' changes.
   It is perfectly fine if the results are the same as or faster than
   the old code. But if the results become worse, then consider
   finding a way to keep the old variant.

And then the patch should be good. Thanks for addressing all the
previous comments and for your patience.

-- 
Best regards,
Siarhei Siamashka

