[Pixman] [PATCH 1/1] vmx: workarounds to fix powerpc little endian particularities

Siarhei Siamashka siarhei.siamashka at gmail.com
Fri May 29 04:45:38 PDT 2015


On Thu, 28 May 2015 10:22:48 -0400
Fernando Seiti Furusato <ferseiti at linux.vnet.ibm.com> wrote:

> I have made some changes to the file pixman-vmx.c, which uses vmx (aka altivec)
> to optimize pixman. Basically, what I did:
>  Changed the usage of vec_perm, vec_mergeh and vec_mergel. They were giving
> weird results when running on little endian. That was because the integer
> vectors were being cast to char, which caused them to be reordered and
> permuted byte by byte.
>  Replaced the usage of vec_lvsl with a direct unaligned assignment (=),
> because, according to the Power ABI Specification, the use of lvsl is
> deprecated on ppc64le.
>  Made the COMPUTE_SHIFT_{MASK,MASKS,MASKC} macros no-ops on powerpc
> little endian, since unaligned access is supported on ppc64le.
> After those changes, all tests passed on ppc64le. Tests on ppc64 and powerpc
> got the same results as before the changes.
> 
> Signed-off-by: Fernando Seiti Furusato <ferseiti at linux.vnet.ibm.com>

Thanks for the updated patch. It addresses most of the review comments.
Just one thing could still be improved.

> ---
>  pixman/pixman-vmx.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 88 insertions(+), 9 deletions(-)
> 
> diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
> index c33631c..5b6e98c 100644
> --- a/pixman/pixman-vmx.c
> +++ b/pixman/pixman-vmx.c
> @@ -34,6 +34,7 @@
>  
>  #define AVV(x...) {x}
>  
> +#ifdef WORDS_BIGENDIAN
>  static force_inline vector unsigned int
>  splat_alpha (vector unsigned int pix)
>  {
> @@ -84,6 +85,59 @@ pix_multiply (vector unsigned int p, vector unsigned int a)
>      return (vector unsigned int)vec_packsu (hi, lo);
>  }
>  
> +#else //ifdef WORDS_BIGENDIAN
> +
> +static force_inline vector unsigned int
> +splat_alpha (vector unsigned int pix)
> +{
> +    return vec_perm (pix, pix,
> +		     (vector unsigned char)AVV (
> +			 0x03, 0x03, 0x03, 0x03, 0x07, 0x07, 0x07, 0x07,
> +			 0x0B, 0x0B, 0x0B, 0x0B, 0x0F, 0x0F, 0x0F, 0x0F));
> +}
> +
> +static force_inline vector unsigned int
> +pix_multiply (vector unsigned int p, vector unsigned int a)
> +{
> +    vector unsigned short hi, lo, mod;
> +
> +    /* unpack to short */
> +    hi = (vector unsigned short)
> +	vec_mergeh ((vector unsigned char)p,
> +		    (vector unsigned char)AVV (0));
> +
> +    mod = (vector unsigned short)
> +	vec_mergeh ((vector unsigned char)a,
> +		    (vector unsigned char)AVV (0));
> +
> +    hi = vec_mladd (hi, mod, (vector unsigned short)
> +                    AVV (0x0080, 0x0080, 0x0080, 0x0080,
> +                         0x0080, 0x0080, 0x0080, 0x0080));
> +
> +    hi = vec_adds (hi, vec_sr (hi, vec_splat_u16 (8)));
> +
> +    hi = vec_sr (hi, vec_splat_u16 (8));
> +
> +    /* unpack to short */
> +    lo = (vector unsigned short)
> +	vec_mergel ((vector unsigned char)p,
> +		    (vector unsigned char)AVV (0));
> +    mod = (vector unsigned short)
> +	vec_mergel ((vector unsigned char)a,
> +		    (vector unsigned char)AVV (0));
> +
> +    lo = vec_mladd (lo, mod, (vector unsigned short)
> +                    AVV (0x0080, 0x0080, 0x0080, 0x0080,
> +                         0x0080, 0x0080, 0x0080, 0x0080));
> +
> +    lo = vec_adds (lo, vec_sr (lo, vec_splat_u16 (8)));
> +
> +    lo = vec_sr (lo, vec_splat_u16 (8));
> +
> +    return (vector unsigned int)vec_packsu (hi, lo);
> +}

Sorry, I did not explain it properly earlier. I did not mean that the
whole 'pix_multiply' function should be replicated in the little
endian code path.

The point is that we need to take care of some differences
between big and little endian powerpc variants. And we want
to isolate these differences in the smallest possible unit,
while still keeping the code readable and maintainable.

In your first patch, you treated the vec_mergel/vec_mergeh
intrinsics as such smallest units and replaced them with the
MERGEL/MERGEH macros. That works correctly in this context, but
my complaint was that the code is not very clean this way. If
the vec_mergel/vec_mergeh intrinsics were supposed to be
endian-agnostic and behave like the MERGEL/MERGEH macros, they
would have been implemented that way in the first place.
Redefining the meaning of standard intrinsics looks ugly.
Basically, the individual intrinsics are too small a unit for
isolating big/little endian differences in a non-controversial
way.
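
For reference, the disliked approach looked roughly like this
(reconstructed here for illustration only, it may not match your
first patch exactly):

    #ifdef WORDS_BIGENDIAN
    # define MERGEH(a, b) vec_mergeh (a, b)
    #else
    # define MERGEH(a, b) vec_mergeh (b, a)
    #endif

Here the macro silently swaps the operands of a standard intrinsic
on little endian, which is exactly the kind of redefinition that is
confusing to read.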

Now in this new patch the whole 'pix_multiply' function is
replicated. But it is too large a unit, and duplicating that much
code is not nice either.

What I actually wanted to see was a new static force_inline function.
This function can isolate the big/little endian differences, be
reasonably small and do something meaningful by itself. For example,
the SSE2 code has the 'unpack_128_2x128' function for this:

    http://cgit.freedesktop.org/pixman/tree/pixman/pixman-sse2.c?id=pixman-0.32.6#n68

    static force_inline void
    unpack_128_2x128 (__m128i data, __m128i* data_lo, __m128i* data_hi)
    {
        *data_lo = _mm_unpacklo_epi8 (data, _mm_setzero_si128 ());
        *data_hi = _mm_unpackhi_epi8 (data, _mm_setzero_si128 ());
    }

This function takes a single 128-bit vector with 8-bit elements as
the input argument and produces two 128-bit vectors (low and high part)
with 16-bit elements as the output. The code is simple and it is easy
to see what it does.

It would be great to have something similar in the powerpc code too.
The 'force_inline' attribute ensures that the compiler inlines this
function, so there is no call overhead. You can (and are also
encouraged to) verify that performance has not regressed by using
the 'lowlevel-blt-bench' program in the 'test' directory.
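
For example, here is an untested sketch of what such a helper might
look like in the VMX code (the name is borrowed from the SSE2 code,
and the exact types are just a suggestion, not a final
implementation):

    /* Unpack a vector of 8-bit elements into two vectors of
     * 16-bit elements, isolating the big/little endian
     * difference in a single small helper.
     */
    static force_inline void
    unpack_128_2x128 (vector unsigned int data,
                      vector unsigned short *hi,
                      vector unsigned short *lo)
    {
        vector unsigned char zero = vec_splat_u8 (0);

    #ifdef WORDS_BIGENDIAN
        *hi = (vector unsigned short)
            vec_mergeh (zero, (vector unsigned char)data);
        *lo = (vector unsigned short)
            vec_mergel (zero, (vector unsigned char)data);
    #else
        *hi = (vector unsigned short)
            vec_mergeh ((vector unsigned char)data, zero);
        *lo = (vector unsigned short)
            vec_mergel ((vector unsigned char)data, zero);
    #endif
    }

Then 'pix_multiply' could just do something like

    unpack_128_2x128 (p, &hi, &lo);
    unpack_128_2x128 (a, &mod_hi, &mod_lo);

without any endian #ifdefs, and the arithmetic would exist in only
one copy. For the benchmark, running something like

    $ ./test/lowlevel-blt-bench over_8888_8888

before and after the change should be enough to spot a regression.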

> +#endif //WORDS_BIGENDIAN
> +
>  static force_inline vector unsigned int
>  pix_add (vector unsigned int a, vector unsigned int b)
>  {
> @@ -129,29 +183,26 @@ over (vector unsigned int src,
>      over (pix_multiply (src, mask),					\
>            pix_multiply (srca, mask), dest)
>  
> +#ifdef WORDS_BIGENDIAN
>  
> -#define COMPUTE_SHIFT_MASK(source)					\
> +# define COMPUTE_SHIFT_MASK(source)					\
>      source ## _mask = vec_lvsl (0, source);
>  
> -#define COMPUTE_SHIFT_MASKS(dest, source)				\
> +# define COMPUTE_SHIFT_MASKS(dest, source)				\
>      source ## _mask = vec_lvsl (0, source);
>  
> -#define COMPUTE_SHIFT_MASKC(dest, source, mask)				\
> +# define COMPUTE_SHIFT_MASKC(dest, source, mask)			\
>      mask ## _mask = vec_lvsl (0, mask);					\
>      source ## _mask = vec_lvsl (0, source);
>  
> -/* notice you have to declare temp vars...
> - * Note: tmp3 and tmp4 must remain untouched!
> - */
> -
> -#define LOAD_VECTORS(dest, source)			  \
> +# define LOAD_VECTORS(dest, source)			  \
>      tmp1 = (typeof(tmp1))vec_ld (0, source);		  \
>      tmp2 = (typeof(tmp2))vec_ld (15, source);		  \
>      v ## source = (typeof(v ## source))			  \
>  	vec_perm (tmp1, tmp2, source ## _mask);		  \
>      v ## dest = (typeof(v ## dest))vec_ld (0, dest);
>  
> -#define LOAD_VECTORSC(dest, source, mask)		  \
> +# define LOAD_VECTORSC(dest, source, mask)		  \
>      tmp1 = (typeof(tmp1))vec_ld (0, source);		  \
>      tmp2 = (typeof(tmp2))vec_ld (15, source);		  \
>      v ## source = (typeof(v ## source))			  \
> @@ -162,6 +213,34 @@ over (vector unsigned int src,
>      v ## mask = (typeof(v ## mask))			  \
>  	vec_perm (tmp1, tmp2, mask ## _mask);
>  
> +#else //WORDS_BIGENDIAN
> +
> +/* Now the COMPUTE_SHIFT_{MASK, MASKS, MASKC} below are just no-op.
> + * They are defined that way because little endian altivec can do unaligned
> + * reads natively and have no need for constructing the permutation pattern
> + * variables.
> + */
> +# define COMPUTE_SHIFT_MASK(source)
> +
> +# define COMPUTE_SHIFT_MASKS(dest, source)
> +
> +# define COMPUTE_SHIFT_MASKC(dest, source, mask)
> +
> +# define LOAD_VECTORS(dest, source)                        \
> +    v ## source = *((typeof(v ## source)*)source);        \
> +    v ## dest = *((typeof(v ## dest)*)dest);
> +
> +# define LOAD_VECTORSC(dest, source, mask)                 \
> +    v ## source = *((typeof(v ## source)*)source);        \
> +    v ## dest = *((typeof(v ## dest)*)dest);              \
> +    v ## mask = *((typeof(v ## mask)*)mask);
> +
> +#endif //WORDS_BIGENDIAN
> +
> +/* notice you have to declare temp vars...
> + * Note: tmp3 and tmp4 must remain untouched!
> + */
> +
>  #define LOAD_VECTORSM(dest, source, mask)				\
>      LOAD_VECTORSC (dest, source, mask)					\
>      v ## source = pix_multiply (v ## source,				\

This part looks good to me.

-- 
Best regards,
Siarhei Siamashka

