[Mesa-dev] [PATCH 3/3] gallivm: optimize gather a bit, by using supplied destination type

Tue Dec 6 15:31:40 UTC 2016

On 03/12/16 16:12, sroland at vmware.com wrote:
> From: Roland Scheidegger <sroland at vmware.com>
>
> By using a dst_type in the gather interface, gather has some more
> knowledge about how values should be fetched.
> E.g. if this is a 3x32bit fetch and dst_type is 4x32bit vector gather
> will no longer do a ZExt with a 96bit scalar value to 128bit, but
> just fetch the 96bit as 3x32bit vector (this is still going to be
> 2 loads of course, but the loads can be done directly to simd vector
> that way).
> Also, we can now try to use the right int/float type. This should
> make no difference really since there's typically no domain transition
> penalties for such simd loads, however it actually makes a difference
> since llvm will use different shuffle lowering afterwards so the caller
> can use this to trick llvm into using sane shuffle afterwards (and yes
> llvm is really stupid there - nothing against using the shuffle
> instruction from the correct domain, but not at the cost of doing 3 times
> more shuffles, the case which actually matters is refusal to use shufps
> for integer values).
> Also attempt to avoid things which look great on paper but which llvm
> doesn't really handle (e.g. fetching 3-element 8 bit and 16 bit vectors,
> which is simply disastrous - I suspect the type legalizer is to blame,
> trying to extend these vectors to 128bit types somehow), so these are
> still fetched with scalars like before, which is suboptimal due to the ZExt.
>
> Remove the ability for truncation (no point, this is gather, not conversion)
> as it is complex enough already.
>
> While here also implement not just the float, but also the 64bit avx2
> gathers (disabled though since based on the theoretical numbers the benefit
> just isn't there at all until Skylake at least).
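
For readers following the 3x32bit case described above, here is a minimal
scalar sketch of the idea, not the gallivm code itself. The helper name
fetch_3x32_as_vec4 is hypothetical, and zeroing the fourth lane is an
assumption made only so the result is well defined in plain C. It shows the
96 bits being fetched as one 64bit and one 32bit load straight into the
destination lanes, with no widening of a 96bit scalar to 128bit.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not part of gallivm): fetch a 3x32bit element into
 * a 4x32bit destination using two loads, as the commit message describes.
 */
static void
fetch_3x32_as_vec4(const uint8_t *src, uint32_t dst[4])
{
   memcpy(&dst[0], src, 8);      /* first load: 64 bits into lanes 0 and 1 */
   memcpy(&dst[2], src + 8, 4);  /* second load: 32 bits into lane 2 */
   dst[3] = 0;                   /* assumption: unused lane zeroed in this sketch */
}
```

Only 12 bytes are read from src, so the sketch never touches memory past the
3-element source.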
> ---
>  src/gallium/auxiliary/gallivm/lp_bld_gather.c | 42 +++++++++++++++++++++++++--
>  1 file changed, 39 insertions(+), 3 deletions(-)
>
> diff --git a/src/gallium/auxiliary/gallivm/lp_bld_gather.c b/src/gallium/auxiliary/gallivm/lp_bld_gather.c
> index 439bbb6..1f7ba92 100644
> --- a/src/gallium/auxiliary/gallivm/lp_bld_gather.c
> +++ b/src/gallium/auxiliary/gallivm/lp_bld_gather.c
> @@ -33,6 +33,7 @@
>  #include "lp_bld_format.h"
>  #include "lp_bld_gather.h"
>  #include "lp_bld_swizzle.h"
> +#include "lp_bld_type.h"
>  #include "lp_bld_init.h"
>  #include "lp_bld_intr.h"
>
> @@ -270,17 +271,52 @@ lp_build_gather(struct gallivm_state *gallivm,
>
>        LLVMTypeRef dst_elem_type = LLVMIntTypeInContext(gallivm->context, dst_width);
>        LLVMTypeRef dst_vec_type = LLVMVectorType(dst_elem_type, length);
> +      LLVMTypeRef gather_vec_type = dst_vec_type;
>        unsigned i;
> -
> -      res = LLVMGetUndef(dst_vec_type);
> +      boolean vec_zext = FALSE;
> +      unsigned gather_width = dst_width;
> +
> +
> +      if (src_width == 16 && dst_width == 32) {
> +         LLVMTypeRef g_elem_type = LLVMIntTypeInContext(gallivm->context, dst_width / 2);
> +         gather_vec_type = LLVMVectorType(g_elem_type, length);
> +         /*
> +          * Note that llvm is never able to optimize zext/insert combos
> +          * directly (i.e. zero the simd reg, then place the elements into
> +          * the appropriate place directly). And 16->32bit zext simd loads
> +          * aren't possible (instead loading to scalar reg first).
> +          * (I think this has to do with scalar/vector transition.)
> +          * No idea about other archs...
> +          * We could do this manually, but instead we just use a vector
> +          * zext, which is simple enough (and, in fact, llvm might optimize
> +          * this away).
> +          * (We're not trying that with other bit widths as that might not be
> +          * easier, in particular with 8 bit values at least with only sse2.)
> +          */
> +         vec_zext = TRUE;
> +         gather_width = 16;
> +      }
> +      res = LLVMGetUndef(gather_vec_type);
>        for (i = 0; i < length; ++i) {
>           LLVMValueRef index = lp_build_const_int32(gallivm, i);
>           LLVMValueRef elem;
>           elem = lp_build_gather_elem(gallivm, length,
> -                                     src_width, dst_width, aligned,
> +                                     src_width, gather_width, aligned,
>                                       base_ptr, offsets, i, vector_justify);
>           res = LLVMBuildInsertElement(gallivm->builder, res, elem, index, "");
>        }
> +      if (vec_zext) {
> +         res = LLVMBuildZExt(gallivm->builder, res, dst_vec_type, "");
> +         if (vector_justify) {
> +#if PIPE_ARCH_BIG_ENDIAN
> +            struct lp_type dst_type;
> +            unsigned sv = dst_width - src_width;
> +            dst_type = lp_type_uint_vec(dst_width, dst_width * length);
> +            res = LLVMBuildShl(gallivm->builder, res,
> +                               lp_build_const_int_vec(gallivm, dst_type, sv), "");
> +#endif
> +         }
> +      }
>     }
>
>     return res;
>
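
As a reading aid for the hunk above, here is a rough scalar model of the
16->32bit path, assuming a fixed vector length of 4. The names
gather_16_to_32 and GATHER_LENGTH are hypothetical, and the runtime
big_endian flag stands in for the compile-time PIPE_ARCH_BIG_ENDIAN check in
the patch. It gathers into 16bit lanes first, widens all lanes in one step
(the vector ZExt), and on big endian with vector_justify set shifts each lane
left by dst_width - src_width.

```c
#include <stdint.h>
#include <string.h>

#define GATHER_LENGTH 4   /* hypothetical vector length for this sketch */

/*
 * Hypothetical scalar model (not the gallivm implementation) of the
 * src_width == 16, dst_width == 32 path.
 */
static void
gather_16_to_32(const uint8_t *base_ptr,
                const uint32_t offsets[GATHER_LENGTH],
                int vector_justify, int big_endian,
                uint32_t res[GATHER_LENGTH])
{
   uint16_t narrow[GATHER_LENGTH];
   unsigned i;

   /* Gather each element at its byte offset into a 16bit lane. */
   for (i = 0; i < GATHER_LENGTH; ++i)
      memcpy(&narrow[i], base_ptr + offsets[i], sizeof narrow[i]);

   /* Widen every lane at once; this models the single vector ZExt. */
   for (i = 0; i < GATHER_LENGTH; ++i)
      res[i] = narrow[i];

   /* Big endian justification: shift by dst_width - src_width = 16. */
   if (vector_justify && big_endian) {
      for (i = 0; i < GATHER_LENGTH; ++i)
         res[i] <<= 32 - 16;
   }
}
```

With vector_justify clear, or on little endian, the result is just the
zero-extended 16bit values, which matches the patch leaving res untouched
after the ZExt in that case.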

Series looks good to me.

Reviewed-by: Jose Fonseca <jfonseca at vmware.com>