[Mesa-dev] [PATCH 3/3] gallivm: optimize gather a bit, by using supplied destination type
Jose Fonseca
jfonseca at vmware.com
Tue Dec 6 15:31:40 UTC 2016
On 03/12/16 16:12, sroland at vmware.com wrote:
> From: Roland Scheidegger <sroland at vmware.com>
>
> By using a dst_type in the gather interface, gather has some more
> knowledge about how values should be fetched.
> E.g. if this is a 3x32bit fetch and dst_type is a 4x32bit vector, gather
> will no longer ZExt a 96bit scalar value to 128bit, but will
> just fetch the 96 bits as a 3x32bit vector (this is still going to be
> 2 loads of course, but the loads can go directly into a simd vector
> that way).
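To make the 3x32bit case concrete, here is a small C sketch using SSE2 intrinsics. It illustrates the idea only and is not the gallivm code, which emits LLVM IR. The helper name `fetch_3x32_vec` is hypothetical, and the sketch assumes an x86 target with SSE2; other targets take a scalar fallback with the same result. The two loads are a 64-bit `movq` for the first two elements and a 32-bit move for the third, joined by one unpack, so the data ends up in the SIMD register without first building a 96-bit scalar.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper (not part of gallivm): fetch three consecutive
 * 32-bit values into a 4x32-bit vector with two loads that go straight
 * into a SIMD register. Lane 3 is set to zero in this sketch.
 */
static void fetch_3x32_vec(const void *src, uint32_t out[4])
{
   const uint8_t *p = src;
   assert(src != NULL && out != NULL);
#if defined(__SSE2__)
   int32_t third;
   memcpy(&third, p + 8, sizeof third);
   /* Load 1: 64-bit load of x0, x1; the upper 64 bits are zeroed. */
   __m128i lo = _mm_loadl_epi64((const __m128i *)p);
   /* Load 2: x2 moved into lane 0, the other lanes zeroed. */
   __m128i hi = _mm_cvtsi32_si128(third);
   /* Combine the low 64 bits of each: [x0, x1, x2, 0]. */
   __m128i v = _mm_unpacklo_epi64(lo, hi);
   _mm_storeu_si128((__m128i *)out, v);
#else
   /* Portable fallback producing the same lanes. */
   memcpy(out, p, 3 * sizeof(uint32_t));
   out[3] = 0;
#endif
}
```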
>
> Also, we can now try to use the right int/float type. This should
> make no real difference, since there are typically no domain transition
> penalties for such simd loads. However, it does make a difference in
> practice, because llvm will use different shuffle lowering afterwards, so
> the caller can use this to trick llvm into using sane shuffles (and yes,
> llvm is really stupid there: nothing against using the shuffle
> instruction from the correct domain, but not at the cost of doing 3 times
> more shuffles; the case which actually matters is its refusal to use shufps
> for integer values).
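The shufps point can be illustrated with intrinsics. The sketch below (hypothetical helper `even_lanes_u32`, x86 SSE2 assumed, scalar fallback elsewhere) bitcasts integer vectors to float so that a single `shufps` can pick lanes from two sources. shufps only moves bits, so the integer values pass through unchanged. Done purely in the integer domain with SSE2, the same selection needs more instructions (e.g. two `pshufd` plus a `punpcklqdq`).

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper: select the even lanes of two 4x32-bit integer
 * vectors, giving [a0, a2, b0, b2]. On SSE2 this is one shufps on
 * integer data reinterpreted as float (a bitcast, no conversion).
 */
static void even_lanes_u32(const uint32_t a[4], const uint32_t b[4],
                           uint32_t out[4])
{
   assert(a != NULL && b != NULL && out != NULL);
#if defined(__SSE2__)
   __m128 fa = _mm_castsi128_ps(_mm_loadu_si128((const __m128i *)a));
   __m128 fb = _mm_castsi128_ps(_mm_loadu_si128((const __m128i *)b));
   /* _MM_SHUFFLE(2, 0, 2, 0): lanes 0-1 from fa (a0, a2), lanes 2-3 from fb (b0, b2). */
   __m128 r = _mm_shuffle_ps(fa, fb, _MM_SHUFFLE(2, 0, 2, 0));
   _mm_storeu_si128((__m128i *)out, _mm_castps_si128(r));
#else
   out[0] = a[0];
   out[1] = a[2];
   out[2] = b[0];
   out[3] = b[2];
#endif
}
```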
>
> Also make some attempt to avoid things which look great on paper but llvm
> doesn't really handle, e.g. fetching 3-element 8 bit and 16 bit vectors,
> which is simply disastrous (I suspect the type legalizer is to blame, trying
> to extend these vectors to 128bit types somehow). These are therefore still
> fetched as scalars like before, which is suboptimal due to the ZExt.
>
> Remove the ability to truncate (there's no point, this is gather, not
> conversion), as it is complex enough already.
>
> While here, also implement not just the float but also the 64bit avx2
> gathers (disabled, though, since based on the theoretical numbers the
> benefit just isn't there at all until Skylake at least).
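For reference, this is roughly what a 4x64bit AVX2 gather with byte offsets does, written with the `_mm256_i32gather_epi64` intrinsic (compiled only when `__AVX2__` is defined) and an equivalent scalar loop otherwise. The helper name `gather_4x64` is hypothetical, and it only illustrates the operation; the patch itself keeps the 64bit avx2 gathers disabled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/*
 * Hypothetical helper: gather four 64-bit values from base + offsets[i],
 * where offsets are in bytes. With AVX2 this is a single vpgatherdq;
 * otherwise it is a plain scalar loop with the same result.
 */
static void gather_4x64(const void *base, const int32_t offsets[4],
                        uint64_t out[4])
{
   assert(base != NULL && offsets != NULL && out != NULL);
#if defined(__AVX2__)
   __m128i vindex = _mm_loadu_si128((const __m128i *)offsets);
   /* scale 1: the offsets are byte offsets */
   __m256i v = _mm256_i32gather_epi64((const long long *)base, vindex, 1);
   _mm256_storeu_si256((__m256i *)out, v);
#else
   const uint8_t *p = base;
   for (int i = 0; i < 4; i++)
      memcpy(&out[i], p + offsets[i], sizeof out[i]);
#endif
}
```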
> ---
> src/gallium/auxiliary/gallivm/lp_bld_gather.c | 42 +++++++++++++++++++++++++--
> 1 file changed, 39 insertions(+), 3 deletions(-)
>
> diff --git a/src/gallium/auxiliary/gallivm/lp_bld_gather.c b/src/gallium/auxiliary/gallivm/lp_bld_gather.c
> index 439bbb6..1f7ba92 100644
> --- a/src/gallium/auxiliary/gallivm/lp_bld_gather.c
> +++ b/src/gallium/auxiliary/gallivm/lp_bld_gather.c
> @@ -33,6 +33,7 @@
> #include "lp_bld_format.h"
> #include "lp_bld_gather.h"
> #include "lp_bld_swizzle.h"
> +#include "lp_bld_type.h"
> #include "lp_bld_init.h"
> #include "lp_bld_intr.h"
>
> @@ -270,17 +271,52 @@ lp_build_gather(struct gallivm_state *gallivm,
>
>        LLVMTypeRef dst_elem_type = LLVMIntTypeInContext(gallivm->context, dst_width);
>        LLVMTypeRef dst_vec_type = LLVMVectorType(dst_elem_type, length);
> +      LLVMTypeRef gather_vec_type = dst_vec_type;
>        unsigned i;
> -
> -      res = LLVMGetUndef(dst_vec_type);
> +      boolean vec_zext = FALSE;
> +      unsigned gather_width = dst_width;
> +
> +
> +      if (src_width == 16 && dst_width == 32) {
> +         LLVMTypeRef g_elem_type = LLVMIntTypeInContext(gallivm->context, dst_width / 2);
> +         gather_vec_type = LLVMVectorType(g_elem_type, length);
> +         /*
> +          * Note that llvm is never able to optimize zext/insert combos
> +          * directly (i.e. zero the simd reg, then place the elements into
> +          * the appropriate place directly). And 16->32bit zext simd loads
> +          * aren't possible (instead loading to scalar reg first).
> +          * (I think this has to do with scalar/vector transition.)
> +          * No idea about other archs...
> +          * We could do this manually, but instead we just use a vector
> +          * zext, which is simple enough (and, in fact, llvm might optimize
> +          * this away).
> +          * (We're not trying that with other bit widths as that might not be
> +          * easier, in particular with 8 bit values at least with only sse2.)
> +          */
> +         vec_zext = TRUE;
> +         gather_width = 16;
> +      }
> +      res = LLVMGetUndef(gather_vec_type);
>        for (i = 0; i < length; ++i) {
>           LLVMValueRef index = lp_build_const_int32(gallivm, i);
>           LLVMValueRef elem;
>           elem = lp_build_gather_elem(gallivm, length,
> -                                     src_width, dst_width, aligned,
> +                                     src_width, gather_width, aligned,
>                                       base_ptr, offsets, i, vector_justify);
>           res = LLVMBuildInsertElement(gallivm->builder, res, elem, index, "");
>        }
> +      if (vec_zext) {
> +         res = LLVMBuildZExt(gallivm->builder, res, dst_vec_type, "");
> +         if (vector_justify) {
> +#if PIPE_ARCH_BIG_ENDIAN
> +            struct lp_type dst_type;
> +            unsigned sv = dst_width - src_width;
> +            dst_type = lp_type_uint_vec(dst_width, dst_width * length);
> +            res = LLVMBuildShl(gallivm->builder, res,
> +                               lp_build_const_int_vec(gallivm, dst_type, sv), "");
> +#endif
> +         }
> +      }
>     }
>
>     return res;
>
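The new 16->32bit path in the hunk above gathers into a 16-bit vector and then zero-extends the whole vector, rather than zero-extending each element. A rough C/SSE2 equivalent is sketched below. `gather_u16_to_u32` and its `justify_high` flag are hypothetical; `justify_high` stands in for the `PIPE_ARCH_BIG_ENDIAN` + `vector_justify` shift by `dst_width - src_width` (16 bits). With plain SSE2, one way to express the vector zext is `punpcklwd` against a zero register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical sketch of the 16->32 path: gather four 16-bit values
 * (byte offsets) into a 4x16-bit vector, then widen the whole vector
 * to 4x32 bits in one step. justify_high mirrors the big-endian
 * vector_justify case, shifting each 32-bit lane left by 16 bits.
 */
static void gather_u16_to_u32(const uint8_t *base, const uint32_t offsets[4],
                              bool justify_high, uint32_t out[4])
{
   uint16_t tmp[4];
   assert(base != NULL && offsets != NULL && out != NULL);
   for (int i = 0; i < 4; i++)
      memcpy(&tmp[i], base + offsets[i], sizeof tmp[i]);
#if defined(__SSE2__)
   /* 4x16-bit lanes in the low 64 bits of the register. */
   __m128i v16 = _mm_loadl_epi64((const __m128i *)tmp);
   /* Vector zext: interleave with zero, giving 4x32-bit lanes. */
   __m128i v32 = _mm_unpacklo_epi16(v16, _mm_setzero_si128());
   if (justify_high)
      v32 = _mm_slli_epi32(v32, 16);
   _mm_storeu_si128((__m128i *)out, v32);
#else
   for (int i = 0; i < 4; i++)
      out[i] = justify_high ? (uint32_t)tmp[i] << 16 : (uint32_t)tmp[i];
#endif
}
```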
Series looks good to me.
Reviewed-by: Jose Fonseca <jfonseca at vmware.com>