[Mesa-dev] [PATCH 6/6] draw: use SoA fetch, not AoS one

Tue Dec 20 14:20:20 UTC 2016

On 12/12/16 00:12, sroland at vmware.com wrote:
> From: Roland Scheidegger <sroland at vmware.com>
>
> Now that there is an SoA fetch which never falls back, we should usually get
> results which are better or at least not worse (something like rgba32f will
> stay the same). It might be worse, though, in some cases where the format
> doesn't require conversion (e.g. rg32f) and the value goes straight to
> output. If llvm was able to see through all the shuffles, it might have been
> able to do away with the aos->soa->aos transpose entirely. That can no longer
> work, except possibly for 4-channel formats, because the undef channels are
> now replaced with 0/1 before the second transpose and not the first (llvm
> will definitely not be able to figure that out). That case might actually be
> quite common, but I'm not sure llvm really could optimize it in the first
> place, and if it's a problem we should just special-case such inputs (though
> note that if conversion is needed, it isn't obvious whether it's better to
> skip the transpose or do the conversion AoS-style).
>
> For cases which get way better, think something like R16_UNORM with 8-wide
> vectors: this was 8 sign-extend fetches, 8 cvt, 8 muls, followed by
> a couple of shuffles to stitch things together (if it is smart enough,
> 6 unpacks) and then a (8-wide) transpose (not sure if llvm could even
> optimize the shuffles + transpose, since the 16bit values were actually
> sign-extended to 128bit before being cast to a float vec, so that would be
> another 8 unpacks). Now that is just 8 fetches (directly inserted into
> vector, albeit there's one 128bit insert needed), 1 cvt, 1 mul.
> ---
>  src/gallium/auxiliary/draw/draw_llvm.c | 54 +++++++++++++++++++++++++---------
>  1 file changed, 40 insertions(+), 14 deletions(-)
>
> diff --git a/src/gallium/auxiliary/draw/draw_llvm.c b/src/gallium/auxiliary/draw/draw_llvm.c
> index 19b75a5..f895b76 100644
> --- a/src/gallium/auxiliary/draw/draw_llvm.c
> +++ b/src/gallium/auxiliary/draw/draw_llvm.c
> @@ -755,11 +755,9 @@ fetch_vector(struct gallivm_state *gallivm,
>               LLVMValueRef *inputs,
>               LLVMValueRef indices)
>  {
> -   LLVMValueRef zero = LLVMConstNull(LLVMInt32TypeInContext(gallivm->context));
>     LLVMBuilderRef builder = gallivm->builder;
>     struct lp_build_context blduivec;
>     LLVMValueRef offset, valid_mask;
> -   LLVMValueRef aos_fetch[LP_MAX_VECTOR_WIDTH / 32];
>     unsigned i;
>
>     lp_build_context_init(&blduivec, gallivm, lp_uint_type(vs_type));
> @@ -783,21 +781,49 @@ fetch_vector(struct gallivm_state *gallivm,
>     }
>
>     /*
> -    * Note: we probably really want to use SoA fetch, not AoS one (albeit
> -    * for most formats it will amount to the same as this isn't very
> -    * optimized). But looks dangerous since it assumes alignment.
> +    * Use SoA fetch. This should produce better code usually.
> +    * Albeit it's possible there's exceptions (in particular if the fetched
> +    * value is going directly to output if it's something like RG32F).
>      */
> -   for (i = 0; i < vs_type.length; i++) {
> -      LLVMValueRef offset1, elem;
> -      elem = lp_build_const_int32(gallivm, i);
> -      offset1 = LLVMBuildExtractElement(builder, offset, elem, "");
> +   if (1) {
> +      struct lp_type res_type = vs_type;
> +      /* The type handling is annoying here... */
> +      if (format_desc->colorspace == UTIL_FORMAT_COLORSPACE_RGB &&
> +          format_desc->channel[0].pure_integer) {
> +         if (format_desc->channel[0].type == UTIL_FORMAT_TYPE_SIGNED) {
> +            res_type = lp_type_int_vec(vs_type.width, vs_type.width * vs_type.length);
> +         }
> +         else if (format_desc->channel[0].type == UTIL_FORMAT_TYPE_UNSIGNED) {
> +            res_type = lp_type_uint_vec(vs_type.width, vs_type.width * vs_type.length);
> +         }
> +      }
>
> -      aos_fetch[i] = lp_build_fetch_rgba_aos(gallivm, format_desc,
> -                                             lp_float32_vec4_type(),
> -                                             FALSE, map_ptr, offset1,
> -                                             zero, zero, NULL);
> +      lp_build_fetch_rgba_soa(gallivm, format_desc,
> +                              res_type, FALSE, map_ptr, offset,
> +                              blduivec.zero, blduivec.zero,
> +                              NULL, inputs);
> +
> +      for (i = 0; i < TGSI_NUM_CHANNELS; i++) {
> +         inputs[i] = LLVMBuildBitCast(builder, inputs[i],
> +                                      lp_build_vec_type(gallivm, vs_type), "");
> +      }
> +
> +   }

> +   else {

Let's kill the old code path.  The multitude of live code paths is more 
than enough.  No point in keeping additional dead code paths around.

> +      LLVMValueRef zero = LLVMConstNull(LLVMInt32TypeInContext(gallivm->context));
> +      LLVMValueRef aos_fetch[LP_MAX_VECTOR_WIDTH / 32];
> +      for (i = 0; i < vs_type.length; i++) {
> +         LLVMValueRef offset1, elem;
> +         elem = lp_build_const_int32(gallivm, i);
> +         offset1 = LLVMBuildExtractElement(builder, offset, elem, "");
> +
> +         aos_fetch[i] = lp_build_fetch_rgba_aos(gallivm, format_desc,
> +                                                lp_float32_vec4_type(),
> +                                                FALSE, map_ptr, offset1,
> +                                                zero, zero, NULL);
> +      }
> +      convert_to_soa(gallivm, aos_fetch, inputs, vs_type);
>     }
> -   convert_to_soa(gallivm, aos_fetch, inputs, vs_type);
>
>     for (i = 0; i < TGSI_NUM_CHANNELS; i++) {
>        inputs[i] = LLVMBuildBitCast(builder, inputs[i], blduivec.vec_type, "");
>

Reviewed-by: Jose Fonseca <jfonseca at vmware.com>
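
For readers who want to see the R16_UNORM case from the commit message in
concrete terms, here is a minimal scalar sketch in plain C. It is not the
gallivm code and uses no LLVM API: the function names, the VEC_LEN constant
and the plain float arrays standing in for 8-wide vectors are all assumptions
made for illustration. The first function models the old AoS path (convert
each vertex to an rgba float4, then transpose), the second models the SoA
path (gather the raw values, then convert the whole vector at once).

```c
#include <stdint.h>
#include <string.h>

#define VEC_LEN 8  /* assumed vector length, matching the 8-wide example */

/* Hypothetical model of the AoS path: one conversion per vertex,
 * producing rgba float4s, followed by an AoS -> SoA transpose. */
static void fetch_r16_unorm_aos(const uint8_t *map,
                                const uint32_t offsets[VEC_LEN],
                                float soa[4][VEC_LEN])
{
    float aos[VEC_LEN][4];
    for (int i = 0; i < VEC_LEN; i++) {
        uint16_t raw;
        memcpy(&raw, map + offsets[i], sizeof raw);   /* per-vertex fetch */
        aos[i][0] = (float)raw * (1.0f / 65535.0f);   /* per-vertex cvt + mul */
        aos[i][1] = 0.0f;                             /* missing channels: 0,0,1 */
        aos[i][2] = 0.0f;
        aos[i][3] = 1.0f;
    }
    for (int c = 0; c < 4; c++)                       /* the transpose */
        for (int i = 0; i < VEC_LEN; i++)
            soa[c][i] = aos[i][c];
}

/* Hypothetical model of the SoA path: gather the raw 16-bit values into
 * one vector, then do a single vector-wide conversion and multiply. */
static void fetch_r16_unorm_soa(const uint8_t *map,
                                const uint32_t offsets[VEC_LEN],
                                float soa[4][VEC_LEN])
{
    uint32_t packed[VEC_LEN];
    for (int i = 0; i < VEC_LEN; i++) {
        uint16_t raw;
        memcpy(&raw, map + offsets[i], sizeof raw);   /* fetch, insert into vector */
        packed[i] = raw;
    }
    for (int i = 0; i < VEC_LEN; i++) {
        /* This loop stands in for one 8-wide cvt and one 8-wide mul. */
        soa[0][i] = (float)packed[i] * (1.0f / 65535.0f);
        soa[1][i] = 0.0f;                             /* channels filled per vector */
        soa[2][i] = 0.0f;
        soa[3][i] = 1.0f;
    }
}
```

Both functions fill the same soa[channel][vertex] layout, so they can be
called on the same buffer and offsets and compared directly. The difference
is only in where the conversion happens, which is what the instruction
counts in the commit message are about.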