[Mesa-dev] [PATCH 3/3] gallivm: optimize gather a bit, by using supplied destination type

sroland at vmware.com
Sat Dec 3 16:12:43 UTC 2016


From: Roland Scheidegger <sroland at vmware.com>

By using a dst_type in the gather interface, gather has some more
knowledge about how values should be fetched.
E.g. if this is a 3x32bit fetch and dst_type is a 4x32bit vector, gather
will no longer do a ZExt of a 96bit scalar value to 128bit, but just
fetch the 96bit as a 3x32bit vector (this is still going to be 2 loads
of course, but the loads can be done directly into a simd vector that
way).
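
For illustration only - this is not part of the patch - a rough sketch
of what the vector path amounts to for such a 3x32bit fetch, written
against the LLVM C API of this era; the helper name and parameters are
made up:

/* Illustrative sketch only: fetch 96 bits as three 32bit element loads
 * inserted straight into a <4 x i32> vector, instead of loading an i96
 * scalar and zero-extending it to i128. */
#include <llvm-c/Core.h>

static LLVMValueRef
fetch_96bit_as_vec4_i32(LLVMContextRef ctx, LLVMBuilderRef b,
                        LLVMValueRef base_i8_ptr)
{
   LLVMTypeRef i32_type = LLVMInt32TypeInContext(ctx);
   LLVMTypeRef vec_type = LLVMVectorType(i32_type, 4);
   LLVMValueRef res = LLVMGetUndef(vec_type);
   unsigned i;

   for (i = 0; i < 3; i++) {
      LLVMValueRef off = LLVMConstInt(i32_type, i * 4, 0);
      LLVMValueRef ptr = LLVMBuildGEP(b, base_i8_ptr, &off, 1, "");
      LLVMValueRef elem;
      ptr = LLVMBuildBitCast(b, ptr, LLVMPointerType(i32_type, 0), "");
      elem = LLVMBuildLoad(b, ptr, "");
      res = LLVMBuildInsertElement(b, res, elem,
                                   LLVMConstInt(i32_type, i, 0), "");
   }
   /* lane 3 stays undef; llvm will typically merge the element loads */
   return res;
}
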
Also, we can now try to use the right int/float type. This should make
no difference really since there are typically no domain transition
penalties for such simd loads, however it actually makes a difference
because llvm will use different shuffle lowering afterwards, so the
caller can use this to trick llvm into emitting sane shuffles (and yes,
llvm is really stupid there - nothing against using the shuffle
instruction from the correct domain, but not at the cost of doing 3
times more shuffles; the case which actually matters is the refusal to
use shufps for integer values).
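
Again purely illustrative (not patch code): the kind of element type
selection this enables, with a made-up is_float flag standing in for
the information carried by dst_type:

/* Illustrative sketch only: pick the gather element type from the
 * destination type; a caller can request float here so that later
 * shufflevectors are lowered with shuffles from the float domain
 * (shufps) rather than a longer sequence of integer shuffles. */
#include <llvm-c/Core.h>

static LLVMTypeRef
gather_elem_type(LLVMContextRef ctx, unsigned dst_width, int is_float)
{
   if (is_float) {
      if (dst_width == 32)
         return LLVMFloatTypeInContext(ctx);
      if (dst_width == 64)
         return LLVMDoubleTypeInContext(ctx);
   }
   return LLVMIntTypeInContext(ctx, dst_width);
}
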
Also make some attempt to avoid things which look great on paper but
which llvm doesn't really handle well (e.g. fetching 3-element 8bit and
16bit vectors, which is simply disastrous - I suspect the type
legalizer is to blame, trying to extend these vectors to 128bit types
somehow - so these are still fetched with scalars as before, which is
suboptimal due to the ZExt).

Remove the ability to truncate (no point, this is gather, not
conversion) as the code is complex enough already.

While here, also implement not just the float but also the 64bit avx2
gathers (disabled though, since based on the theoretical numbers the
benefit just isn't there at all until Skylake at least).
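
The disabled avx2 path itself is not in the hunk below; purely as an
illustration, here is a rough sketch of what emitting the 64bit gather
through the llvm.x86.avx2.gather.d.pd.256 intrinsic could look like.
The helper and its argument handling are assumptions, not the actual
(disabled) code:

/* Illustrative sketch only: emit the AVX2 64bit gather through the
 * x86 intrinsic
 *   <4 x double> @llvm.x86.avx2.gather.d.pd.256(<4 x double> src,
 *      i8* base, <4 x i32> idx, <4 x double> mask, i8 scale)
 * with all lanes enabled and byte offsets (scale 1). */
#include <llvm-c/Core.h>

static LLVMValueRef
emit_avx2_gather_pd_256(LLVMContextRef ctx, LLVMModuleRef mod,
                        LLVMBuilderRef b, LLVMValueRef base_i8_ptr,
                        LLVMValueRef byte_offsets_v4i32)
{
   const char *name = "llvm.x86.avx2.gather.d.pd.256";
   LLVMTypeRef i8_type = LLVMInt8TypeInContext(ctx);
   LLVMTypeRef i32_type = LLVMInt32TypeInContext(ctx);
   LLVMTypeRef i64_type = LLVMInt64TypeInContext(ctx);
   LLVMTypeRef f64_type = LLVMDoubleTypeInContext(ctx);
   LLVMTypeRef v4f64 = LLVMVectorType(f64_type, 4);
   LLVMTypeRef v4i32 = LLVMVectorType(i32_type, 4);
   LLVMTypeRef v4i64 = LLVMVectorType(i64_type, 4);
   LLVMTypeRef arg_types[5];
   LLVMTypeRef func_type;
   LLVMValueRef func, mask, args[5];

   arg_types[0] = v4f64;                       /* src (ignored, mask is all ones) */
   arg_types[1] = LLVMPointerType(i8_type, 0); /* base pointer */
   arg_types[2] = v4i32;                       /* per-lane indices */
   arg_types[3] = v4f64;                       /* mask */
   arg_types[4] = i8_type;                     /* scale */
   func_type = LLVMFunctionType(v4f64, arg_types, 5, 0);

   func = LLVMGetNamedFunction(mod, name);
   if (!func)
      func = LLVMAddFunction(mod, name, func_type);

   /* all-ones mask, built as an integer vector and bitcast to fp */
   mask = LLVMConstBitCast(LLVMConstAllOnes(v4i64), v4f64);

   args[0] = LLVMGetUndef(v4f64);
   args[1] = base_i8_ptr;
   args[2] = byte_offsets_v4i32;
   args[3] = mask;
   args[4] = LLVMConstInt(i8_type, 1, 0);      /* offsets already in bytes */

   return LLVMBuildCall(b, func, args, 5, "");
}
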
---
 src/gallium/auxiliary/gallivm/lp_bld_gather.c | 42 +++++++++++++++++++++++++--
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/src/gallium/auxiliary/gallivm/lp_bld_gather.c b/src/gallium/auxiliary/gallivm/lp_bld_gather.c
index 439bbb6..1f7ba92 100644
--- a/src/gallium/auxiliary/gallivm/lp_bld_gather.c
+++ b/src/gallium/auxiliary/gallivm/lp_bld_gather.c
@@ -33,6 +33,7 @@
 #include "lp_bld_format.h"
 #include "lp_bld_gather.h"
 #include "lp_bld_swizzle.h"
+#include "lp_bld_type.h"
 #include "lp_bld_init.h"
 #include "lp_bld_intr.h"
 
@@ -270,17 +271,52 @@ lp_build_gather(struct gallivm_state *gallivm,
 
       LLVMTypeRef dst_elem_type = LLVMIntTypeInContext(gallivm->context, dst_width);
       LLVMTypeRef dst_vec_type = LLVMVectorType(dst_elem_type, length);
+      LLVMTypeRef gather_vec_type = dst_vec_type;
       unsigned i;
-
-      res = LLVMGetUndef(dst_vec_type);
+      boolean vec_zext = FALSE;
+      unsigned gather_width = dst_width;
+
+
+      if (src_width == 16 && dst_width == 32) {
+         LLVMTypeRef g_elem_type = LLVMIntTypeInContext(gallivm->context, dst_width / 2);
+         gather_vec_type = LLVMVectorType(g_elem_type, length);
+         /*
+          * Note that llvm is never able to optimize zext/insert combos
+          * directly (i.e. zero the simd reg, then place the elements into
+          * the appropriate place directly). And 16->32bit zext simd loads
+          * aren't possible (instead loading to scalar reg first).
+          * (I think this has to do with scalar/vector transition.)
+          * No idea about other archs...
+          * We could do this manually, but instead we just use a vector
+          * zext, which is simple enough (and, in fact, llvm might optimize
+          * this away).
+          * (We're not trying that with other bit widths as that might not be
+          * easier, in particular with 8 bit values at least with only sse2.)
+          */
+         vec_zext = TRUE;
+         gather_width = 16;
+      }
+      res = LLVMGetUndef(gather_vec_type);
       for (i = 0; i < length; ++i) {
          LLVMValueRef index = lp_build_const_int32(gallivm, i);
          LLVMValueRef elem;
          elem = lp_build_gather_elem(gallivm, length,
-                                     src_width, dst_width, aligned,
+                                     src_width, gather_width, aligned,
                                      base_ptr, offsets, i, vector_justify);
          res = LLVMBuildInsertElement(gallivm->builder, res, elem, index, "");
       }
+      if (vec_zext) {
+         res = LLVMBuildZExt(gallivm->builder, res, dst_vec_type, "");
+         if (vector_justify) {
+#if PIPE_ARCH_BIG_ENDIAN
+            struct lp_type dst_type;
+            unsigned sv = dst_width - src_width;
+            dst_type = lp_type_uint_vec(dst_width, dst_width * length);
+            res = LLVMBuildShl(gallivm->builder, res,
+                               lp_build_const_int_vec(gallivm, dst_type, sv), "");
+#endif
+         }
+      }
    }
 
    return res;
-- 
2.7.4


