[Mesa-dev] [PATCH 11/11] HACK: nir/lower_vec_to_movs: Coalesce into destinations of fdot instructions

Wed Sep 9 17:50:14 PDT 2015

This is labeled HACK because it relies on the validator not validating
destination write-masks with respect to nir_op_infos[op].output_size and
relies on the fact that the i965 vec4 hardware does an implicit splat of
any dot-product results.  Technically, nir_op_fdotN produces a single
component that lives in the first slot and the others don't exist.
However, most hardware splats dot-products so this is probably reasonable.
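
To make the splat assumption concrete, here is a minimal sketch in plain C
of a dot product that writes its scalar result to every channel enabled in
a write-mask.  The names vec4_reg, dot_n and hw_dot_splat are hypothetical
and exist only for illustration; they are not NIR or i965 APIs, and the
real hardware behaviour is only modelled here, not quoted.

```c
#include <assert.h>

/* Hypothetical model of a 4-wide register; not a NIR or i965 type. */
typedef struct {
   float c[4];
} vec4_reg;

/* Scalar dot product over the first n components of a and b. */
static float
dot_n(const float *a, const float *b, unsigned n)
{
   float sum = 0.0f;
   for (unsigned i = 0; i < n; i++)
      sum += a[i] * b[i];
   return sum;
}

/* Assumed model of a splatting dot product: the single result is written
 * to every channel enabled in write_mask, and other channels are left
 * untouched.
 */
static void
hw_dot_splat(vec4_reg *dst, unsigned write_mask,
             const float *a, const float *b, unsigned n)
{
   assert(n <= 4 && write_mask <= 0xf);

   float result = dot_n(a, b, n);
   for (unsigned i = 0; i < 4; i++) {
      if (write_mask & (1u << i))
         dst->c[i] = result;
   }
}
```

Under this model, a dot product whose consumer reads the value from .z can
be emitted directly with a write-mask of 0x4 and needs no follow-up MOV,
which is the property the coalescing in this patch depends on.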

One way to do this "properly" would be to add a set of
nir_op_fdotN_replicated opcodes that do the splat and somehow lower to
those at the end of optimizations.  I don't think we want to have the
nir_op_fdot opcodes splat in SSA form because that could hurt our
opportunity for CSE.  However, the shader-db results below show that we
might want to have it splat for the purposes of coalescing.
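
As a rough sketch of what the late lowering to replicated opcodes could
look like, the mapping below rewrites scalar dot-product opcodes to
splatting variants and leaves everything else alone.  The enum and the
function are hypothetical stand-ins written for this note; no
nir_op_fdotN_replicated opcodes are added by this patch, and a real pass
would operate on nir_alu_instr rather than a bare enum.

```c
/* Hypothetical opcode enum; these are NOT the real nir_op values. */
enum sketch_op {
   SKETCH_OP_FADD,
   SKETCH_OP_FDOT2,
   SKETCH_OP_FDOT3,
   SKETCH_OP_FDOT4,
   SKETCH_OP_FDOT2_REPLICATED,
   SKETCH_OP_FDOT3_REPLICATED,
   SKETCH_OP_FDOT4_REPLICATED,
};

/* Map a scalar dot product to its (assumed) splatting counterpart.  This
 * would run after optimizations such as CSE, so those passes still see
 * the single-component form.
 */
static enum sketch_op
lower_to_replicated(enum sketch_op op)
{
   switch (op) {
   case SKETCH_OP_FDOT2: return SKETCH_OP_FDOT2_REPLICATED;
   case SKETCH_OP_FDOT3: return SKETCH_OP_FDOT3_REPLICATED;
   case SKETCH_OP_FDOT4: return SKETCH_OP_FDOT4_REPLICATED;
   default:              return op; /* not a dot product: unchanged */
   }
}
```

With something along these lines, try_coalesce could key off the
replicated opcodes instead of relying on the unvalidated write-mask.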

Shader-db results for vec4 programs on Haswell:

   total instructions in shared programs: 1778849 -> 1751223 (-1.55%)
   instructions in affected programs:     763104 -> 735478 (-3.62%)
   helped:                                7067
   HURT:                                  26

It turns out that dot-products matter...

Cc: Eduardo Lima Mitev <elima at igalia.com>
---
 src/glsl/nir/nir_lower_vec_to_movs.c | 47 ++++++++++++++++++++++++++----------
 1 file changed, 34 insertions(+), 13 deletions(-)

diff --git a/src/glsl/nir/nir_lower_vec_to_movs.c b/src/glsl/nir/nir_lower_vec_to_movs.c
index 0ebf3e3..1aa6add 100644
--- a/src/glsl/nir/nir_lower_vec_to_movs.c
+++ b/src/glsl/nir/nir_lower_vec_to_movs.c
@@ -84,6 +84,14 @@ insert_mov(nir_alu_instr *vec, unsigned start_idx, nir_shader *shader)
    return mov->dest.write_mask;
 }
 
+static bool
+is_fdot(nir_alu_instr *alu)
+{
+   return alu->op == nir_op_fdot2 ||
+          alu->op == nir_op_fdot3 ||
+          alu->op == nir_op_fdot4;
+}
+
 /* Attempts to coalesce the "move" from the given source of the vec to the
  * destination of the instruction generating the value. If, for whatever
  * reason, we cannot coalesce the mmove, it does nothing and returns 0.  We
@@ -121,19 +129,28 @@ try_coalesce(nir_alu_instr *vec, unsigned start_idx, nir_shader *shader)
    nir_alu_instr *src_alu =
       nir_instr_as_alu(vec->src[start_idx].src.ssa->parent_instr);
 
-   /* We only care about being able to re-swizzle the instruction if it is
-    * something that we can reswizzle.  It must be per-component.
-    */
-   if (nir_op_infos[src_alu->op].output_size != 0)
-      return 0;
-
-   /* If we are going to reswizzle the instruction, we can't have any
-    * non-per-component sources either.
-    */
-   for (unsigned j = 0; j < nir_op_infos[src_alu->op].num_inputs; j++)
-      if (nir_op_infos[src_alu->op].input_sizes[j] != 0)
+   if (is_fdot(src_alu)) {
+      /* The fdot instruction is special: It splats its result to all
+       * components.  This means that we can always rewrite its destination
+       * and we don't need to swizzle anything.
+       */
+   } else {
+      /* We only care about being able to re-swizzle the instruction if it is
+       * something that we can reswizzle.  It must be per-component.  The one
+       * exception to this is the fdotN instructions which implicitly splat
+       * their result out to all channels.
+       */
+      if (nir_op_infos[src_alu->op].output_size != 0)
          return 0;
 
+      /* If we are going to reswizzle the instruction, we can't have any
+       * non-per-component sources either.
+       */
+      for (unsigned j = 0; j < nir_op_infos[src_alu->op].num_inputs; j++)
+         if (nir_op_infos[src_alu->op].input_sizes[j] != 0)
+            return 0;
+   }
+
    /* Stash off all of the ALU instruction's swizzles. */
    uint8_t swizzles[4][4];
    for (unsigned j = 0; j < nir_op_infos[src_alu->op].num_inputs; j++)
@@ -153,8 +170,12 @@ try_coalesce(nir_alu_instr *vec, unsigned start_idx, nir_shader *shader)
        * instruction so we can re-swizzle that component to match.
        */
       write_mask |= 1 << i;
-      for (unsigned j = 0; j < nir_op_infos[src_alu->op].num_inputs; j++)
-         src_alu->src[j].swizzle[i] = swizzles[j][vec->src[i].swizzle[0]];
+      if (is_fdot(src_alu)) {
+         /* Since fdot splats, we don't need to do any reswizzling */
+      } else {
+         for (unsigned j = 0; j < nir_op_infos[src_alu->op].num_inputs; j++)
+            src_alu->src[j].swizzle[i] = swizzles[j][vec->src[i].swizzle[0]];
+      }
 
       /* Clear the no longer needed vec source */
       nir_instr_rewrite_src(&vec->instr, &vec->src[i].src, NIR_SRC_INIT);
-- 
2.5.0.400.gff86faf