[Mesa-dev] [PATCH] i965/vec4: Opportunistically coalesce SIMD8 instructions

Tue Feb 17 16:44:21 PST 2015

With scalar VS, many vertex shaders happen to line up in such a way that two
SIMD8 instructions can be collapsed into one SIMD16 instruction. For example,
the following two MOVs:
mov(8)          g124<1>F        g6<8,8,1>F                      { align1 1Q compacted };
mov(8)          g125<1>F        g7<8,8,1>F                      { align1 1Q compacted };

could be represented as a single MOV:
mov(16)         g124<1>F        g6<8,8,1>F                      { align1 1H compacted };

The basic algorithm is very simple. For two consecutive instructions, check
whether all source and destination registers are adjacent. If so, reuse the
first instruction by adjusting its compression bits, then remove the second
instruction. The caveat is that the 1Q->1H conversion shown above is
insufficient. As mentioned in the comments, the second quarter of the DMask is
invalid for us, so we must actually generate the following if possible:
mov(16)         g124<1>F        g6<8,8,1>F                      { align1 WE_all 1H compacted };
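
As a rough illustration of the pairing rule, here is a standalone sketch over a
simplified instruction model. It is not the driver code: the Inst, Operand and
File types and the adjacent() and coalesce() helpers are hypothetical names
made up for this example, and it ignores types, modifiers, predication and
control flow entirely.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-ins for the real IR (not Mesa's fs_inst).
enum class File { GRF, IMM };

struct Operand {
   File file;
   unsigned value;      // register number for GRF, raw bits for IMM
};

struct Inst {
   int opcode;
   unsigned exec_size;  // 8 or 16
   bool we_all;         // models the WE_all instruction option
   unsigned dst;        // destination register number
   std::vector<Operand> src;
};

// True if b does the same operation as a, one register further along.
static bool adjacent(const Inst &a, const Inst &b)
{
   if (a.opcode != b.opcode || a.src.size() != b.src.size())
      return false;
   if (a.dst + 1 != b.dst)
      return false;

   for (std::size_t i = 0; i < a.src.size(); i++) {
      const Operand &x = a.src[i];
      const Operand &y = b.src[i];
      if (x.file != y.file)
         return false;
      // Register sources must be consecutive; immediates must be identical.
      if (x.file == File::GRF && x.value + 1 != y.value)
         return false;
      if (x.file == File::IMM && x.value != y.value)
         return false;
   }
   return true;
}

// Fold each adjacent SIMD8 pair into one SIMD16 instruction with WE_all set.
static std::vector<Inst> coalesce(const std::vector<Inst> &in)
{
   std::vector<Inst> out;
   for (const Inst &inst : in) {
      if (!out.empty() && out.back().exec_size == 8 && inst.exec_size == 8 &&
          adjacent(out.back(), inst)) {
         out.back().exec_size = 16;
         out.back().we_all = true;
      } else {
         out.push_back(inst);
      }
   }
   return out;
}
```

Because a merged instruction is already SIMD16, a third consecutive MOV is not
folded into it; it stays SIMD8 and becomes the candidate for the next pair.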

The next step would be to try informing the instruction scheduler and register
allocator to make this happen more often. Anecdotally, this occurs most often
in the blit shader generated by meta, which always leaves things in good order
for us.

The scalar VS is only available on later platforms. This same thing could be
applied to the FS, but there we hope to be using SIMD16 already for most
instructions. It shouldn't hurt to throw this same optimization at the FS for
cases where we have to fall back though.

Cc: Kenneth Graunke <kenneth at whitecape.org>
Cc: Kristian Høgsberg <krh at bitplanet.net>
Signed-off-by: Ben Widawsky <ben at bwidawsk.net>
---

I have not had time to benchmark this much, nor to run piglit on it. I am just
sending it out before it bitrots too much further.

---

 src/mesa/drivers/dri/i965/brw_fs.cpp | 74 ++++++++++++++++++++++++++++++++++++
 src/mesa/drivers/dri/i965/brw_fs.h   |  1 +
 2 files changed, 75 insertions(+)

diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
index 200a494..cc21cdf 100644
--- a/src/mesa/drivers/dri/i965/brw_fs.cpp
+++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
@@ -3716,6 +3716,78 @@ fs_visitor::allocate_registers()
       prog_data->total_scratch = brw_get_scratch_size(last_scratch);
 }
 
+static bool
+is_ops_adjacent(fs_inst *a, fs_inst *b)
+{
+   if (a->opcode != b->opcode)
+      return false;
+
+   if (a->dst.reg != b->dst.reg - 1)
+      return false;
+
+   assert(a->sources == b->sources);
+
+   for (int i = 0; i < a->sources; i++) {
+      if (a->src[i].file != b->src[i].file)
+         return false;
+
+      if (a->src[i].file == HW_REG &&
+          (a->src[i].fixed_hw_reg.nr == b->src[i].fixed_hw_reg.nr - 1))
+         continue;
+      else if (a->src[i].file == GRF &&
+               (a->src[i].reg ==  b->src[i].reg - 1))
+         continue;
+      else if (a->src[i].file == IMM &&
+               a->src[i].fixed_hw_reg.dw1.ud == b->src[i].fixed_hw_reg.dw1.ud)
+         continue;
+
+      return false;
+   }
+
+   return true;
+}
+
+/* Try to upconvert a SIMD8 instruction into a fake SIMD16 instruction.
+ *
+ * If we have two operations in sequence, and they are using sequentially
+ * contiguous operands, the two SIMD8 instructions may be combined into 1 SIMD16
+ * instruction. For example:
+ * mov(8)          g124<1>F        g6<8,8,1>F
+ * mov(8)          g125<1>F        g7<8,8,1>F
+ *
+ * Is the same as:
+ * mov(16)         g124<1>F        g6<8,8,1>F
+ *
+ * This is trickier than it initially sounds. On the surface it sounds like a
+ * good idea to simply combine the instructions as shown above, and convert
+ * 1Q->1H. The main problem is that we're executing the shader with SIMD8 mode.
+ * This means that 1/4 of the DMask is useful, and the rest is junk. All we can
+ * do therefore is use WE_all if possible.
+ */
+void
+fs_visitor::instruction_coalesce()
+{
+   assert(dispatch_width == 8);
+   fs_inst *prev = NULL;
+
+   /* Predication still obeys WE_all, however control flow does not. For now,
+    * simply bail if the shader has control flow */
+   if (cfg->num_blocks > 1) {
+      perf_debug("Optimization skipped because of control flow in shader\n");
+      return;
+   }
+
+   foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
+      if (prev && (prev->exec_size == 8 && inst->exec_size == 8) &&
+          is_ops_adjacent(prev, inst)) {
+         prev->exec_size = 16;
+         prev->force_writemask_all = true;
+         inst->remove(block);
+      } else
+         prev = inst;
+   }
+}
+
 bool
 fs_visitor::run_vs()
 {
@@ -3746,6 +3818,8 @@ fs_visitor::run_vs()
    fixup_3src_null_dest();
    allocate_registers();
 
+   instruction_coalesce();
+
    return !failed;
 }
 
diff --git a/src/mesa/drivers/dri/i965/brw_fs.h b/src/mesa/drivers/dri/i965/brw_fs.h
index b95e2c0..8be5069 100644
--- a/src/mesa/drivers/dri/i965/brw_fs.h
+++ b/src/mesa/drivers/dri/i965/brw_fs.h
@@ -456,6 +456,7 @@ public:
                                  exec_list *acp);
    bool opt_register_renaming();
    bool register_coalesce();
+   void instruction_coalesce();
    bool compute_to_mrf();
    bool dead_code_eliminate();
    bool remove_duplicate_mrf_writes();
-- 
2.3.0