[Mesa-dev] [PATCH 4/4] i965: Implement dual-patch tessellation evaluation shaders.

Fri Feb 23 08:36:15 UTC 2018

Normally, SIMD8 tessellation evaluation shaders operate on a single
patch, with each channel operating on a different vertex within the
patch.  For patch primitives with fewer than 8 vertices, this means
that some of the channels are disabled, effectively wasting compute
power.  This is a fairly common case.

To solve this, Skylake and later add "dual-patch" domain shader support.
Dual-patch mode only supports PATCHLIST_1..4.  The first four channels
process vertices in the first patch, while the second four channels
process vertices from a second patch.  This can double the throughput.

Similar to pixel shader SIMD8 vs. SIMD16 handling, 3DSTATE_DS accepts
two KSPs - one for single-patch mode, and one for dual-patch mode.  The
hardware will dynamically dispatch whichever kernel makes sense.

The two shader modes are nearly identical.  The three differences are:

- There are two URB handles instead of one.
- Dual-patch has an extra payload register (g1) for PrimitiveID.
  (single-patch packs it in g0, but they couldn't fit two IDs there)
- Pushed input data arrives interleaved rather than packed (in SIMD4x2
  style rather than SIMD4x1 style).  For example, varyings 'a' and 'b'
  would show up as follows:

  Single patch (SIMD4x1 style inputs):
  g6: | b1.x | b1.y | b1.z | b1.w | a1.x | a1.y | a1.z | a1.w |

  Dual patch (SIMD4x2 style inputs):
  g6: | a2.x | a2.y | a2.z | a2.w | a1.x | a1.y | a1.z | a1.w |
  g7: | b2.x | b2.y | b2.z | b2.w | b1.x | b1.y | b1.z | b1.w |

This is fairly easy to adjust for - in fact, we can take the FS IR
we already generated for single-patch mode and transform it to
create a dual-patch shader.  We only need to repeat register allocation.

Our load_input handling for TES shaders always emits MOVs to expand
a scalar input to a full SIMD8 register.  While these may be copy
propagated away, it does mean all input file access will be a scalar
<0,1,0> region.  (The FS backend in theory could recognize things
like dot(input1, input2) and emit a vector DP4 operation, but it does
not do such things today, nor is it likely to gain such support.)

Likewise, our URB access reads a single 32-bit URB handle from the
payload, expanding it to 8 handles for the SIMD8 URB messages.  We
can adjust the region from <0,1,0> to <1,4,0> on that MOV to replicate
the first four times, and the second another four times.

Despite having a register for PrimitiveID, the documentation says
dual-patch mode is not allowed when PrimitiveID is in use.  So we
don't need to handle that.

This should be all that's required.

Improves performance in Synmark's Gl40TerrainFlyTess at 1024x768 on
Apollolake 3x6 with single-channel RAM by 0.161727% +/- 0.0686365%
(n=1062).
---
 src/intel/compiler/brw_compiler.h             |  2 +
 src/intel/compiler/brw_fs.cpp                 | 98 ++++++++++++++++++++++++++-
 src/intel/compiler/brw_fs.h                   |  2 +
 src/intel/compiler/brw_fs_visitor.cpp         |  1 +
 src/intel/compiler/brw_shader.cpp             |  3 +
 src/mesa/drivers/dri/i965/genX_state_upload.c |  7 ++
 6 files changed, 112 insertions(+), 1 deletion(-)

This is not that impressive in itself, but it seems like it can only
help...the hardware ought to automatically dispatch in dual mode if
half the channels in an invocation were going to be wasted.

diff --git a/src/intel/compiler/brw_compiler.h b/src/intel/compiler/brw_compiler.h
index b1086bbcee5..311cff5a33d 100644
--- a/src/intel/compiler/brw_compiler.h
+++ b/src/intel/compiler/brw_compiler.h
@@ -986,6 +986,8 @@ struct brw_tes_prog_data
    enum brw_tess_partitioning partitioning;
    enum brw_tess_output_topology output_topology;
    enum brw_tess_domain domain;
+
+   uint32_t prog_offset_dual_patch;
 };
 
 struct brw_gs_prog_data
diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
index bed632d21b9..02383280932 100644
--- a/src/intel/compiler/brw_fs.cpp
+++ b/src/intel/compiler/brw_fs.cpp
@@ -6278,6 +6278,16 @@ fs_visitor::run_tcs_single_patch()
    return !failed;
 }
 
+static void
+copy_fs_inst_list(void *mem_ctx, exec_list *dst, exec_list *src)
+{
+   dst->make_empty();
+
+   foreach_in_list(fs_inst, inst, src) {
+      dst->push_tail(new(mem_ctx) fs_inst(*inst));
+   }
+}
+
 bool
 fs_visitor::run_tes()
 {
@@ -6307,9 +6317,95 @@ fs_visitor::run_tes()
    assign_tes_urb_setup();
 
    fixup_3src_null_dest();
+
+   /* The Skylake 3DSTATE_DS documentation says:
+    *
+    *    "SIMD8_SINGLE_OR_DUAL_PATCH must not be used if the domain shader
+    *     kernel uses primitive id."
+    *
+    * So, don't bother compiling a dual-patch shader in that case.
+    */
+   cfg_t *single_patch_cfg = cfg;
+   if (devinfo->gen >= 9 &&
+       (nir->info.system_values_read & SYSTEM_BIT_PRIMITIVE_ID) == 0 &&
+       !(INTEL_DEBUG & DEBUG_NO_DUAL_OBJECT_GS)) {
+      dual_patch_cfg = new(mem_ctx) cfg_t(*cfg, copy_fs_inst_list);
+   }
+
    allocate_registers(8, true);
 
-   return !failed;
+   if (failed)
+      return false;
+
+   if (dual_patch_cfg) {
+      invalidate_live_intervals();
+
+      /* Restore the CFG prior to register allocation. */
+      cfg = dual_patch_cfg;
+      validate();
+
+      struct brw_vue_prog_data *vue_prog_data = brw_vue_prog_data(prog_data);
+      int old_urb_start = payload.num_regs + prog_data->curb_read_length;
+
+      /* Remap payload register access. */
+      foreach_block_and_inst(block, fs_inst, inst, dual_patch_cfg) {
+         for (int i = 0; i < inst->sources; i++) {
+            if (inst->src[i].file != FIXED_GRF ||
+                inst->src[i].nr >= first_non_payload_grf)
+               continue;
+
+            struct brw_reg &reg = inst->src[i].as_brw_reg();
+
+            if (reg.nr == 0 && reg.subnr == 0) {
+               /* The patch URB handles are g0<1,4,0>, not g0<0,1,0>. */
+               if (has_scalar_region(reg))
+                  reg = stride(reg, 1, 4, 0);
+            } else if (reg.nr >= 1 && reg.nr < old_urb_start) {
+               /* gl_TessCoord, URB return handles, and push constants
+                * move up one register to account for g1 being PrimitiveID.
+                */
+               reg.nr++;
+            } else {
+               assert(reg.nr >= old_urb_start);
+               assert(has_scalar_region(reg));
+               assert(reg.subnr / 16 <= 1);
+
+               reg = stride(reg, 16 / type_sz(reg.type), 4, 0);
+
+               int attr = reg.nr - old_urb_start;
+               reg.nr = 1 + old_urb_start + (2 * attr) + (reg.subnr / 16);
+               reg.subnr %= 16;
+
+               if (inst->is_3src(devinfo)) {
+                  int c = reg.subnr / 4;
+                  reg.swizzle = BRW_SWIZZLE4(c, c, c, c);
+                  reg.subnr = 0;
+               }
+            }
+         }
+      }
+
+      /* Adjust the first non-payload GRF for the extra input registers
+       * as well as the extra g1 (PrimitiveID) payload register.  This
+       * ensures that the register allocator leaves those be.
+       *
+       * It would make sense to increment payload.num_regs for the extra
+       * g1 (PrimitiveID) register, but we actually want to leave it alone
+       * so prog_data->dispatch_grf_start_reg does not include it.  The
+       * hardware automatically increments that value by 1 for dual-patch
+       * shaders to account for it.
+       */
+      first_non_payload_grf += 1 + 8 * vue_prog_data->urb_read_length;
+
+      allocate_registers(8, true);
+
+      if (failed)
+         dual_patch_cfg = NULL;
+
+      cfg = single_patch_cfg;
+   }
+
+   return true;
 }
 
 bool
diff --git a/src/intel/compiler/brw_fs.h b/src/intel/compiler/brw_fs.h
index 63373580ee4..6df0678ee0b 100644
--- a/src/intel/compiler/brw_fs.h
+++ b/src/intel/compiler/brw_fs.h
@@ -374,6 +374,8 @@ public:
 
    unsigned promoted_constants;
    brw::fs_builder bld;
+
+   cfg_t *dual_patch_cfg;
 };
 
 /**
diff --git a/src/intel/compiler/brw_fs_visitor.cpp b/src/intel/compiler/brw_fs_visitor.cpp
index 7a5f6451f2b..e8bc8c6d255 100644
--- a/src/intel/compiler/brw_fs_visitor.cpp
+++ b/src/intel/compiler/brw_fs_visitor.cpp
@@ -900,6 +900,7 @@ fs_visitor::init()
 
    this->grf_used = 0;
    this->spilled_any_registers = false;
+   this->dual_patch_cfg = NULL;
 }
 
 fs_visitor::~fs_visitor()
diff --git a/src/intel/compiler/brw_shader.cpp b/src/intel/compiler/brw_shader.cpp
index 1df4f35cd8e..831f40c290e 100644
--- a/src/intel/compiler/brw_shader.cpp
+++ b/src/intel/compiler/brw_shader.cpp
@@ -1264,6 +1264,9 @@ brw_compile_tes(const struct brw_compiler *compiler,
 
       g.generate_code(v.cfg, 8);
 
+      prog_data->prog_offset_dual_patch =
+         v.dual_patch_cfg ? g.generate_code(v.dual_patch_cfg, 8) : 0;
+
       assembly = g.get_assembly(&prog_data->base.base.program_size);
    } else {
       brw::vec4_tes_visitor v(compiler, log_data, key, prog_data,
diff --git a/src/mesa/drivers/dri/i965/genX_state_upload.c b/src/mesa/drivers/dri/i965/genX_state_upload.c
index 8668abd591f..ceef11d46e1 100644
--- a/src/mesa/drivers/dri/i965/genX_state_upload.c
+++ b/src/mesa/drivers/dri/i965/genX_state_upload.c
@@ -3984,6 +3984,13 @@ genX(upload_ds_state)(struct brw_context *brw)
            ds.DispatchMode = DISPATCH_MODE_SIMD8_SINGLE_PATCH;
         ds.UserClipDistanceCullTestEnableBitmask =
             vue_prog_data->cull_distance_mask;
+#endif
+#if GEN_GEN >= 9
+        if (tes_prog_data->prog_offset_dual_patch > 0) {
+           ds.DispatchMode = DISPATCH_MODE_SIMD8_SINGLE_OR_DUAL_PATCH;
+           ds.DUAL_PATCHKernelStartPointer = stage_state->prog_offset +
+              tes_prog_data->prog_offset_dual_patch;
+        }
 #endif
       }
    }
-- 
2.16.1