[Mesa-dev] [PATCH 4/4] i965: Implement dual-patch tessellation evaluation shaders.

Fri Feb 23 09:42:28 UTC 2018

Hi,

Could you also give dual & single shaders different names when dumping
them, e.g. something like the attached patch?

	- Eero

On 23.02.2018 10:36, Kenneth Graunke wrote:
> Normally, SIMD8 tessellation evaluation shaders operate on a single
> patch, with each channel operating on a different vertex within the
> patch.  For patch primitives with fewer than 8 vertices, this means
> that some of the channels are disabled, effectively wasting compute
> power.  This is a fairly common case.
> 
> To solve this, Skylake and later add "dual-patch" domain shader support.
> Dual-patch mode only supports PATCHLIST_1..4.  The first four channels
> process vertices in the first patch, while the second four channels
> process vertices from a second patch.  This can double the throughput.
> 
> Similar to pixel shader SIMD8 vs. SIMD16 handling, 3DSTATE_DS accepts
> two KSPs - one for single-patch mode, and one for dual-patch mode.  The
> hardware will dynamically dispatch whichever kernel makes sense.
> 
> The two shader modes are nearly identical.  The three differences are:
> 
> - There are two URB handles instead of one.
> - Dual-patch has an extra payload register (g1) for PrimitiveID.
>    (single-patch packs it in g0, but they couldn't fit two IDs there)
> - Pushed input data arrives interleaved rather than packed (in SIMD4x2
>    style rather than SIMD4x1 style).  For example, varyings 'a' and 'b'
>    would show up as follows:
> 
>    Single patch (SIMD4x1 style inputs):
>    g6: | b1.x | b1.y | b1.z | b1.w | a1.x | a1.y | a1.z | a1.w |
> 
>    Dual patch (SIMD4x2 style inputs):
>    g6: | a2.x | a2.y | a2.z | a2.w | a1.x | a1.y | a1.z | a1.w |
>    g7: | b2.x | b2.y | b2.z | b2.w | b1.x | b1.y | b1.z | b1.w |
> 
> This is fairly easy to adjust for - in fact, we can take the FS IR
> we already generated for single-patch mode and transform it to
> create a dual-patch shader.  We only need to repeat register allocation.
> 
> Our load_input handling for TES shaders always emits MOVs to expand
> a scalar input to a full SIMD8 register.  While these may be copy
> propagated away, it does mean all input file access will be a scalar
> <0,1,0> region.  (The FS backend in theory could recognize things
> like dot(input1, input2) and emit a vector DP4 operation, but it does
> not do such things today, nor is it likely to gain such support.)
> 
> Likewise, our URB access reads a single 32-bit URB handle from the
> payload, expanding it to 8 handles for the SIMD8 URB messages.  We
> can adjust the region from <0,1,0> to <1,4,0> on that MOV to replicate
> the first four times, and the second another four times.
> 
> Despite having a register for PrimitiveID, the documentation says
> dual-patch mode is not allowed when PrimitiveID is in use.  So we
> don't need to handle that.
> 
> This should be all that's required.
> 
> Improves performance in Synmark's Gl40TerrainFlyTess at 1024x768 on
> Apollolake 3x6 with single-channel RAM by 0.161727% +/- 0.0686365%
> (n=1062).
> ---
>   src/intel/compiler/brw_compiler.h             |  2 +
>   src/intel/compiler/brw_fs.cpp                 | 98 ++++++++++++++++++++++++++-
>   src/intel/compiler/brw_fs.h                   |  2 +
>   src/intel/compiler/brw_fs_visitor.cpp         |  1 +
>   src/intel/compiler/brw_shader.cpp             |  3 +
>   src/mesa/drivers/dri/i965/genX_state_upload.c |  7 ++
>   6 files changed, 112 insertions(+), 1 deletion(-)
> 
> This is not that impressive in itself, but it seems like it can only
> help...the hardware ought to automatically dispatch in dual mode if
> half the channels in an invocation were going to be wasted.
> 
> diff --git a/src/intel/compiler/brw_compiler.h b/src/intel/compiler/brw_compiler.h
> index b1086bbcee5..311cff5a33d 100644
> --- a/src/intel/compiler/brw_compiler.h
> +++ b/src/intel/compiler/brw_compiler.h
> @@ -986,6 +986,8 @@ struct brw_tes_prog_data
>      enum brw_tess_partitioning partitioning;
>      enum brw_tess_output_topology output_topology;
>      enum brw_tess_domain domain;
> +
> +   uint32_t prog_offset_dual_patch;
>   };
>   
>   struct brw_gs_prog_data
> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
> index bed632d21b9..02383280932 100644
> --- a/src/intel/compiler/brw_fs.cpp
> +++ b/src/intel/compiler/brw_fs.cpp
> @@ -6278,6 +6278,16 @@ fs_visitor::run_tcs_single_patch()
>      return !failed;
>   }
>   
> +static void
> +copy_fs_inst_list(void *mem_ctx, exec_list *dst, exec_list *src)
> +{
> +   dst->make_empty();
> +
> +   foreach_in_list(fs_inst, inst, src) {
> +      dst->push_tail(new(mem_ctx) fs_inst(*inst));
> +   }
> +}
> +
>   bool
>   fs_visitor::run_tes()
>   {
> @@ -6307,9 +6317,95 @@ fs_visitor::run_tes()
>      assign_tes_urb_setup();
>   
>      fixup_3src_null_dest();
> +
> +   /* The Skylake 3DSTATE_DS documentation says:
> +    *
> +    *    "SIMD8_SINGLE_OR_DUAL_PATCH must not be used if the domain shader
> +    *     kernel uses primitive id."
> +    *
> +    * So, don't bother compiling a dual-patch shader in that case.
> +    */
> +   cfg_t *single_patch_cfg = cfg;
> +   if (devinfo->gen >= 9 &&
> +       (nir->info.system_values_read & SYSTEM_BIT_PRIMITIVE_ID) == 0 &&
> +       !(INTEL_DEBUG & DEBUG_NO_DUAL_OBJECT_GS)) {
> +      dual_patch_cfg = new(mem_ctx) cfg_t(*cfg, copy_fs_inst_list);
> +   }
> +
>      allocate_registers(8, true);
>   
> -   return !failed;
> +   if (failed)
> +      return false;
> +
> +   if (dual_patch_cfg) {
> +      invalidate_live_intervals();
> +
> +      /* Restore the CFG prior to register allocation. */
> +      cfg = dual_patch_cfg;
> +      validate();
> +
> +      struct brw_vue_prog_data *vue_prog_data = brw_vue_prog_data(prog_data);
> +      int old_urb_start = payload.num_regs + prog_data->curb_read_length;
> +
> +      /* Remap payload register access. */
> +      foreach_block_and_inst(block, fs_inst, inst, dual_patch_cfg) {
> +         for (int i = 0; i < inst->sources; i++) {
> +            if (inst->src[i].file != FIXED_GRF ||
> +                inst->src[i].nr >= first_non_payload_grf)
> +               continue;
> +
> +            struct brw_reg &reg = inst->src[i].as_brw_reg();
> +
> +            if (reg.nr == 0 && reg.subnr == 0) {
> +               /* The patch URB handles are g0<1,4,0>, not g0<0,1,0>. */
> +               if (has_scalar_region(reg))
> +                  reg = stride(reg, 1, 4, 0);
> +            } else if (reg.nr >= 1 && reg.nr < old_urb_start) {
> +               /* gl_TessCoord, URB return handles, and push constants
> +                * move up one register to account for g1 being PrimitiveID.
> +                */
> +               reg.nr++;
> +            } else {
> +               assert(reg.nr >= old_urb_start);
> +               assert(has_scalar_region(reg));
> +               assert(reg.subnr / 16 <= 1);
> +
> +               reg = stride(reg, 16 / type_sz(reg.type), 4, 0);
> +
> +               int attr = reg.nr - old_urb_start;
> +               reg.nr = 1 + old_urb_start + (2 * attr) + (reg.subnr / 16);
> +               reg.subnr %= 16;
> +
> +               if (inst->is_3src(devinfo)) {
> +                  int c = reg.subnr / 4;
> +                  reg.swizzle = BRW_SWIZZLE4(c, c, c, c);
> +                  reg.subnr = 0;
> +               }
> +            }
> +         }
> +      }
> +
> +      /* Adjust the first non-payload GRF for the extra input registers
> +       * as well as the extra g1 (PrimitiveID) payload register.  This
> +       * ensures that the register allocator leaves those be.
> +       *
> +       * It would make sense to increment payload.num_regs for the extra
> +       * g1 (PrimitiveID) register, but we actually want to leave it alone
> +       * so prog_data->dispatch_grf_start_reg does not include it.  The
> +       * hardware automatically increments that value by 1 for dual-patch
> +       * shaders to account for it.
> +       */
> +      first_non_payload_grf += 1 + 8 * vue_prog_data->urb_read_length;
> +
> +      allocate_registers(8, true);
> +
> +      if (failed)
> +         dual_patch_cfg = NULL;
> +
> +      cfg = single_patch_cfg;
> +   }
> +
> +   return true;
>   }
>   
>   bool
> diff --git a/src/intel/compiler/brw_fs.h b/src/intel/compiler/brw_fs.h
> index 63373580ee4..6df0678ee0b 100644
> --- a/src/intel/compiler/brw_fs.h
> +++ b/src/intel/compiler/brw_fs.h
> @@ -374,6 +374,8 @@ public:
>   
>      unsigned promoted_constants;
>      brw::fs_builder bld;
> +
> +   cfg_t *dual_patch_cfg;
>   };
>   
>   /**
> diff --git a/src/intel/compiler/brw_fs_visitor.cpp b/src/intel/compiler/brw_fs_visitor.cpp
> index 7a5f6451f2b..e8bc8c6d255 100644
> --- a/src/intel/compiler/brw_fs_visitor.cpp
> +++ b/src/intel/compiler/brw_fs_visitor.cpp
> @@ -900,6 +900,7 @@ fs_visitor::init()
>   
>      this->grf_used = 0;
>      this->spilled_any_registers = false;
> +   this->dual_patch_cfg = NULL;
>   }
>   
>   fs_visitor::~fs_visitor()
> diff --git a/src/intel/compiler/brw_shader.cpp b/src/intel/compiler/brw_shader.cpp
> index 1df4f35cd8e..831f40c290e 100644
> --- a/src/intel/compiler/brw_shader.cpp
> +++ b/src/intel/compiler/brw_shader.cpp
> @@ -1264,6 +1264,9 @@ brw_compile_tes(const struct brw_compiler *compiler,
>   
>         g.generate_code(v.cfg, 8);
>   
> +      prog_data->prog_offset_dual_patch =
> +         v.dual_patch_cfg ? g.generate_code(v.dual_patch_cfg, 8) : 0;
> +
>         assembly = g.get_assembly(&prog_data->base.base.program_size);
>      } else {
>         brw::vec4_tes_visitor v(compiler, log_data, key, prog_data,
> diff --git a/src/mesa/drivers/dri/i965/genX_state_upload.c b/src/mesa/drivers/dri/i965/genX_state_upload.c
> index 8668abd591f..ceef11d46e1 100644
> --- a/src/mesa/drivers/dri/i965/genX_state_upload.c
> +++ b/src/mesa/drivers/dri/i965/genX_state_upload.c
> @@ -3984,6 +3984,13 @@ genX(upload_ds_state)(struct brw_context *brw)
>              ds.DispatchMode = DISPATCH_MODE_SIMD8_SINGLE_PATCH;
>           ds.UserClipDistanceCullTestEnableBitmask =
>               vue_prog_data->cull_distance_mask;
> +#endif
> +#if GEN_GEN >= 9
> +        if (tes_prog_data->prog_offset_dual_patch > 0) {
> +           ds.DispatchMode = DISPATCH_MODE_SIMD8_SINGLE_OR_DUAL_PATCH;
> +           ds.DUAL_PATCHKernelStartPointer = stage_state->prog_offset +
> +              tes_prog_data->prog_offset_dual_patch;
> +        }
>   #endif
>         }
>      }
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: dual-single-patch-shader-names.patch
Type: text/x-patch
Size: 1502 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20180223/4f6d10bd/attachment-0001.bin>