[Mesa-dev] [PATCH 4/4] i965: Implement dual-patch tessellation evaluation shaders.
Eero Tamminen
eero.t.tamminen at intel.com
Fri Feb 23 09:42:28 UTC 2018
Hi,
Could you also give dual & single shaders different names when dumping
them, e.g. something like the attached patch?
- Eero
On 23.02.2018 10:36, Kenneth Graunke wrote:
> Normally, SIMD8 tessellation evaluation shaders operate on a single
> patch, with each channel operating on a different vertex within the
> patch. For patch primitives with fewer than 8 vertices, this means
> that some of the channels are disabled, effectively wasting compute
> power. This is a fairly common case.
>
> To solve this, Skylake and later add "dual-patch" domain shader support.
> Dual-patch mode only supports PATCHLIST_1..4. The first four channels
> process vertices in the first patch, while the second four channels
> process vertices from a second patch. This can double the throughput.
>
> Similar to pixel shader SIMD8 vs. SIMD16 handling, 3DSTATE_DS accepts
> two KSPs - one for single-patch mode, and one for dual-patch mode. The
> hardware will dynamically dispatch whichever kernel makes sense.
>
> The two shader modes are nearly identical. The three differences are:
>
> - There are two URB handles instead of one.
> - Dual-patch has an extra payload register (g1) for PrimitiveID.
> (single-patch packs it in g0, but they couldn't fit two IDs there)
> - Pushed input data arrives interleaved rather than packed (in SIMD4x2
> style rather than SIMD4x1 style). For example, varyings 'a' and 'b'
> would show up as follows:
>
> Single patch (SIMD4x1 style inputs):
> g6: | b1.x | b1.y | b1.z | b1.w | a1.x | a1.y | a1.z | a1.w |
>
> Dual patch (SIMD4x2 style inputs):
> g6: | a2.x | a2.y | a2.z | a2.w | a1.x | a1.y | a1.z | a1.w |
> g7: | b2.x | b2.y | b2.z | b2.w | b1.x | b1.y | b1.z | b1.w |
>
> This is fairly easy to adjust for - in fact, we can take the FS IR
> we already generated for single-patch mode and transform it to
> create a dual-patch shader. We only need to repeat register allocation.
>
> Our load_input handling for TES shaders always emits MOVs to expand
> a scalar input to a full SIMD8 register. While these may be copy
> propagated away, it does mean all input file access will be a scalar
> <0,1,0> region. (The FS backend in theory could recognize things
> like dot(input1, input2) and emit a vector DP4 operation, but it does
> not do such things today, nor is it likely to gain such support.)
>
> Likewise, our URB access reads a single 32-bit URB handle from the
> payload, expanding it to 8 handles for the SIMD8 URB messages. We
> can adjust the region from <0,1,0> to <1,4,0> on that MOV to replicate
> the first four times, and the second another four times.
>
> Despite having a register for PrimitiveID, the documentation says
> dual-patch mode is not allowed when PrimitiveID is in use. So we
> don't need to handle that.
>
> This should be all that's required.
>
> Improves performance in Synmark's Gl40TerrainFlyTess at 1024x768 on
> Apollolake 3x6 with single-channel RAM by 0.161727% +/- 0.0686365%
> (n=1062).
> ---
> src/intel/compiler/brw_compiler.h | 2 +
> src/intel/compiler/brw_fs.cpp | 98 ++++++++++++++++++++++++++-
> src/intel/compiler/brw_fs.h | 2 +
> src/intel/compiler/brw_fs_visitor.cpp | 1 +
> src/intel/compiler/brw_shader.cpp | 3 +
> src/mesa/drivers/dri/i965/genX_state_upload.c | 7 ++
> 6 files changed, 112 insertions(+), 1 deletion(-)
>
> This is not that impressive in itself, but it seems like it can only
> help...the hardware ought to automatically dispatch in dual mode if
> half the channels in an invocation were going to be wasted.
>
> diff --git a/src/intel/compiler/brw_compiler.h b/src/intel/compiler/brw_compiler.h
> index b1086bbcee5..311cff5a33d 100644
> --- a/src/intel/compiler/brw_compiler.h
> +++ b/src/intel/compiler/brw_compiler.h
> @@ -986,6 +986,8 @@ struct brw_tes_prog_data
> enum brw_tess_partitioning partitioning;
> enum brw_tess_output_topology output_topology;
> enum brw_tess_domain domain;
> +
> + uint32_t prog_offset_dual_patch;
> };
>
> struct brw_gs_prog_data
> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
> index bed632d21b9..02383280932 100644
> --- a/src/intel/compiler/brw_fs.cpp
> +++ b/src/intel/compiler/brw_fs.cpp
> @@ -6278,6 +6278,16 @@ fs_visitor::run_tcs_single_patch()
> return !failed;
> }
>
> +static void
> +copy_fs_inst_list(void *mem_ctx, exec_list *dst, exec_list *src)
> +{
> + dst->make_empty();
> +
> + foreach_in_list(fs_inst, inst, src) {
> + dst->push_tail(new(mem_ctx) fs_inst(*inst));
> + }
> +}
> +
> bool
> fs_visitor::run_tes()
> {
> @@ -6307,9 +6317,95 @@ fs_visitor::run_tes()
> assign_tes_urb_setup();
>
> fixup_3src_null_dest();
> +
> + /* The Skylake 3DSTATE_DS documentation says:
> + *
> + * "SIMD8_SINGLE_OR_DUAL_PATCH must not be used if the domain shader
> + * kernel uses primitive id."
> + *
> + * So, don't bother compiling a dual-patch shader in that case.
> + */
> + cfg_t *single_patch_cfg = cfg;
> + if (devinfo->gen >= 9 &&
> + (nir->info.system_values_read & SYSTEM_BIT_PRIMITIVE_ID) == 0 &&
> + !(INTEL_DEBUG & DEBUG_NO_DUAL_OBJECT_GS)) {
> + dual_patch_cfg = new(mem_ctx) cfg_t(*cfg, copy_fs_inst_list);
> + }
> +
> allocate_registers(8, true);
>
> - return !failed;
> + if (failed)
> + return false;
> +
> + if (dual_patch_cfg) {
> + invalidate_live_intervals();
> +
> + /* Restore the CFG prior to register allocation. */
> + cfg = dual_patch_cfg;
> + validate();
> +
> + struct brw_vue_prog_data *vue_prog_data = brw_vue_prog_data(prog_data);
> + int old_urb_start = payload.num_regs + prog_data->curb_read_length;
> +
> + /* Remap payload register access. */
> + foreach_block_and_inst(block, fs_inst, inst, dual_patch_cfg) {
> + for (int i = 0; i < inst->sources; i++) {
> + if (inst->src[i].file != FIXED_GRF ||
> + inst->src[i].nr >= first_non_payload_grf)
> + continue;
> +
> + struct brw_reg ® = inst->src[i].as_brw_reg();
> +
> + if (reg.nr == 0 && reg.subnr == 0) {
> + /* The patch URB handles are g0<1,4,0>, not g0<0,1,0>. */
> + if (has_scalar_region(reg))
> + reg = stride(reg, 1, 4, 0);
> + } else if (reg.nr >= 1 && reg.nr < old_urb_start) {
> + /* gl_TessCoord, URB return handles, and push constants
> + * move up one register to account for g1 being PrimitiveID.
> + */
> + reg.nr++;
> + } else {
> + assert(reg.nr >= old_urb_start);
> + assert(has_scalar_region(reg));
> + assert(reg.subnr / 16 <= 1);
> +
> + reg = stride(reg, 16 / type_sz(reg.type), 4, 0);
> +
> + int attr = reg.nr - old_urb_start;
> + reg.nr = 1 + old_urb_start + (2 * attr) + (reg.subnr / 16);
> + reg.subnr %= 16;
> +
> + if (inst->is_3src(devinfo)) {
> + int c = reg.subnr / 4;
> + reg.swizzle = BRW_SWIZZLE4(c, c, c, c);
> + reg.subnr = 0;
> + }
> + }
> + }
> + }
> +
> + /* Adjust the first non-payload GRF for the extra input registers
> + * as well as the extra g1 (PrimitiveID) payload register. This
> + * ensures that the register allocator leaves those be.
> + *
> + * It would make sense to increment payload.num_regs for the extra
> + * g1 (PrimitiveID) register, but we actually want to leave it alone
> + * so prog_data->dispatch_grf_start_reg does not include it. The
> + * hardware automatically increments that value by 1 for dual-patch
> + * shaders to account for it.
> + */
> + first_non_payload_grf += 1 + 8 * vue_prog_data->urb_read_length;
> +
> + allocate_registers(8, true);
> +
> + if (failed)
> + dual_patch_cfg = NULL;
> +
> + cfg = single_patch_cfg;
> + }
> +
> + return true;
> }
>
> bool
> diff --git a/src/intel/compiler/brw_fs.h b/src/intel/compiler/brw_fs.h
> index 63373580ee4..6df0678ee0b 100644
> --- a/src/intel/compiler/brw_fs.h
> +++ b/src/intel/compiler/brw_fs.h
> @@ -374,6 +374,8 @@ public:
>
> unsigned promoted_constants;
> brw::fs_builder bld;
> +
> + cfg_t *dual_patch_cfg;
> };
>
> /**
> diff --git a/src/intel/compiler/brw_fs_visitor.cpp b/src/intel/compiler/brw_fs_visitor.cpp
> index 7a5f6451f2b..e8bc8c6d255 100644
> --- a/src/intel/compiler/brw_fs_visitor.cpp
> +++ b/src/intel/compiler/brw_fs_visitor.cpp
> @@ -900,6 +900,7 @@ fs_visitor::init()
>
> this->grf_used = 0;
> this->spilled_any_registers = false;
> + this->dual_patch_cfg = NULL;
> }
>
> fs_visitor::~fs_visitor()
> diff --git a/src/intel/compiler/brw_shader.cpp b/src/intel/compiler/brw_shader.cpp
> index 1df4f35cd8e..831f40c290e 100644
> --- a/src/intel/compiler/brw_shader.cpp
> +++ b/src/intel/compiler/brw_shader.cpp
> @@ -1264,6 +1264,9 @@ brw_compile_tes(const struct brw_compiler *compiler,
>
> g.generate_code(v.cfg, 8);
>
> + prog_data->prog_offset_dual_patch =
> + v.dual_patch_cfg ? g.generate_code(v.dual_patch_cfg, 8) : 0;
> +
> assembly = g.get_assembly(&prog_data->base.base.program_size);
> } else {
> brw::vec4_tes_visitor v(compiler, log_data, key, prog_data,
> diff --git a/src/mesa/drivers/dri/i965/genX_state_upload.c b/src/mesa/drivers/dri/i965/genX_state_upload.c
> index 8668abd591f..ceef11d46e1 100644
> --- a/src/mesa/drivers/dri/i965/genX_state_upload.c
> +++ b/src/mesa/drivers/dri/i965/genX_state_upload.c
> @@ -3984,6 +3984,13 @@ genX(upload_ds_state)(struct brw_context *brw)
> ds.DispatchMode = DISPATCH_MODE_SIMD8_SINGLE_PATCH;
> ds.UserClipDistanceCullTestEnableBitmask =
> vue_prog_data->cull_distance_mask;
> +#endif
> +#if GEN_GEN >= 9
> + if (tes_prog_data->prog_offset_dual_patch > 0) {
> + ds.DispatchMode = DISPATCH_MODE_SIMD8_SINGLE_OR_DUAL_PATCH;
> + ds.DUAL_PATCHKernelStartPointer = stage_state->prog_offset +
> + tes_prog_data->prog_offset_dual_patch;
> + }
> #endif
> }
> }
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dual-single-patch-shader-names.patch
Type: text/x-patch
Size: 1502 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20180223/4f6d10bd/attachment-0001.bin>
More information about the mesa-dev
mailing list