<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Aug 9, 2019 at 7:22 PM Francisco Jerez <<a href="mailto:currojerez@riseup.net">currojerez@riseup.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">See "i965/gen9: Optimize slice and subslice load balancing behavior."<br> for the rationale. Marked optional because no performance evaluation<br> has been done on this commit, it is provided to match the hashing<br> settings of the Iris driver. Test reports welcome.<br></blockquote><div><br></div><div>Improves Aztec Ruins by 2.7% (tested 5 times in each configuration). I also tested Batman: Arkham City but it only reports integer FPS numbers and no change was noticable.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> ---<br> src/intel/vulkan/anv_genX.h | 4 ++<br> src/intel/vulkan/anv_private.h | 6 ++<br> src/intel/vulkan/genX_blorp_exec.c | 6 ++<br> src/intel/vulkan/genX_cmd_buffer.c | 96 ++++++++++++++++++++++++++++++<br> 4 files changed, 112 insertions(+)<br> <br> diff --git a/src/intel/vulkan/anv_genX.h b/src/intel/vulkan/anv_genX.h<br> index a5435e566a3..06c6b467acf 100644<br> --- a/src/intel/vulkan/anv_genX.h<br> +++ b/src/intel/vulkan/anv_genX.h<br> @@ -44,6 +44,10 @@ void genX(cmd_buffer_apply_pipe_flushes)(struct anv_cmd_buffer *cmd_buffer);<br> <br> void genX(cmd_buffer_emit_gen7_depth_flush)(struct anv_cmd_buffer *cmd_buffer);<br> <br> +void genX(cmd_buffer_emit_hashing_mode)(struct anv_cmd_buffer *cmd_buffer,<br> + unsigned width, unsigned height,<br> + unsigned scale);<br> +<br> void genX(flush_pipeline_select_3d)(struct anv_cmd_buffer *cmd_buffer);<br> void genX(flush_pipeline_select_gpgpu)(struct anv_cmd_buffer *cmd_buffer);<br> <br> diff --git a/src/intel/vulkan/anv_private.h b/src/intel/vulkan/anv_private.h<br> index 2465f264354..b381386a716 100644<br> --- a/src/intel/vulkan/anv_private.h<br> +++ b/src/intel/vulkan/anv_private.h<br> @@ -2421,6 +2421,12 @@ struct anv_cmd_state {<br> <br> bool conditional_render_enabled;<br> <br> + /**<br> + * Last rendering scale argument provided to<br> + * genX(cmd_buffer_emit_hashing_mode)().<br> + */<br> + unsigned current_hash_scale;<br> +<br> /**<br> * Array length is anv_cmd_state::pass::attachment_count. Array content is<br> * valid only when recording a render pass instance.<br> diff --git a/src/intel/vulkan/genX_blorp_exec.c b/src/intel/vulkan/genX_blorp_exec.c<br> index 1592e7f7e3d..e9eedc06696 100644<br> --- a/src/intel/vulkan/genX_blorp_exec.c<br> +++ b/src/intel/vulkan/genX_blorp_exec.c<br> @@ -223,6 +223,12 @@ genX(blorp_exec)(struct blorp_batch *batch,<br> genX(cmd_buffer_config_l3)(cmd_buffer, cfg);<br> }<br> <br> + const unsigned scale = params->fast_clear_op ? UINT_MAX : 1;<br> + if (cmd_buffer->state.current_hash_scale != scale) {<br> + genX(cmd_buffer_emit_hashing_mode)(cmd_buffer, params->x1 - params->x0,<br> + params->y1 - params->y0, scale);<br> + }<br> +<br> #if GEN_GEN >= 11<br> /* The PIPE_CONTROL command description says:<br> *<br> diff --git a/src/intel/vulkan/genX_cmd_buffer.c b/src/intel/vulkan/genX_cmd_buffer.c<br> index 86ef1663ac4..e9e5570d49f 100644<br> --- a/src/intel/vulkan/genX_cmd_buffer.c<br> +++ b/src/intel/vulkan/genX_cmd_buffer.c<br> @@ -1595,6 +1595,7 @@ genX(CmdExecuteCommands)(<br> */<br> primary->state.current_pipeline = UINT32_MAX;<br> primary->state.current_l3_config = NULL;<br> + primary->state.current_hash_scale = 0;<br> <br> /* Each of the secondary command buffers will use its own state base<br> * address. We need to re-emit state base address for the primary after<br> @@ -2663,6 +2664,9 @@ genX(cmd_buffer_flush_state)(struct anv_cmd_buffer *cmd_buffer)<br> <br> genX(cmd_buffer_config_l3)(cmd_buffer, pipeline->urb.l3_config);<br> <br> + if (cmd_buffer->state.current_hash_scale != 1)<br> + genX(cmd_buffer_emit_hashing_mode)(cmd_buffer, UINT_MAX, UINT_MAX, 1);<br></blockquote><div><br></div><div>Can we do the check against current_hash_scale as an early return inside the emit function since that's already where the "current" value is updated? Maybe call it cmd_buffer_config_hashing_mode if "emit" no longer seems right with that change? I don't really care about the rename.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> +<br> genX(flush_pipeline_select_3d)(cmd_buffer);<br> <br> if (vb_emit) {<br> @@ -3925,6 +3929,98 @@ genX(cmd_buffer_emit_gen7_depth_flush)(struct anv_cmd_buffer *cmd_buffer)<br> }<br> }<br> <br> +/**<br> + * Update the pixel hashing modes that determine the balancing of PS threads<br> + * across subslices and slices.<br> + *<br> + * \param width Width bound of the rendering area (already scaled down if \p<br> + * scale is greater than 1).<br> + * \param height Height bound of the rendering area (already scaled down if \p<br> + * scale is greater than 1).<br> + * \param scale The number of framebuffer samples that could potentially be<br> + * affected by an individual channel of the PS thread. This is<br> + * typically one for single-sampled rendering, but for operations<br> + * like CCS resolves and fast clears a single PS invocation may<br> + * update a huge number of pixels, in which case a finer<br> + * balancing is desirable in order to maximally utilize the<br> + * bandwidth available. UINT_MAX can be used as shorthand for<br> + * "finest hashing mode available".<br> + */<br> +void<br> +genX(cmd_buffer_emit_hashing_mode)(struct anv_cmd_buffer *cmd_buffer,<br> + unsigned width, unsigned height,<br> + unsigned scale)<br> +{<br> +#if GEN_GEN == 9<br> + const struct gen_device_info *devinfo = &cmd_buffer->device->info;<br> + const unsigned slice_hashing[] = {<br> + /* Because all Gen9 platforms with more than one slice require<br> + * three-way subslice hashing, a single "normal" 16x16 slice hashing<br> + * block is guaranteed to suffer from substantial imbalance, with one<br> + * subslice receiving twice as much work as the other two in the<br> + * slice.<br> + *<br> + * The performance impact of that would be particularly severe when<br> + * three-way hashing is also in use for slice balancing (which is the<br> + * case for all Gen9 GT4 platforms), because one of the slices<br> + * receives one every three 16x16 blocks in either direction, which<br> + * is roughly the periodicity of the underlying subslice imbalance<br> + * pattern ("roughly" because in reality the hardware's<br> + * implementation of three-way hashing doesn't do exact modulo 3<br> + * arithmetic, which somewhat decreases the magnitude of this effect<br> + * in practice). This leads to a systematic subslice imbalance<br> + * within that slice regardless of the size of the primitive. The<br> + * 32x32 hashing mode guarantees that the subslice imbalance within a<br> + * single slice hashing block is minimal, largely eliminating this<br> + * effect.<br> + */<br> + _32x32,<br> + /* Finest slice hashing mode available. */<br> + NORMAL<br> + };<br> + const unsigned subslice_hashing[] = {<br> + /* 16x16 would provide a slight cache locality benefit especially<br> + * visible in the sampler L1 cache efficiency of low-bandwidth<br> + * non-LLC platforms, but it comes at the cost of greater subslice<br> + * imbalance for primitives of dimensions approximately intermediate<br> + * between 16x4 and 16x16.<br> + */<br> + _16x4,<br> + /* Finest subslice hashing mode available. */<br> + _8x4<br> + };<br> + /* Dimensions of the smallest hashing block of a given hashing mode. If<br> + * the rendering area is smaller than this there can't possibly be any<br> + * benefit from switching to this mode, so we optimize out the<br> + * transition.<br> + */<br> + const unsigned min_size[][2] = {<br> + { 16, 4 },<br> + { 8, 4 }<br> + };<br> + const unsigned idx = scale > 1;<br> +<br> + if (width > min_size[idx][0] || height > min_size[idx][1]) {<br> + uint32_t gt_mode;<br> +<br> + anv_pack_struct(&gt_mode, GENX(GT_MODE),<br> + .SliceHashing = (devinfo->num_slices > 1 ? slice_hashing[idx] : 0),<br> + .SliceHashingMask = (devinfo->num_slices > 1 ? -1 : 0),<br> + .SubsliceHashing = subslice_hashing[idx],<br> + .SubsliceHashingMask = -1);<br> +<br> + anv_batch_emit(&cmd_buffer->batch, GENX(PIPE_CONTROL), pc) {<br> + pc.StallAtPixelScoreboard = true;<br> + pc.CommandStreamerStallEnable = true;<br> + }<br></blockquote><div><br></div><div>This would be better as<br></div><div><br></div><div>cmd_buffer->state.pending_pipe_bits |= ANV_PIPE_CS_STALL_BIT;<br>genX(cmd_buffer_apply_pipe_flushes)(cmd_buffer);</div><div><br></div><div>That way workarounds get applied more properly and we can avoid redundant PIPE_CONTROLs. apply_pipe_flushes will insert the StallAtPixelScoreboard for you if needed. That said, maybe it's best to also OR in ANV_PIPE_STALL_AT_SCOREBOARD_BIT just to be sure.</div><div><br></div><div>Other than that, the patch looks good to me. Happy to see the perf bump. :)</div><div><br></div><div>I did consider maybe trying to move things around so that the change is done more as a subpass thing but there are some subtle interactions to think through there and I'm convinced what you have here is correct so let's go with it for now.</div><div><br></div><div>--Jason<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> +<br> + emit_lri(&cmd_buffer->batch, GENX(GT_MODE_num), gt_mode);<br> +<br> + cmd_buffer->state.current_hash_scale = scale;<br> + }<br> +#endif<br> +}<br> +<br> static void<br> cmd_buffer_emit_depth_stencil(struct anv_cmd_buffer *cmd_buffer)<br> {<br> -- <br> 2.22.0<br> <br> _______________________________________________<br> mesa-dev mailing list<br> <a href="mailto:mesa-dev@lists.freedesktop.org" target="_blank">mesa-dev@lists.freedesktop.org</a><br> <a href="https://lists.freedesktop.org/mailman/listinfo/mesa-dev" rel="noreferrer" target="_blank">https://lists.freedesktop.org/mailman/listinfo/mesa-dev</a></blockquote></div></div>