<div dir="ltr"><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Aug 10, 2019 at 4:55 PM Francisco Jerez <<a href="mailto:currojerez@riseup.net">currojerez@riseup.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Jason Ekstrand <<a href="mailto:jason@jlekstrand.net" target="_blank">jason@jlekstrand.net</a>> writes:<br>
<br>
> On Sat, Aug 10, 2019 at 2:22 PM Francisco Jerez <<a href="mailto:currojerez@riseup.net" target="_blank">currojerez@riseup.net</a>><br>
> wrote:<br>
><br>
>> Jason Ekstrand <<a href="mailto:jason@jlekstrand.net" target="_blank">jason@jlekstrand.net</a>> writes:<br>
>><br>
>> > On Fri, Aug 9, 2019 at 7:22 PM Francisco Jerez <<a href="mailto:currojerez@riseup.net" target="_blank">currojerez@riseup.net</a>><br>
>> > wrote:<br>
>> ><br>
>> >> See "i965/gen9: Optimize slice and subslice load balancing behavior."<br>
>> >> for the rationale. Marked optional because no performance evaluation<br>
>> >> has been done on this commit, it is provided to match the hashing<br>
>> >> settings of the Iris driver. Test reports welcome.<br>
>> >><br>
>> ><br>
>> > Improves Aztec Ruins by 2.7% (tested 5 times in each configuration). I<br>
>> > also tested Batman: Arkham City but it only reports integer FPS numbers<br>
>> > and no change was noticeable.<br>
>> ><br>
>> ><br>
>> >> ---<br>
>> >> src/intel/vulkan/anv_genX.h | 4 ++<br>
>> >> src/intel/vulkan/anv_private.h | 6 ++<br>
>> >> src/intel/vulkan/genX_blorp_exec.c | 6 ++<br>
>> >> src/intel/vulkan/genX_cmd_buffer.c | 96 ++++++++++++++++++++++++++++++<br>
>> >> 4 files changed, 112 insertions(+)<br>
>> >><br>
>> >> diff --git a/src/intel/vulkan/anv_genX.h b/src/intel/vulkan/anv_genX.h<br>
>> >> index a5435e566a3..06c6b467acf 100644<br>
>> >> --- a/src/intel/vulkan/anv_genX.h<br>
>> >> +++ b/src/intel/vulkan/anv_genX.h<br>
>> >> @@ -44,6 +44,10 @@ void genX(cmd_buffer_apply_pipe_flushes)(struct anv_cmd_buffer *cmd_buffer);<br>
>> >><br>
>> >>  void genX(cmd_buffer_emit_gen7_depth_flush)(struct anv_cmd_buffer *cmd_buffer);<br>
>> >><br>
>> >> +void genX(cmd_buffer_emit_hashing_mode)(struct anv_cmd_buffer *cmd_buffer,<br>
>> >> +                                        unsigned width, unsigned height,<br>
>> >> +                                        unsigned scale);<br>
>> >> +<br>
>> >>  void genX(flush_pipeline_select_3d)(struct anv_cmd_buffer *cmd_buffer);<br>
>> >>  void genX(flush_pipeline_select_gpgpu)(struct anv_cmd_buffer *cmd_buffer);<br>
>> >><br>
>> >> diff --git a/src/intel/vulkan/anv_private.h b/src/intel/vulkan/anv_private.h<br>
>> >> index 2465f264354..b381386a716 100644<br>
>> >> --- a/src/intel/vulkan/anv_private.h<br>
>> >> +++ b/src/intel/vulkan/anv_private.h<br>
>> >> @@ -2421,6 +2421,12 @@ struct anv_cmd_state {<br>
>> >><br>
>> >>    bool conditional_render_enabled;<br>
>> >><br>
>> >> + /**<br>
>> >> + * Last rendering scale argument provided to<br>
>> >> + * genX(cmd_buffer_emit_hashing_mode)().<br>
>> >> + */<br>
>> >> + unsigned current_hash_scale;<br>
>> >> +<br>
>> >> /**<br>
>> >> * Array length is anv_cmd_state::pass::attachment_count. Array content is<br>
>> >> * valid only when recording a render pass instance.<br>
>> >> diff --git a/src/intel/vulkan/genX_blorp_exec.c b/src/intel/vulkan/genX_blorp_exec.c<br>
>> >> index 1592e7f7e3d..e9eedc06696 100644<br>
>> >> --- a/src/intel/vulkan/genX_blorp_exec.c<br>
>> >> +++ b/src/intel/vulkan/genX_blorp_exec.c<br>
>> >> @@ -223,6 +223,12 @@ genX(blorp_exec)(struct blorp_batch *batch,<br>
>> >> genX(cmd_buffer_config_l3)(cmd_buffer, cfg);<br>
>> >> }<br>
>> >><br>
>> >> + const unsigned scale = params->fast_clear_op ? UINT_MAX : 1;<br>
>> >> + if (cmd_buffer->state.current_hash_scale != scale) {<br>
>> >> +      genX(cmd_buffer_emit_hashing_mode)(cmd_buffer, params->x1 - params->x0,<br>
>> >> +                                         params->y1 - params->y0, scale);<br>
>> >> + }<br>
>> >> +<br>
>> >> #if GEN_GEN >= 11<br>
>> >> /* The PIPE_CONTROL command description says:<br>
>> >> *<br>
>> >> diff --git a/src/intel/vulkan/genX_cmd_buffer.c b/src/intel/vulkan/genX_cmd_buffer.c<br>
>> >> index 86ef1663ac4..e9e5570d49f 100644<br>
>> >> --- a/src/intel/vulkan/genX_cmd_buffer.c<br>
>> >> +++ b/src/intel/vulkan/genX_cmd_buffer.c<br>
>> >> @@ -1595,6 +1595,7 @@ genX(CmdExecuteCommands)(<br>
>> >> */<br>
>> >> primary->state.current_pipeline = UINT32_MAX;<br>
>> >> primary->state.current_l3_config = NULL;<br>
>> >> + primary->state.current_hash_scale = 0;<br>
>> >><br>
>> >> /* Each of the secondary command buffers will use its own state base<br>
>> >> * address. We need to re-emit state base address for the primary after<br>
>> >> @@ -2663,6 +2664,9 @@ genX(cmd_buffer_flush_state)(struct anv_cmd_buffer *cmd_buffer)<br>
>> >><br>
>> >> genX(cmd_buffer_config_l3)(cmd_buffer, pipeline->urb.l3_config);<br>
>> >><br>
>> >> + if (cmd_buffer->state.current_hash_scale != 1)<br>
>> >> +      genX(cmd_buffer_emit_hashing_mode)(cmd_buffer, UINT_MAX, UINT_MAX, 1);<br>
>> >><br>
>> ><br>
>> > Can we do the check against current_hash_scale as an early return inside<br>
>> > the emit function since that's already where the "current" value is<br>
>> > updated? Maybe call it cmd_buffer_config_hashing_mode if "emit" no longer<br>
>> > seems right with that change? I don't really care about the rename.<br>
>> ><br>
>><br>
>> That's what my first revision did. I moved the current_hash_scale check<br>
>> one level up because the function call overhead was visible in<br>
>> brw_upload_pipeline_state(), which is quite a hot path in i965.<br>
><br>
><br>
> That's a bit surprising but believable if the two are in different files<br>
> and there's a real function call going on there.<br>
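[Editor's note: a rough standalone sketch of the two placements being discussed. The struct and function names below are invented for illustration and are not the actual ANV code; the early-return variant keeps the caching logic next to where current_hash_scale is updated, while the hoisted check avoids the function call entirely on a hot path.]

```c
#include <limits.h>

/* Toy stand-in for the relevant part of anv_cmd_state (hypothetical). */
struct toy_cmd_state {
   unsigned current_hash_scale; /* 0 means "unknown", forces a re-emit */
   unsigned emit_count;         /* counts actual state emissions */
};

/* Variant 1: early return inside the emit function, next to where the
 * "current" value is updated. */
static void
toy_emit_hashing_mode(struct toy_cmd_state *state, unsigned scale)
{
   if (state->current_hash_scale == scale)
      return; /* Redundant transition, skip the emission. */

   state->emit_count++; /* Stand-in for packing and emitting GT_MODE. */
   state->current_hash_scale = scale;
}

/* Variant 2: the caller performs the comparison itself, so no call is
 * made at all when the state is already current. */
static void
toy_flush_state(struct toy_cmd_state *state, unsigned scale)
{
   if (state->current_hash_scale != scale)
      toy_emit_hashing_mode(state, scale);
}
```

Both variants emit at most once per distinct scale; the only difference is where the comparison lives, which matters when the caller is a hot path and the emit function cannot be inlined.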
><br>
><br>
>> That<br>
>> said I don't have any data justifying the change in the Vulkan driver --<br>
>> because caller and callee are defined in the same compilation unit it<br>
>> *might* be that the function is inlined into<br>
>> genX(cmd_buffer_flush_state), though it's large enough that it may not<br>
>><br>
><br>
> I highly doubt it would be a problem. It's a lot of lines of code but once<br>
> you optimize things a bit, it's actually quite a small function.<br>
><br>
><br>
>> be; either way it seemed like a good idea to do things consistently<br>
>> across drivers.<br>
>><br>
><br>
> The functions right next to it for l3$ configuration and PIPELINE_SELECT<br>
> are both called unconditionally. IMO, consistency within the Vulkan driver<br>
> is more important than consistency between iris and ANV given that they<br>
> have completely different state tracking models.<br>
><br>
<br>
I don't really care one way or another since I have no data justifying<br>
the micro-optimization in Vulkan (I do have data showing a statistically<br>
significant benefit in CPU-bound tests from the corresponding change on<br>
i965). Will gladly undo the change in exchange for an R-b.<br></blockquote><div><br></div><div>Sounds fine with me. Just to be clear, were you expecting a conditional review or were you planning to send a v2? I was waiting for a v2.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> --Jason<br>
><br>
><br>
>> ><br>
>> >> +<br>
>> >> genX(flush_pipeline_select_3d)(cmd_buffer);<br>
>> >><br>
>> >> if (vb_emit) {<br>
>> >> @@ -3925,6 +3929,98 @@ genX(cmd_buffer_emit_gen7_depth_flush)(struct anv_cmd_buffer *cmd_buffer)<br>
>> >> }<br>
>> >> }<br>
>> >><br>
>> >> +/**<br>
>> >> + * Update the pixel hashing modes that determine the balancing of PS<br>
>> >> + * threads across subslices and slices.<br>
>> >> + *<br>
>> >> + * \param width Width bound of the rendering area (already scaled down<br>
>> >> + *              if \p scale is greater than 1).<br>
>> >> + * \param height Height bound of the rendering area (already scaled down<br>
>> >> + *               if \p scale is greater than 1).<br>
>> >> + * \param scale The number of framebuffer samples that could potentially<br>
>> >> + *              be affected by an individual channel of the PS thread.<br>
>> >> + *              This is typically one for single-sampled rendering, but<br>
>> >> + *              for operations like CCS resolves and fast clears a single<br>
>> >> + *              PS invocation may update a huge number of pixels, in<br>
>> >> + *              which case a finer balancing is desirable in order to<br>
>> >> + *              maximally utilize the bandwidth available. UINT_MAX can<br>
>> >> + *              be used as shorthand for "finest hashing mode available".<br>
>> >> + */<br>
>> >> +void<br>
>> >> +genX(cmd_buffer_emit_hashing_mode)(struct anv_cmd_buffer *cmd_buffer,<br>
>> >> + unsigned width, unsigned height,<br>
>> >> + unsigned scale)<br>
>> >> +{<br>
>> >> +#if GEN_GEN == 9<br>
>> >> + const struct gen_device_info *devinfo = &cmd_buffer->device->info;<br>
>> >> + const unsigned slice_hashing[] = {<br>
>> >> +      /* Because all Gen9 platforms with more than one slice require<br>
>> >> +       * three-way subslice hashing, a single "normal" 16x16 slice<br>
>> >> +       * hashing block is guaranteed to suffer from substantial<br>
>> >> +       * imbalance, with one subslice receiving twice as much work as<br>
>> >> +       * the other two in the slice.<br>
>> >> +       *<br>
>> >> +       * The performance impact of that would be particularly severe<br>
>> >> +       * when three-way hashing is also in use for slice balancing<br>
>> >> +       * (which is the case for all Gen9 GT4 platforms), because one of<br>
>> >> +       * the slices receives one every three 16x16 blocks in either<br>
>> >> +       * direction, which is roughly the periodicity of the underlying<br>
>> >> +       * subslice imbalance pattern ("roughly" because in reality the<br>
>> >> +       * hardware's implementation of three-way hashing doesn't do exact<br>
>> >> +       * modulo 3 arithmetic, which somewhat decreases the magnitude of<br>
>> >> +       * this effect in practice). This leads to a systematic subslice<br>
>> >> +       * imbalance within that slice regardless of the size of the<br>
>> >> +       * primitive. The 32x32 hashing mode guarantees that the subslice<br>
>> >> +       * imbalance within a single slice hashing block is minimal,<br>
>> >> +       * largely eliminating this effect.<br>
>> >> +       */<br>
>> >> + _32x32,<br>
>> >> + /* Finest slice hashing mode available. */<br>
>> >> + NORMAL<br>
>> >> + };<br>
>> >> + const unsigned subslice_hashing[] = {<br>
>> >> + /* 16x16 would provide a slight cache locality benefit especially<br>
>> >> + * visible in the sampler L1 cache efficiency of low-bandwidth<br>
>> >> +       * non-LLC platforms, but it comes at the cost of greater subslice<br>
>> >> +       * imbalance for primitives of dimensions approximately<br>
>> >> +       * intermediate between 16x4 and 16x16.<br>
>> >> + */<br>
>> >> + _16x4,<br>
>> >> + /* Finest subslice hashing mode available. */<br>
>> >> + _8x4<br>
>> >> + };<br>
>> >> +   /* Dimensions of the smallest hashing block of a given hashing mode. If<br>
>> >> +    * the rendering area is smaller than this there can't possibly be any<br>
>> >> + * benefit from switching to this mode, so we optimize out the<br>
>> >> + * transition.<br>
>> >> + */<br>
>> >> + const unsigned min_size[][2] = {<br>
>> >> + { 16, 4 },<br>
>> >> + { 8, 4 }<br>
>> >> + };<br>
>> >> + const unsigned idx = scale > 1;<br>
>> >> +<br>
>> >> + if (width > min_size[idx][0] || height > min_size[idx][1]) {<br>
>> >> + uint32_t gt_mode;<br>
>> >> +<br>
>> >> +      anv_pack_struct(&gt_mode, GENX(GT_MODE),<br>
>> >> +                      .SliceHashing = (devinfo->num_slices > 1 ?<br>
>> >> +                                       slice_hashing[idx] : 0),<br>
>> >> +                      .SliceHashingMask = (devinfo->num_slices > 1 ? -1 : 0),<br>
>> >> +                      .SubsliceHashing = subslice_hashing[idx],<br>
>> >> +                      .SubsliceHashingMask = -1);<br>
>> >> +<br>
>> >> + anv_batch_emit(&cmd_buffer->batch, GENX(PIPE_CONTROL), pc) {<br>
>> >> + pc.StallAtPixelScoreboard = true;<br>
>> >> + pc.CommandStreamerStallEnable = true;<br>
>> >> + }<br>
>> >><br>
>> ><br>
>> > This would be better as<br>
>> ><br>
>> > cmd_buffer->state.pending_pipe_bits |= ANV_PIPE_CS_STALL_BIT;<br>
>> > genX(cmd_buffer_apply_pipe_flushes)(cmd_buffer);<br>
>> ><br>
>> > That way workarounds get applied more properly and we can avoid redundant<br>
>> > PIPE_CONTROLs. apply_pipe_flushes will insert the StallAtPixelScoreboard<br>
>> > for you if needed. That said, maybe it's best to also OR in<br>
>> > ANV_PIPE_STALL_AT_SCOREBOARD_BIT just to be sure.<br>
>> ><br>
>><br>
>> OK, changed locally.<br>
>><br>
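[Editor's note: a minimal standalone sketch of the deferred-flush pattern suggested above. The bit names, types, and counters below are invented for illustration; the real bits and apply function live in anv_private.h / genX_cmd_buffer.c. Callers OR required stalls into a pending bitmask and a later apply call folds them into a single PIPE_CONTROL, which is what allows redundant stalls to be merged.]

```c
#include <stdint.h>

/* Hypothetical stand-ins for ANV_PIPE_*_BIT flags. */
enum {
   TOY_PIPE_CS_STALL_BIT            = 1 << 0,
   TOY_PIPE_STALL_AT_SCOREBOARD_BIT = 1 << 1,
};

struct toy_batch_state {
   uint32_t pending_pipe_bits;     /* accumulated, not yet emitted */
   unsigned pipe_controls_emitted; /* counts emitted PIPE_CONTROLs */
};

/* Emit at most one PIPE_CONTROL covering all accumulated bits. */
static void
toy_apply_pipe_flushes(struct toy_batch_state *state)
{
   if (state->pending_pipe_bits == 0)
      return;

   /* Stand-in for emitting one PIPE_CONTROL whose fields are derived
    * from the combined bits, with any required workarounds applied. */
   state->pipe_controls_emitted++;
   state->pending_pipe_bits = 0;
}
```

Accumulating bits from several callers and flushing once is what avoids back-to-back redundant PIPE_CONTROLs compared to emitting one directly at each call site.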
>> > Other than that, the patch looks good to me. Happy to see the perf bump. :)<br>
>> ><br>
>><br>
>> Thanks for testing it out!<br>
>><br>
>> > I did consider maybe trying to move things around so that the change is<br>
>> > done more as a subpass thing but there are some subtle interactions to<br>
>> > think through there and I'm convinced what you have here is correct so<br>
>> > let's go with it for now.<br>
>> ><br>
>> > --Jason<br>
>> ><br>
>> ><br>
>> >> +<br>
>> >> + emit_lri(&cmd_buffer->batch, GENX(GT_MODE_num), gt_mode);<br>
>> >> +<br>
>> >> + cmd_buffer->state.current_hash_scale = scale;<br>
>> >> + }<br>
>> >> +#endif<br>
>> >> +}<br>
>> >> +<br>
>> >> static void<br>
>> >> cmd_buffer_emit_depth_stencil(struct anv_cmd_buffer *cmd_buffer)<br>
>> >> {<br>
>> >> --<br>
>> >> 2.22.0<br>
>> >><br>
>> >> _______________________________________________<br>
>> >> mesa-dev mailing list<br>
>> >> <a href="mailto:mesa-dev@lists.freedesktop.org" target="_blank">mesa-dev@lists.freedesktop.org</a><br>
>> >> <a href="https://lists.freedesktop.org/mailman/listinfo/mesa-dev" rel="noreferrer" target="_blank">https://lists.freedesktop.org/mailman/listinfo/mesa-dev</a><br>
>><br>
</blockquote></div></div>