[Mesa-dev] [PATCH 4/4] OPTIONAL: anv/gen9: Optimize slice and subslice load balancing behavior.

Mon Aug 12 15:08:44 UTC 2019

On Sat, Aug 10, 2019 at 4:55 PM Francisco Jerez <currojerez at riseup.net>
wrote:

> Jason Ekstrand <jason at jlekstrand.net> writes:
>
> > On Sat, Aug 10, 2019 at 2:22 PM Francisco Jerez <currojerez at riseup.net>
> > wrote:
> >
> >> Jason Ekstrand <jason at jlekstrand.net> writes:
> >>
> >> > On Fri, Aug 9, 2019 at 7:22 PM Francisco Jerez <currojerez at riseup.net
> >
> >> > wrote:
> >> >
> >> >> See "i965/gen9: Optimize slice and subslice load balancing behavior."
> >> >> for the rationale.  Marked optional because no performance evaluation
> >> >> has been done on this commit, it is provided to match the hashing
> >> >> settings of the Iris driver.  Test reports welcome.
> >> >>
> >> >
> >> > Improves Aztec Ruins by 2.7% (tested 5 times in each configuration).
> I
> >> > also tested Batman: Arkham City but it only reports integer FPS
> numbers
> >> and
> >> > no change was noticable.
> >> >
> >> >
> >> >> ---
> >> >>  src/intel/vulkan/anv_genX.h        |  4 ++
> >> >>  src/intel/vulkan/anv_private.h     |  6 ++
> >> >>  src/intel/vulkan/genX_blorp_exec.c |  6 ++
> >> >>  src/intel/vulkan/genX_cmd_buffer.c | 96
> ++++++++++++++++++++++++++++++
> >> >>  4 files changed, 112 insertions(+)
> >> >>
> >> >> diff --git a/src/intel/vulkan/anv_genX.h
> b/src/intel/vulkan/anv_genX.h
> >> >> index a5435e566a3..06c6b467acf 100644
> >> >> --- a/src/intel/vulkan/anv_genX.h
> >> >> +++ b/src/intel/vulkan/anv_genX.h
> >> >> @@ -44,6 +44,10 @@ void genX(cmd_buffer_apply_pipe_flushes)(struct
> >> >> anv_cmd_buffer *cmd_buffer);
> >> >>
> >> >>  void genX(cmd_buffer_emit_gen7_depth_flush)(struct anv_cmd_buffer
> >> >> *cmd_buffer);
> >> >>
> >> >> +void genX(cmd_buffer_emit_hashing_mode)(struct anv_cmd_buffer
> >> *cmd_buffer,
> >> >> +                                        unsigned width, unsigned
> >> height,
> >> >> +                                        unsigned scale);
> >> >> +
> >> >>  void genX(flush_pipeline_select_3d)(struct anv_cmd_buffer
> *cmd_buffer);
> >> >>  void genX(flush_pipeline_select_gpgpu)(struct anv_cmd_buffer
> >> *cmd_buffer);
> >> >>
> >> >> diff --git a/src/intel/vulkan/anv_private.h
> >> >> b/src/intel/vulkan/anv_private.h
> >> >> index 2465f264354..b381386a716 100644
> >> >> --- a/src/intel/vulkan/anv_private.h
> >> >> +++ b/src/intel/vulkan/anv_private.h
> >> >> @@ -2421,6 +2421,12 @@ struct anv_cmd_state {
> >> >>
> >> >>     bool
> >> >>  conditional_render_enabled;
> >> >>
> >> >> +   /**
> >> >> +    * Last rendering scale argument provided to
> >> >> +    * genX(cmd_buffer_emit_hashing_mode)().
> >> >> +    */
> >> >> +   unsigned                                     current_hash_scale;
> >> >> +
> >> >>     /**
> >> >>      * Array length is anv_cmd_state::pass::attachment_count. Array
> >> >> content is
> >> >>      * valid only when recording a render pass instance.
> >> >> diff --git a/src/intel/vulkan/genX_blorp_exec.c
> >> >> b/src/intel/vulkan/genX_blorp_exec.c
> >> >> index 1592e7f7e3d..e9eedc06696 100644
> >> >> --- a/src/intel/vulkan/genX_blorp_exec.c
> >> >> +++ b/src/intel/vulkan/genX_blorp_exec.c
> >> >> @@ -223,6 +223,12 @@ genX(blorp_exec)(struct blorp_batch *batch,
> >> >>        genX(cmd_buffer_config_l3)(cmd_buffer, cfg);
> >> >>     }
> >> >>
> >> >> +   const unsigned scale = params->fast_clear_op ? UINT_MAX : 1;
> >> >> +   if (cmd_buffer->state.current_hash_scale != scale) {
> >> >> +      genX(cmd_buffer_emit_hashing_mode)(cmd_buffer, params->x1 -
> >> >> params->x0,
> >> >> +                                         params->y1 - params->y0,
> >> scale);
> >> >> +   }
> >> >> +
> >> >>  #if GEN_GEN >= 11
> >> >>     /* The PIPE_CONTROL command description says:
> >> >>      *
> >> >> diff --git a/src/intel/vulkan/genX_cmd_buffer.c
> >> >> b/src/intel/vulkan/genX_cmd_buffer.c
> >> >> index 86ef1663ac4..e9e5570d49f 100644
> >> >> --- a/src/intel/vulkan/genX_cmd_buffer.c
> >> >> +++ b/src/intel/vulkan/genX_cmd_buffer.c
> >> >> @@ -1595,6 +1595,7 @@ genX(CmdExecuteCommands)(
> >> >>      */
> >> >>     primary->state.current_pipeline = UINT32_MAX;
> >> >>     primary->state.current_l3_config = NULL;
> >> >> +   primary->state.current_hash_scale = 0;
> >> >>
> >> >>     /* Each of the secondary command buffers will use its own state
> base
> >> >>      * address.  We need to re-emit state base address for the
> primary
> >> >> after
> >> >> @@ -2663,6 +2664,9 @@ genX(cmd_buffer_flush_state)(struct
> anv_cmd_buffer
> >> >> *cmd_buffer)
> >> >>
> >> >>     genX(cmd_buffer_config_l3)(cmd_buffer, pipeline->urb.l3_config);
> >> >>
> >> >> +   if (cmd_buffer->state.current_hash_scale != 1)
> >> >> +      genX(cmd_buffer_emit_hashing_mode)(cmd_buffer, UINT_MAX,
> >> UINT_MAX,
> >> >> 1);
> >> >>
> >> >
> >> > Can we do the check against current_hash_scale as an early return
> inside
> >> > the emit function since that's already where the "current" value is
> >> > updated?  Maybe call it cmd_buffer_config_hashing_mode if "emit" no
> >> longer
> >> > seems right with that change?  I don't really care about the rename.
> >> >
> >>
> >> That's what my first revision did.  I moved the current_hash_scale check
> >> one level up because the function call overhead was visible in
> >> brw_upload_pipeline_state(), which is quite a hot path in i965.
> >
> >
> > That's a bit surprising but believable if the two are in different files
> > and there's a real function call going on there.
> >
> >
> >>   That
> >> said I don't have any data justifying the change in the Vulkan driver --
> >> Because caller and callee are defined in the same compilation unit it
> >> *might* be that the function is inlined into
> >> genX(cmd_buffer_flush_state), though it's large enough that it may not
> >>
> >
> > I highly doubt it would be a problem.  It's a lot of lines of code but
> once
> > you optimize things a bit, it's actually quite a small function.
> >
> >
> >> be, either way it seemed like a good idea to do things consistently
> >> across drivers.
> >>
> >
> > The functions right next to it for l3$ configuration and PIPELINE_SELECT
> > are both called unconditionally.  IMO, consistency within the Vulkan
> driver
> > is more important than consistency between iris and ANV given that they
> > have completely different state tracking models.
> >
>
> I don't really care one way or another since I have no data justifying
> the micro-optimization in Vulkan (I do have data showing a statistically
> significant benefit in CPU-bound tests from the corresponding change on
> i965).  Will gladly undo the change in exchange for an R-b.
>

Sounds fine with me.  Just to be clear, were you expecting a conditional
review or were you planning to send a v2?  I was waiting for a v2.

> > --Jason
> >
> >
> >> >
> >> >> +
> >> >>     genX(flush_pipeline_select_3d)(cmd_buffer);
> >> >>
> >> >>     if (vb_emit) {
> >> >> @@ -3925,6 +3929,98 @@ genX(cmd_buffer_emit_gen7_depth_flush)(struct
> >> >> anv_cmd_buffer *cmd_buffer)
> >> >>     }
> >> >>  }
> >> >>
> >> >> +/**
> >> >> + * Update the pixel hashing modes that determine the balancing of PS
> >> >> threads
> >> >> + * across subslices and slices.
> >> >> + *
> >> >> + * \param width Width bound of the rendering area (already scaled
> down
> >> if
> >> >> \p
> >> >> + *              scale is greater than 1).
> >> >> + * \param height Height bound of the rendering area (already scaled
> >> down
> >> >> if \p
> >> >> + *               scale is greater than 1).
> >> >> + * \param scale The number of framebuffer samples that could
> >> potentially
> >> >> be
> >> >> + *              affected by an individual channel of the PS thread.
> >> This
> >> >> is
> >> >> + *              typically one for single-sampled rendering, but for
> >> >> operations
> >> >> + *              like CCS resolves and fast clears a single PS
> >> invocation
> >> >> may
> >> >> + *              update a huge number of pixels, in which case a
> finer
> >> >> + *              balancing is desirable in order to maximally utilize
> >> the
> >> >> + *              bandwidth available.  UINT_MAX can be used as
> shorthand
> >> >> for
> >> >> + *              "finest hashing mode available".
> >> >> + */
> >> >> +void
> >> >> +genX(cmd_buffer_emit_hashing_mode)(struct anv_cmd_buffer
> *cmd_buffer,
> >> >> +                                   unsigned width, unsigned height,
> >> >> +                                   unsigned scale)
> >> >> +{
> >> >> +#if GEN_GEN == 9
> >> >> +   const struct gen_device_info *devinfo =
> &cmd_buffer->device->info;
> >> >> +   const unsigned slice_hashing[] = {
> >> >> +      /* Because all Gen9 platforms with more than one slice require
> >> >> +       * three-way subslice hashing, a single "normal" 16x16 slice
> >> hashing
> >> >> +       * block is guaranteed to suffer from substantial imbalance,
> with
> >> >> one
> >> >> +       * subslice receiving twice as much work as the other two in
> the
> >> >> +       * slice.
> >> >> +       *
> >> >> +       * The performance impact of that would be particularly severe
> >> when
> >> >> +       * three-way hashing is also in use for slice balancing
> (which is
> >> >> the
> >> >> +       * case for all Gen9 GT4 platforms), because one of the slices
> >> >> +       * receives one every three 16x16 blocks in either direction,
> >> which
> >> >> +       * is roughly the periodicity of the underlying subslice
> >> imbalance
> >> >> +       * pattern ("roughly" because in reality the hardware's
> >> >> +       * implementation of three-way hashing doesn't do exact
> modulo 3
> >> >> +       * arithmetic, which somewhat decreases the magnitude of this
> >> effect
> >> >> +       * in practice).  This leads to a systematic subslice
> imbalance
> >> >> +       * within that slice regardless of the size of the primitive.
> >> The
> >> >> +       * 32x32 hashing mode guarantees that the subslice imbalance
> >> within
> >> >> a
> >> >> +       * single slice hashing block is minimal, largely eliminating
> >> this
> >> >> +       * effect.
> >> >> +       */
> >> >> +      _32x32,
> >> >> +      /* Finest slice hashing mode available. */
> >> >> +      NORMAL
> >> >> +   };
> >> >> +   const unsigned subslice_hashing[] = {
> >> >> +      /* 16x16 would provide a slight cache locality benefit
> especially
> >> >> +       * visible in the sampler L1 cache efficiency of low-bandwidth
> >> >> +       * non-LLC platforms, but it comes at the cost of greater
> >> subslice
> >> >> +       * imbalance for primitives of dimensions approximately
> >> intermediate
> >> >> +       * between 16x4 and 16x16.
> >> >> +       */
> >> >> +      _16x4,
> >> >> +      /* Finest subslice hashing mode available. */
> >> >> +      _8x4
> >> >> +   };
> >> >> +   /* Dimensions of the smallest hashing block of a given hashing
> mode.
> >> >> If
> >> >> +    * the rendering area is smaller than this there can't possibly
> be
> >> any
> >> >> +    * benefit from switching to this mode, so we optimize out the
> >> >> +    * transition.
> >> >> +    */
> >> >> +   const unsigned min_size[][2] = {
> >> >> +         { 16, 4 },
> >> >> +         { 8, 4 }
> >> >> +   };
> >> >> +   const unsigned idx = scale > 1;
> >> >> +
> >> >> +   if (width > min_size[idx][0] || height > min_size[idx][1]) {
> >> >> +      uint32_t gt_mode;
> >> >> +
> >> >> +      anv_pack_struct(&gt_mode, GENX(GT_MODE),
> >> >> +                      .SliceHashing = (devinfo->num_slices > 1 ?
> >> >> slice_hashing[idx] : 0),
> >> >> +                      .SliceHashingMask = (devinfo->num_slices > 1 ?
> >> -1 :
> >> >> 0),
> >> >> +                      .SubsliceHashing = subslice_hashing[idx],
> >> >> +                      .SubsliceHashingMask = -1);
> >> >> +
> >> >> +      anv_batch_emit(&cmd_buffer->batch, GENX(PIPE_CONTROL), pc) {
> >> >> +         pc.StallAtPixelScoreboard = true;
> >> >> +         pc.CommandStreamerStallEnable = true;
> >> >> +      }
> >> >>
> >> >
> >> > This would be better as
> >> >
> >> > cmd_buffer->state.pending_pipe_bits |= ANV_PIPE_CS_STALL_BIT;
> >> > genX(cmd_buffer_apply_pipe_flushes)(cmd_buffer);
> >> >
> >> > That way workarounds get applied more properly and we can avoid
> redundant
> >> > PIPE_CONTROLs.  apply_pipe_flushes will insert the
> StallAtPixelScoreboard
> >> > for you if needed.  That said, maybe it's best to also OR in
> >> > ANV_PIPE_STALL_AT_SCOREBOARD_BIT just to be sure.
> >> >
> >>
> >> OK, changed locally.
> >>
> >> > Other than that, the patch looks good to me.  Happy to see the perf
> >> bump. :)
> >> >
> >>
> >> Thanks for testing it out!
> >>
> >> > I did consider maybe trying to move things around so that the change
> is
> >> > done more as a subpass thing but there are some subtle interactions to
> >> > think through there and I'm convinced what you have here is correct so
> >> > let's go with it for now.
> >> >
> >> > --Jason
> >> >
> >> >
> >> >> +
> >> >> +      emit_lri(&cmd_buffer->batch, GENX(GT_MODE_num), gt_mode);
> >> >> +
> >> >> +      cmd_buffer->state.current_hash_scale = scale;
> >> >> +   }
> >> >> +#endif
> >> >> +}
> >> >> +
> >> >>  static void
> >> >>  cmd_buffer_emit_depth_stencil(struct anv_cmd_buffer *cmd_buffer)
> >> >>  {
> >> >> --
> >> >> 2.22.0
> >> >>
> >> >> _______________________________________________
> >> >> mesa-dev mailing list
> >> >> mesa-dev at lists.freedesktop.org
> >> >> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> >>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20190812/24a5a3e0/attachment-0001.html>