[Mesa-dev] [PATCH resend 3/7] i965: Enable hardware-generated binding tables on render path.

Kenneth Graunke kenneth at whitecape.org
Tue Jun 2 00:54:06 PDT 2015


On Monday, June 01, 2015 03:14:26 PM Abdiel Janulgue wrote:
> This patch implements the binding table enable command which is also
> used to allocate a binding table pool where where hardware-generated
> binding table entries are flushed into. Each binding table offset in
> the binding table pool is unique per each shader stage that are
> enabled within a batch.
> 
> Also insert the required brw_tracked_state objects to enable
> hw-generated binding tables in normal render path.
> 
> v2: Clarify start of binding table pool offsets. (Topi)
> 
> Signed-off-by: Abdiel Janulgue <abdiel.janulgue at linux.intel.com>
> ---
>  src/mesa/drivers/dri/i965/brw_binding_tables.c | 101 +++++++++++++++++++++++++
>  src/mesa/drivers/dri/i965/brw_context.c        |   4 +
>  src/mesa/drivers/dri/i965/brw_context.h        |   6 ++
>  src/mesa/drivers/dri/i965/brw_state.h          |   6 ++
>  src/mesa/drivers/dri/i965/brw_state_upload.c   |   4 +
>  src/mesa/drivers/dri/i965/intel_batchbuffer.c  |   4 +
>  6 files changed, 125 insertions(+)
> 
> diff --git a/src/mesa/drivers/dri/i965/brw_binding_tables.c b/src/mesa/drivers/dri/i965/brw_binding_tables.c
> index 98ff0dd..7554818 100644
> --- a/src/mesa/drivers/dri/i965/brw_binding_tables.c
> +++ b/src/mesa/drivers/dri/i965/brw_binding_tables.c
> @@ -45,6 +45,23 @@
>  #include "intel_batchbuffer.h"
>  
>  /**
> + * We are required to start at this offset for binding table pointer state when
> + * HW-generated binding table is enabled otherwise the GPU will hung. Note that

typo: "will hang" (not hung)

> + * the binding table offsets are now relative to the binding tabe pool base

typo: "table pool base" (not tabe pool)

> + * address instead of from the state batch.
> + *
> + * From the Bspec 3DSTATE_BINDING_TABLE_POINTERS_{PS/VS/GS/DS/HS} > Pointer to
> + * PS Binding Table section lists the format as:
> + *
> + *	"SurfaceStateOffset[16:6]BINDING_TABLE_STATE*256 When
> + *	HW-generated binding table is enabled"
> + *
> + * When HW-generated binding tables are enabled, Surface State Offsets are
> + * 16-bit entries.
> + */
> +static const uint32_t hw_bt_start_offset = 256 * sizeof(uint16_t);

I'm confused.  I don't see how this citation is relevant.  Binding
tables are always 256 entries, regardless of the resource streamer.

The text you cited is just saying that entries need to be 64 byte
aligned when using hardware binding tables, rather than the 32 byte
alignment that's sufficient when using software binding tables.

It does appear that we gain an extra bit - bit 16 used to MBZ, but
now it's valid.

You enforce the 64-byte alignment restriction in patch 5, via this line:

      brw->hw_bt_pool.next_offset += ALIGN(prog_data->binding_table.size_bytes,
                                           64);

At any rate, the text doesn't say *anything* about having to add
something to the starting address.  Offset 0 seems perfectly valid.

Is this really necessary?

> +
> +/**
>   * Upload a shader stage's binding table as indirect state.
>   *
>   * This copies brw_stage_state::surf_offset[] into the indirect state section
> @@ -170,6 +187,90 @@ const struct brw_tracked_state brw_gs_binding_table = {
>     .emit = brw_gs_upload_binding_table,
>  };
>  
> +/**
> + * Hardware-generated binding tables for the resource streamer
> + */
> +void
> +gen7_disable_hw_binding_tables(struct brw_context *brw)
> +{
> +   int pkt_len = brw->gen >= 8 ? 4 : 3;
> +
> +   BEGIN_BATCH(3);
> +   OUT_BATCH(_3DSTATE_BINDING_TABLE_POOL_ALLOC << 16 | (pkt_len - 2));
> +   OUT_BATCH(brw->is_haswell ? HSW_HW_BINDING_TABLE_RESERVED : 0);
> +   OUT_BATCH(0);
> +   ADVANCE_BATCH();
> +
> +   /* From the BSpec, 3D Pipeline > Resource Streamer > Hardware Binding
> +    * Tables > Programming note
> +
> +    * "When switching between HW and SW binding table generation, SW must
> +    * issue a state cache invalidate."
> +    */

Please cite the public documentation where possible.

/* From the Haswell PRM, Volume 7: 3D Media GPGPU, 
 * 3DSTATE_BINDING_TABLE_POOL_ALLOC > Programming Note:
 *
 * "When switching between HW and SW binding table generation, SW must
 *  issue a state cache invalidate."
 */

> +   brw_emit_pipe_control_flush(brw, PIPE_CONTROL_STATE_CACHE_INVALIDATE);

Is it worth throwing out hw_bt_pool.bo?

> +}
> +
> +void
> +gen7_enable_hw_binding_tables(struct brw_context *brw)
> +{
> +   if (!brw->has_resource_streamer) {
> +      gen7_disable_hw_binding_tables(brw);
> +      return;

kind of funny having a function called gen7_enable_hw_binding_tables
that disables hardware binding tables, but...hey :)

> +   }
> +
> +   if (!brw->hw_bt_pool.bo) {
> +      /* We use a single re-usable buffer object for the lifetime of the
> +       * context and size it to maximum allowed binding tables that can be
> +       * programmed per batch:
> +       *
> +       * BSpec, 3D Pipeline > Resource Streamer > Hardware Binding Tables:
> +       * "A maximum of 16,383 Binding tables are allowed in any batch buffer"
> +       */

/* From the Haswell PRM, Volume 7: 3D Media GPGPU, 
 * 3DSTATE_BINDING_TABLE_POOL_ALLOC > Programming Note:
 *
 * "A maximum of 16,383 Binding Tables are allowed in any batch buffer."
 */

> +      static const int max_size = 16383 * 4;

While I believe this does enforce the restriction, it's a bit strange.

My reading of the restriction is that you can only have 16,383 *binding
tables*, each of which could have 1-256 entries.  So the maximum
possible value would be 16383 * 256 * 4.  But, you'd have to count the
number of binding tables you emit...which seems like a pain.

This enforces that no more than 16,383 binding table *entries* can be
put in the buffer, which seems OK.  It also means that a batch buffer
can't refer to more than that many...

That said, I think you've got a nasty bug here: there is no code to
detect when you've filled up the buffer (unless I've missed something?).
Presumably, the batch buffer usually fills up, triggering an implicit
intel_batchbuffer_flush that calls gen7_reset_rs_pool_offsets() and
saves you from buffer overflow.

However, if the batch buffer were larger, or we repeatedly drew using
shaders that have large binding tables and relatively few other state
changes, I imagine it could fill up and access out of bounds memory,
crashing the GPU.  Best to guard against this explicitly.

> +      brw->hw_bt_pool.bo = drm_intel_bo_alloc(brw->bufmgr, "hw_bt",
> +                                              max_size, 64);
> +      brw->hw_bt_pool.next_offset = hw_bt_start_offset;
> +   }
> +
> +   int pkt_len = brw->gen >= 8 ? 4 : 3;
> +   uint32_t dw1 = BRW_HW_BINDING_TABLE_ENABLE;
> +   if (brw->is_haswell)
> +      dw1 |= SET_FIELD(GEN7_MOCS_L3, GEN7_HW_BT_MOCS) |
> +         HSW_HW_BINDING_TABLE_RESERVED;

We should set MOCS on Broadwell too.  (Note that it has a different bit
location...)

> +
> +   BEGIN_BATCH(3);
> +   OUT_BATCH(_3DSTATE_BINDING_TABLE_POOL_ALLOC << 16 | (pkt_len - 2));
> +   if (brw->gen >= 8)
> +      OUT_RELOC64(brw->hw_bt_pool.bo, I915_GEM_DOMAIN_SAMPLER, 0, dw1);
> +   else
> +      OUT_RELOC(brw->hw_bt_pool.bo, I915_GEM_DOMAIN_SAMPLER, 0, dw1);
> +   OUT_RELOC(brw->hw_bt_pool.bo, I915_GEM_DOMAIN_SAMPLER, 0,
> +             brw->hw_bt_pool.bo->size);

This is wrong for Broadwell.  On Broadwell, you simply program the
buffer size, not a relocation to the maximum address in the buffer.

i.e. it should be

   if (brw->gen >= 8) {
      OUT_RELOC64(brw->hw_bt_pool.bo, I915_GEM_DOMAIN_SAMPLER, 0, dw1);
      OUT_BATCH(brw->hw_bt_pool.bo->size);
   } else {
      OUT_RELOC(brw->hw_bt_pool.bo, I915_GEM_DOMAIN_SAMPLER, 0, dw1);
      OUT_RELOC(brw->hw_bt_pool.bo, I915_GEM_DOMAIN_SAMPLER, 0,
                brw->hw_bt_pool.bo->size);
   }

> +   ADVANCE_BATCH();
> +
> +   /* From the BSpec, 3D Pipeline > Resource Streamer > Hardware Binding
> +    * Tables > Programming note
> +
> +    * "When switching between HW and SW binding table generation, SW must
> +    * issue a state cache invalidate."
> +    */
> +   brw_emit_pipe_control_flush(brw, PIPE_CONTROL_STATE_CACHE_INVALIDATE);
> +}
> +
> +void
> +gen7_reset_rs_pool_offsets(struct brw_context *brw)

Let's call this hw_bt_pool.  The resource streamer is related, but this
is about hardware binding tables.

> +{
> +   brw->hw_bt_pool.next_offset = hw_bt_start_offset;
> +}

I was concerned about trashing data here - if we flush the batch and
reset the offsets, but don't allocate a new buffer, the next batch will
happily scribble over our old data.  But I suppose this is OK: because
the GPU writes the data (via 3DSTATE_BINDING_TABLE_POINTERS_XS?), not
the CPU, it should be pipelined, so the old data will no longer be in
use.  So it should be safe.

> +
> +const struct brw_tracked_state gen7_hw_binding_tables = {
> +   .dirty = {
> +      .mesa = 0,
> +      .brw = BRW_NEW_BATCH,
> +   },
> +   .emit = gen7_enable_hw_binding_tables
> +};
> +
>  /** @} */
>  
>  /**
> diff --git a/src/mesa/drivers/dri/i965/brw_context.c b/src/mesa/drivers/dri/i965/brw_context.c
> index 274a237..e17046a 100644
> --- a/src/mesa/drivers/dri/i965/brw_context.c
> +++ b/src/mesa/drivers/dri/i965/brw_context.c
> @@ -958,6 +958,10 @@ intelDestroyContext(__DRIcontext * driContextPriv)
>     if (brw->wm.base.scratch_bo)
>        drm_intel_bo_unreference(brw->wm.base.scratch_bo);
>  
> +   gen7_reset_rs_pool_offsets(brw);

Not sure if this is worth doing.  Either way I guess.

> +   drm_intel_bo_unreference(brw->hw_bt_pool.bo);
> +   brw->hw_bt_pool.bo = NULL;
> +
>     drm_intel_gem_context_destroy(brw->hw_ctx);
>  
>     if (ctx->swrast_context) {
> diff --git a/src/mesa/drivers/dri/i965/brw_context.h b/src/mesa/drivers/dri/i965/brw_context.h
> index 3f8e59d..94127b6 100644
> --- a/src/mesa/drivers/dri/i965/brw_context.h
> +++ b/src/mesa/drivers/dri/i965/brw_context.h
> @@ -1404,6 +1404,12 @@ struct brw_context
>        struct brw_cs_prog_data *prog_data;
>     } cs;
>  
> +   /* RS hardware binding table */
> +   struct {
> +      drm_intel_bo *bo;
> +      uint32_t next_offset;
> +   } hw_bt_pool;
> +
>     struct {
>        uint32_t state_offset;
>        uint32_t blend_state_offset;
> diff --git a/src/mesa/drivers/dri/i965/brw_state.h b/src/mesa/drivers/dri/i965/brw_state.h
> index 987672f..622ce3f 100644
> --- a/src/mesa/drivers/dri/i965/brw_state.h
> +++ b/src/mesa/drivers/dri/i965/brw_state.h
> @@ -132,6 +132,7 @@ extern const struct brw_tracked_state gen7_sol_state;
>  extern const struct brw_tracked_state gen7_urb;
>  extern const struct brw_tracked_state gen7_vs_state;
>  extern const struct brw_tracked_state gen7_wm_state;
> +extern const struct brw_tracked_state gen7_hw_binding_tables;
>  extern const struct brw_tracked_state haswell_cut_index;
>  extern const struct brw_tracked_state gen8_blend_state;
>  extern const struct brw_tracked_state gen8_disable_stages;
> @@ -372,6 +373,11 @@ gen7_upload_constant_state(struct brw_context *brw,
>                             const struct brw_stage_state *stage_state,
>                             bool active, unsigned opcode);
>  
> +void gen7_rs_control(struct brw_context *brw, int enable);
> +void gen7_enable_hw_binding_tables(struct brw_context *brw);
> +void gen7_disable_hw_binding_tables(struct brw_context *brw);
> +void gen7_reset_rs_pool_offsets(struct brw_context *brw);
> +
>  #ifdef __cplusplus
>  }
>  #endif
> diff --git a/src/mesa/drivers/dri/i965/brw_state_upload.c b/src/mesa/drivers/dri/i965/brw_state_upload.c
> index 84b0861..a2a63f3 100644
> --- a/src/mesa/drivers/dri/i965/brw_state_upload.c
> +++ b/src/mesa/drivers/dri/i965/brw_state_upload.c
> @@ -191,6 +191,8 @@ static const struct brw_tracked_state *gen7_render_atoms[] =
>     &gen6_color_calc_state,	/* must do before cc unit */
>     &gen6_depth_stencil_state,	/* must do before cc unit */
>  
> +   &gen7_hw_binding_tables, /* Enable hw-generated binding tables for Haswell */
> +
>     &gen6_vs_push_constants, /* Before vs_state */
>     &gen6_gs_push_constants, /* Before gs_state */
>     &gen6_wm_push_constants, /* Before wm_surfaces and constant_buffer */
> @@ -267,6 +269,8 @@ static const struct brw_tracked_state *gen8_render_atoms[] =
>     &gen8_blend_state,
>     &gen6_color_calc_state,
>  
> +   &gen7_hw_binding_tables, /* Enable hw-generated binding tables for Broadwell */
> +
>     &gen6_vs_push_constants, /* Before vs_state */
>     &gen6_gs_push_constants, /* Before gs_state */
>     &gen6_wm_push_constants, /* Before wm_surfaces and constant_buffer */
> diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.c b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> index a2a3a95..caeb31b 100644
> --- a/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> +++ b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> @@ -32,6 +32,7 @@
>  #include "intel_buffers.h"
>  #include "intel_fbo.h"
>  #include "brw_context.h"
> +#include "brw_state.h"
>  
>  #include <xf86drm.h>
>  #include <i915_drm.h>
> @@ -379,6 +380,9 @@ _intel_batchbuffer_flush(struct brw_context *brw,
>        drm_intel_bo_wait_rendering(brw->batch.bo);
>     }
>  
> +   if (brw->gen >= 7)
> +      gen7_reset_rs_pool_offsets(brw);
> +

Checking brw->use_resource_streamer would make more sense, since
Ivybridge is Gen7 and it doesn't need this.  Or, just calling it
universally since it's 1 line of harmless code on other platforms.

>     /* Start a new batch buffer. */
>     brw_new_batch(brw);
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: This is a digitally signed message part.
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20150602/08bb5f6f/attachment-0001.sig>


More information about the mesa-dev mailing list