[Mesa-dev] [PATCH resend 3/7] i965: Enable hardware-generated binding tables on render path.
Abdiel Janulgue
abdiel.janulgue at linux.intel.com
Thu Jun 4 01:38:52 PDT 2015
On 06/02/2015 10:54 AM, Kenneth Graunke wrote:
> On Monday, June 01, 2015 03:14:26 PM Abdiel Janulgue wrote:
>> This patch implements the binding table enable command which is also
>> used to allocate a binding table pool where where hardware-generated
>> binding table entries are flushed into. Each binding table offset in
>> the binding table pool is unique per each shader stage that are
>> enabled within a batch.
>>
>> Also insert the required brw_tracked_state objects to enable
>> hw-generated binding tables in normal render path.
>>
>> v2: Clarify start of binding table pool offsets. (Topi)
>>
>> Signed-off-by: Abdiel Janulgue <abdiel.janulgue at linux.intel.com>
>> ---
>> src/mesa/drivers/dri/i965/brw_binding_tables.c | 101 +++++++++++++++++++++++++
>> src/mesa/drivers/dri/i965/brw_context.c | 4 +
>> src/mesa/drivers/dri/i965/brw_context.h | 6 ++
>> src/mesa/drivers/dri/i965/brw_state.h | 6 ++
>> src/mesa/drivers/dri/i965/brw_state_upload.c | 4 +
>> src/mesa/drivers/dri/i965/intel_batchbuffer.c | 4 +
>> 6 files changed, 125 insertions(+)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_binding_tables.c b/src/mesa/drivers/dri/i965/brw_binding_tables.c
>> index 98ff0dd..7554818 100644
>> --- a/src/mesa/drivers/dri/i965/brw_binding_tables.c
>> +++ b/src/mesa/drivers/dri/i965/brw_binding_tables.c
>> @@ -45,6 +45,23 @@
>> #include "intel_batchbuffer.h"
>>
>> /**
>> + * We are required to start at this offset for binding table pointer state when
>> + * HW-generated binding table is enabled otherwise the GPU will hung. Note that
>
> typo: "will hang" (not hung)
>
>> + * the binding table offsets are now relative to the binding tabe pool base
>
> typo: "table pool base" (not tabe pool)
>
>> + * address instead of from the state batch.
>> + *
>> + * From the Bspec 3DSTATE_BINDING_TABLE_POINTERS_{PS/VS/GS/DS/HS} > Pointer to
>> + * PS Binding Table section lists the format as:
>> + *
>> + * "SurfaceStateOffset[16:6]BINDING_TABLE_STATE*256 When
>> + * HW-generated binding table is enabled"
>> + *
>> + * When HW-generated binding tables are enabled, Surface State Offsets are
>> + * 16-bit entries.
>> + */
>> +static const uint32_t hw_bt_start_offset = 256 * sizeof(uint16_t);
>
> I'm confused. I don't see how this citation is relevant. Binding
> tables are always 256 entries, regardless of the resource streamer.
>
> The text you cited is just saying that entries need to be 64 byte
> aligned when using hardware binding tables, rather than the 32 byte
> alignment that's sufficient when using software binding tables.
>
> It does appear that we gain an extra bit - bit 16 used to MBZ, but
> now it's valid.
>
> You enforce the 64-byte alignment restriction in patch 5, via this line:
>
> brw->hw_bt_pool.next_offset += ALIGN(prog_data->binding_table.size_bytes,
> 64);
>
> At any rate, the text doesn't say *anything* about having to add
> something to the starting address. Offset 0 seems perfectly valid.
>
> Is this really necessary?
>
I'm the one being funny here. After looking harder and doing some
archaeological digging through my previous RS enabling efforts, I came to
the conclusion that you are right. The reason the hardware *seems* to
enforce this arbitrary offset is that I had skipped the "disable RS on
state base address update" workaround. Now that I've reintroduced it, the
hardware works completely fine even from offset zero. I'll update the
code in v2.
>> +
>> +/**
>> * Upload a shader stage's binding table as indirect state.
>> *
>> * This copies brw_stage_state::surf_offset[] into the indirect state section
>> @@ -170,6 +187,90 @@ const struct brw_tracked_state brw_gs_binding_table = {
>> .emit = brw_gs_upload_binding_table,
>> };
>>
>> +/**
>> + * Hardware-generated binding tables for the resource streamer
>> + */
>> +void
>> +gen7_disable_hw_binding_tables(struct brw_context *brw)
>> +{
>> + int pkt_len = brw->gen >= 8 ? 4 : 3;
>> +
>> + BEGIN_BATCH(3);
>> + OUT_BATCH(_3DSTATE_BINDING_TABLE_POOL_ALLOC << 16 | (pkt_len - 2));
>> + OUT_BATCH(brw->is_haswell ? HSW_HW_BINDING_TABLE_RESERVED : 0);
>> + OUT_BATCH(0);
>> + ADVANCE_BATCH();
>> +
>> + /* From the BSpec, 3D Pipeline > Resource Streamer > Hardware Binding
>> + * Tables > Programming note
>> +
>> + * "When switching between HW and SW binding table generation, SW must
>> + * issue a state cache invalidate."
>> + */
>
> Please cite the public documentation where possible.
>
> /* From the Haswell PRM, Volume 7: 3D Media GPGPU,
> * 3DSTATE_BINDING_TABLE_POOL_ALLOC > Programming Note:
> *
> * "When switching between HW and SW binding table generation, SW must
> * issue a state cache invalidate."
> */
>
>> + brw_emit_pipe_control_flush(brw, PIPE_CONTROL_STATE_CACHE_INVALIDATE);
>
> Is it worth throwing out hw_bt_pool.bo?
>
Probably not. I was hoping to keep that bo re-usable so we don't have to
re-allocate after every submit. As you mentioned below, I guess we don't
have to worry about mucking up the offsets since only the GPU writes to
the pool.
>> +}
>> +
>> +void
>> +gen7_enable_hw_binding_tables(struct brw_context *brw)
>> +{
>> + if (!brw->has_resource_streamer) {
>> + gen7_disable_hw_binding_tables(brw);
>> + return;
>
> kind of funny having a function called gen7_enable_hw_binding_tables
> that disables hardware binding tables, but...hey :)
>
>> + }
>> +
>> + if (!brw->hw_bt_pool.bo) {
>> + /* We use a single re-usable buffer object for the lifetime of the
>> + * context and size it to maximum allowed binding tables that can be
>> + * programmed per batch:
>> + *
>> + * BSpec, 3D Pipeline > Resource Streamer > Hardware Binding Tables:
>> + * "A maximum of 16,383 Binding tables are allowed in any batch buffer"
>> + */
>
> /* From the Haswell PRM, Volume 7: 3D Media GPGPU,
> * 3DSTATE_BINDING_TABLE_POOL_ALLOC > Programming Note:
> *
> * "A maximum of 16,383 Binding Tables are allowed in any batch buffer."
> */
>
>> + static const int max_size = 16383 * 4;
>
> While I believe this does enforce the restriction, it's a bit strange.
>
> My reading of the restriction is that you can only have 16,383 *binding
> tables*, each of which could have 1-256 entries. So the maximum
> possible value would be 16383 * 256 * 4. But, you'd have to count the
> number of binding tables you emit...which seems like a pain.
>
> This enforces that no more than 16,383 binding table *entries* can be
> put in the buffer, which seems OK. It also means that a batch buffer
> can't refer to more than that many...
>
> That said, I think you've got a nasty bug here: there is no code to
> detect when you've filled up the buffer (unless I've missed something?).
> Presumably, the batch buffer usually fills up, triggering an implicit
> intel_batchbuffer_flush that calls gen7_reset_rs_pool_offsets() and
> saves you from buffer overflow.
>
> However, if the batch buffer were larger, or we repeatedly drew using
> shaders that have large binding tables and relatively few other state
> changes, I imagine it could fill up and access out of bounds memory,
> crashing the GPU. Best to guard against this explicitly.
>
>> + brw->hw_bt_pool.bo = drm_intel_bo_alloc(brw->bufmgr, "hw_bt",
>> + max_size, 64);
>> + brw->hw_bt_pool.next_offset = hw_bt_start_offset;
>> + }
>> +
>> + int pkt_len = brw->gen >= 8 ? 4 : 3;
>> + uint32_t dw1 = BRW_HW_BINDING_TABLE_ENABLE;
>> + if (brw->is_haswell)
>> + dw1 |= SET_FIELD(GEN7_MOCS_L3, GEN7_HW_BT_MOCS) |
>> + HSW_HW_BINDING_TABLE_RESERVED;
>
> We should set MOCS on Broadwell too. (Note that it has a different bit
> location...)
>
>> +
>> + BEGIN_BATCH(3);
>> + OUT_BATCH(_3DSTATE_BINDING_TABLE_POOL_ALLOC << 16 | (pkt_len - 2));
>> + if (brw->gen >= 8)
>> + OUT_RELOC64(brw->hw_bt_pool.bo, I915_GEM_DOMAIN_SAMPLER, 0, dw1);
>> + else
>> + OUT_RELOC(brw->hw_bt_pool.bo, I915_GEM_DOMAIN_SAMPLER, 0, dw1);
>> + OUT_RELOC(brw->hw_bt_pool.bo, I915_GEM_DOMAIN_SAMPLER, 0,
>> + brw->hw_bt_pool.bo->size);
>
> This is wrong for Broadwell. On Broadwell, you simply program the
> buffer size, not a relocation to the maximum address in the buffer.
>
> i.e. it should be
>
> if (brw->gen >= 8) {
> OUT_RELOC64(brw->hw_bt_pool.bo, I915_GEM_DOMAIN_SAMPLER, 0, dw1);
> OUT_BATCH(brw->hw_bt_pool.bo->size);
> } else {
> OUT_RELOC(brw->hw_bt_pool.bo, I915_GEM_DOMAIN_SAMPLER, 0, dw1);
> OUT_RELOC(brw->hw_bt_pool.bo, I915_GEM_DOMAIN_SAMPLER, 0,
> brw->hw_bt_pool.bo->size);
> }
>
>> + ADVANCE_BATCH();
>> +
>> + /* From the BSpec, 3D Pipeline > Resource Streamer > Hardware Binding
>> + * Tables > Programming note
>> +
>> + * "When switching between HW and SW binding table generation, SW must
>> + * issue a state cache invalidate."
>> + */
>> + brw_emit_pipe_control_flush(brw, PIPE_CONTROL_STATE_CACHE_INVALIDATE);
>> +}
>> +
>> +void
>> +gen7_reset_rs_pool_offsets(struct brw_context *brw)
>
> Let's call this hw_bt_pool. The resource streamer is related, but this
> is about hardware binding tables.
>
>> +{
>> + brw->hw_bt_pool.next_offset = hw_bt_start_offset;
>> +}
>
> I was concerned about trashing data here - if we flush the batch and
> reset the offsets, but don't allocate a new buffer, the next batch will
> happily scribble over our old data. But I suppose this is OK: because
> the GPU writes the data (via 3DSTATE_BINDING_TABLE_POINTERS_XS?), not
> the CPU, it should be pipelined, so the old data will no longer be in
> use. So it should be safe.
Yep, the CPU does not write to the binding table pool at all.
>
>> +
>> +const struct brw_tracked_state gen7_hw_binding_tables = {
>> + .dirty = {
>> + .mesa = 0,
>> + .brw = BRW_NEW_BATCH,
>> + },
>> + .emit = gen7_enable_hw_binding_tables
>> +};
>> +
>> /** @} */
>>
>> /**
>> diff --git a/src/mesa/drivers/dri/i965/brw_context.c b/src/mesa/drivers/dri/i965/brw_context.c
>> index 274a237..e17046a 100644
>> --- a/src/mesa/drivers/dri/i965/brw_context.c
>> +++ b/src/mesa/drivers/dri/i965/brw_context.c
>> @@ -958,6 +958,10 @@ intelDestroyContext(__DRIcontext * driContextPriv)
>> if (brw->wm.base.scratch_bo)
>> drm_intel_bo_unreference(brw->wm.base.scratch_bo);
>>
>> + gen7_reset_rs_pool_offsets(brw);
>
> Not sure if this is worth doing. Either way I guess.
>
>> + drm_intel_bo_unreference(brw->hw_bt_pool.bo);
>> + brw->hw_bt_pool.bo = NULL;
>> +
>> drm_intel_gem_context_destroy(brw->hw_ctx);
>>
>> if (ctx->swrast_context) {
>> diff --git a/src/mesa/drivers/dri/i965/brw_context.h b/src/mesa/drivers/dri/i965/brw_context.h
>> index 3f8e59d..94127b6 100644
>> --- a/src/mesa/drivers/dri/i965/brw_context.h
>> +++ b/src/mesa/drivers/dri/i965/brw_context.h
>> @@ -1404,6 +1404,12 @@ struct brw_context
>> struct brw_cs_prog_data *prog_data;
>> } cs;
>>
>> + /* RS hardware binding table */
>> + struct {
>> + drm_intel_bo *bo;
>> + uint32_t next_offset;
>> + } hw_bt_pool;
>> +
>> struct {
>> uint32_t state_offset;
>> uint32_t blend_state_offset;
>> diff --git a/src/mesa/drivers/dri/i965/brw_state.h b/src/mesa/drivers/dri/i965/brw_state.h
>> index 987672f..622ce3f 100644
>> --- a/src/mesa/drivers/dri/i965/brw_state.h
>> +++ b/src/mesa/drivers/dri/i965/brw_state.h
>> @@ -132,6 +132,7 @@ extern const struct brw_tracked_state gen7_sol_state;
>> extern const struct brw_tracked_state gen7_urb;
>> extern const struct brw_tracked_state gen7_vs_state;
>> extern const struct brw_tracked_state gen7_wm_state;
>> +extern const struct brw_tracked_state gen7_hw_binding_tables;
>> extern const struct brw_tracked_state haswell_cut_index;
>> extern const struct brw_tracked_state gen8_blend_state;
>> extern const struct brw_tracked_state gen8_disable_stages;
>> @@ -372,6 +373,11 @@ gen7_upload_constant_state(struct brw_context *brw,
>> const struct brw_stage_state *stage_state,
>> bool active, unsigned opcode);
>>
>> +void gen7_rs_control(struct brw_context *brw, int enable);
>> +void gen7_enable_hw_binding_tables(struct brw_context *brw);
>> +void gen7_disable_hw_binding_tables(struct brw_context *brw);
>> +void gen7_reset_rs_pool_offsets(struct brw_context *brw);
>> +
>> #ifdef __cplusplus
>> }
>> #endif
>> diff --git a/src/mesa/drivers/dri/i965/brw_state_upload.c b/src/mesa/drivers/dri/i965/brw_state_upload.c
>> index 84b0861..a2a63f3 100644
>> --- a/src/mesa/drivers/dri/i965/brw_state_upload.c
>> +++ b/src/mesa/drivers/dri/i965/brw_state_upload.c
>> @@ -191,6 +191,8 @@ static const struct brw_tracked_state *gen7_render_atoms[] =
>> &gen6_color_calc_state, /* must do before cc unit */
>> &gen6_depth_stencil_state, /* must do before cc unit */
>>
>> + &gen7_hw_binding_tables, /* Enable hw-generated binding tables for Haswell */
>> +
>> &gen6_vs_push_constants, /* Before vs_state */
>> &gen6_gs_push_constants, /* Before gs_state */
>> &gen6_wm_push_constants, /* Before wm_surfaces and constant_buffer */
>> @@ -267,6 +269,8 @@ static const struct brw_tracked_state *gen8_render_atoms[] =
>> &gen8_blend_state,
>> &gen6_color_calc_state,
>>
>> + &gen7_hw_binding_tables, /* Enable hw-generated binding tables for Broadwell */
>> +
>> &gen6_vs_push_constants, /* Before vs_state */
>> &gen6_gs_push_constants, /* Before gs_state */
>> &gen6_wm_push_constants, /* Before wm_surfaces and constant_buffer */
>> diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.c b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
>> index a2a3a95..caeb31b 100644
>> --- a/src/mesa/drivers/dri/i965/intel_batchbuffer.c
>> +++ b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
>> @@ -32,6 +32,7 @@
>> #include "intel_buffers.h"
>> #include "intel_fbo.h"
>> #include "brw_context.h"
>> +#include "brw_state.h"
>>
>> #include <xf86drm.h>
>> #include <i915_drm.h>
>> @@ -379,6 +380,9 @@ _intel_batchbuffer_flush(struct brw_context *brw,
>> drm_intel_bo_wait_rendering(brw->batch.bo);
>> }
>>
>> + if (brw->gen >= 7)
>> + gen7_reset_rs_pool_offsets(brw);
>> +
>
> Checking brw->use_resource_streamer would make more sense, since
> Ivybridge is Gen7 and it doesn't need this. Or, just calling it
> universally since it's 1 line of harmless code on other platforms.
>
I'll put this in.
>> /* Start a new batch buffer. */
>> brw_new_batch(brw);