[Mesa-dev] [PATCH resend 3/7] i965: Enable hardware-generated binding tables on render path.

Tue Jun 9 05:01:00 PDT 2015

Thanks a lot Ken! Your help in debugging this pointed me in the right
direction! I had some further observations below:

On 06/09/2015 11:08 AM, Kenneth Graunke wrote:

> 
> (To bring the mailing list up to speed: Abdiel mentioned on IRC tonight
> that this is actually still necessary---some Piglit tests worked with
> the offset removed, but real applications didn't work.  I then noticed
> that even "shader_runner glsl-fs-texture2d.shader_test" breaks when
> hw_bt_start_offset = 0 - even with the workaround Abdiel mentions).
> 
> Abdiel,
> 
> I think I figured out why this is necessary.  In gen[78]_disable_stages,
> we issue 3DSTATE_BINDING_TABLE_POINTERS_HS/DS packets with a "pointer"
> value of 0.
> 
> In the software binding table case, this points to the start of the
> batch buffer, which is harmless because the disabled HS/DS won't read
> any surfaces.
> 
> However, the hardware binding table case is different: upon receiving
> a 3DSTATE_BINDING_TABLE_POINTERS_XS packet, the hardware *writes* the
> current on-die binding table to the given offset.  This is a maximum
> of 256 16-bit surface state pointers.
> 
> My theory is that if we program legitimate binding tables at offset 0,
> they get clobbered when gen7_disable_stages says that the HS/DS binding
> tables should be written to offset 0.
> 
> By starting at an offset of 256 * sizeof(uint16_t), we are essentially
> allocating a "dummy" binding table of maximum size.
> 
> Three things I tried fixed the problem:
> 1. Remove 3DSTATE_BINDING_TABLE_POINTERS_HS/DS from gen7_disable_stages.
> 
>    We never tell the HW to write out HS/DS tables, so the PS table at
>    offset 0 doesn't get clobbered.

I vaguely remember somehow sometime ago that removing the "nullify
binding table pointer" resulted in some random hangs. But I can't
remember which exact test or game I used. So don't take my word for it.

> 
> 2. Change those packets to use offset 16000 (something large).
> 
>    We write out useless HS/DS tables, but to an unused spot in the
>    buffer, so they don't trash anything.
> 
> 3. Move the gen7_disable_stages atom immediately after the
>    gen7_hw_binding_tables atom in the list.
> 
>    Instead of writing VS/PS tables then clobbering them with HS/DS,
>    we reverse the order: write garbage HS/DS tables, then clobber them
>    with the (actually useful) PS table.
> 

> This brought up a question: how does the hardware know how large of a
> table to write?  Does it always write out all 256 entries?
> 
> It certainly seems to, as far as I can tell.  But that would mean that
> when increasing brw->hw_bt_pool.next_offset, we always need to add
> 256 * sizeof(uint16_t), even if the table only has a few useful entries.
> I'm a bit confused, because we're not doing that today...so shouldn't
> something have broken?

I did further digging and found out that the answer is both yes and no.
Yes because the hardware is actually able to write to the buffer even
without specifying the full 256 entry size. I confirmed this by mapping
out the contents of the valid binding table pool after every flush and
the entries are correct as long as the offsets used are aligned to
64-bytes. It also doesn't overwrite the entries written starting from
the next 3DSTATE_BINDING_TABLE_POINTERS_XS offset.

And no, because for some reason, garbage hardware binding table pointers
(written by gen7_disable in this case) seem to require a full 256-entry
allocation when stating from offset zero. If another
3DSTATE_BINDING_TABLE_POINTERS_XS offset touches an offset within that
"garbage" range then the hardware gets angry. This is the reason why
anything less than 256 * sizeof(uint16_t) didn't work earlier.

To summarize, this is what currently happened when we used offset zero
with the current code:

btp:    valid  -> garbage  ->  valid
offset: 0      -> 0        ->  64

Notice that the hangs here are caused because the valid binding table
pointer is overwritten with the garbage binding table pointer. This is
same conclusion you mentioned earlier. Also, the other "valid" binding
table offset is less than 256 * sizeof(uint16_t).

Because of the same reason, using a start offset for the valid binding
table pointer which is less than 256 * sizeof(uint16_t) hangs too:

btp:    valid   -> garbage  ->  valid
offset: 128     -> 0        ->  128 + 64

However doing this instead works:

btp:    valid -> garbage -> valid
offset: 0     -> 64      -> 64

It's kinda like step 3 you mentioned above except we don't reverse the
order of the atoms but specify brw->hw_bt_pool.next_offset to the
3DSTATE_BINDING_TABLE_POINTERS_HS/DS offsets in gen7_disable instead of
zero. We don't have to allocate space for the garbage hardware binding
table because it gets overwritten with the valid pointer anyway. This
approach worked perfectly as well.

-Abdiel