[Mesa-dev] [PATCH 5/6] i965: Avoid problems from referencing orphaned BOs after growing.

Wed Nov 29 02:50:24 UTC 2017

On Tuesday, November 28, 2017 6:34:20 PM PST Ian Romanick wrote:
> On 11/28/2017 04:13 PM, Kenneth Graunke wrote:
> > diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.c b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> > index b30ffe5d7f6..12d165d7236 100644
> > --- a/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> > +++ b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> > @@ -183,6 +183,8 @@ intel_batchbuffer_reset(struct brw_context *brw)
> >     batch->last_bo = batch->batch.bo;
> >  
> >     batch->batch.bo = brw_bo_alloc(bufmgr, "batchbuffer", BATCH_SZ, 4096);
> > +   batch->batch.partial_bo = 0;
> 
> Use 0 or NULL in both places (*.partial_bo in this hunk and the next).

Oops, meant to use NULL.  Fixed locally, thanks.

> > +   /* Exchange the two BOs...without breaking pointers to the old BO.
> > +    *
> > +    * Consider this scenario:
> > +    *
> > +    * 1. Somebody calls brw_state_batch() to get a region of memory, and
> > +    *    and then creates a brw_address pointing to brw->batch.state.bo.
> > +    * 2. They then call brw_state_batch() a second time, which happens to
> > +    *    grow and replace the state buffer.  They then try to emit a
> > +    *    relocation to their first section of memory.
> > +    *
> > +    * If we replace the brw->batch.state.bo pointer at step 2, we would
> > +    * break the address created in step 1.  They'd have a pointer to the
> > +    * old destroyed BO.  Emitting a relocation would add this dead BO to
> > +    * the validation list...causing /both/ statebuffers to be in the list,
> > +    * and all kinds of disasters.
> > +    *
> > +    * This is not a contrived case - BLORP vertex data upload hits this.
> 
> Is this the source of the DiRT GPU hangs?

I'm unclear on that.  I tracked this down by hacking brw_state_batch to
"grow" the buffer on every call, so that we hit this path all the time.
I then used glsl-algebraic-add-zero and glxgears as test cases, and
found the BLORP vertex buffer problem there.  I worked around that
problem and ran into others, though, that I didn't track down.

That's why I went this route of "make it work" instead of trying to
adjust every caller to do a safe thing.  It's hard to find them all
and ensure they follow a non-obvious set of ordering constraints.

When I'd fixed the problem, it fixed DiRT.  But I'm not sure which
exact piece of state was hitting the issue.

> > +    *
> > +    * There are worse scenarios too.  Fences for GL sync objects reference
> > +    * brw->batch.batch.bo.  If we replaced the batch pointer when growing,
> > +    * we'd need to chase down every fence and update it to point to the
> > +    * new BO.  Otherwise, it would refer to a "batch" that never actually
> > +    * gets submitted, and would fail to trigger.
> 
> The BLORP case seems really, really hard to test in a reliable way.
> This, however, seems more testable.  Do you think a piglit could be
> created that could trigger this?

Possibly...but it's a bit tricky to cause the batch to grow.  You have
to issue a draw call that emits a lot of commands at just the right
moment (near the end of the batch).  State (like lots of textures)
doesn't help either...you need commands.

That's sort of why I added the debugging code in patch 6 - by hacking
the driver to go through the motions of growing on every batch, we turn
every program into a stress test.  Hacking the driver isn't ideal
though...

Overflowing the state buffer is easier, we could probably do that with
a test that draws using a maximal number of surfaces.  I think I've seen
the CTS hit this already, depending how exactly it was sharded.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part.
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20171128/2b09dc6a/attachment.sig>