[Mesa-dev] [PATCH 5/6] i965: Avoid problems from referencing orphaned BOs after growing.
Kenneth Graunke
kenneth at whitecape.org
Wed Nov 29 02:50:24 UTC 2017
On Tuesday, November 28, 2017 6:34:20 PM PST Ian Romanick wrote:
> On 11/28/2017 04:13 PM, Kenneth Graunke wrote:
> > diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.c b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> > index b30ffe5d7f6..12d165d7236 100644
> > --- a/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> > +++ b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> > @@ -183,6 +183,8 @@ intel_batchbuffer_reset(struct brw_context *brw)
> > batch->last_bo = batch->batch.bo;
> >
> > batch->batch.bo = brw_bo_alloc(bufmgr, "batchbuffer", BATCH_SZ, 4096);
> > + batch->batch.partial_bo = 0;
>
> Use 0 or NULL in both places (*.partial_bo in this hunk and the next).
Oops, meant to use NULL. Fixed locally, thanks.
> > + /* Exchange the two BOs...without breaking pointers to the old BO.
> > + *
> > + * Consider this scenario:
> > + *
> > + * 1. Somebody calls brw_state_batch() to get a region of memory, and
> > + * and then creates a brw_address pointing to brw->batch.state.bo.
> > + * 2. They then call brw_state_batch() a second time, which happens to
> > + * grow and replace the state buffer. They then try to emit a
> > + * relocation to their first section of memory.
> > + *
> > + * If we replace the brw->batch.state.bo pointer at step 2, we would
> > + * break the address created in step 1. They'd have a pointer to the
> > + * old destroyed BO. Emitting a relocation would add this dead BO to
> > + * the validation list...causing /both/ statebuffers to be in the list,
> > + * and all kinds of disasters.
> > + *
> > + * This is not a contrived case - BLORP vertex data upload hits this.
>
> Is this the source of the DiRT GPU hangs?
I'm unclear on that. I tracked this down by hacking brw_state_batch to
"grow" the buffer on every call, so that we hit this path all the time.
I then used glsl-algebraic-add-zero and glxgears as test cases, and
found the BLORP vertex buffer problem there. I worked around that
problem and ran into others, though, that I didn't track down.
That's why I went this route of "make it work" instead of trying to
adjust every caller to do a safe thing. It's hard to find them all
and ensure they follow a non-obvious set of ordering constraints.
When I'd fixed the problem, it fixed DiRT. But I'm not sure which
exact piece of state was hitting the issue.
> > + *
> > + * There are worse scenarios too. Fences for GL sync objects reference
> > + * brw->batch.batch.bo. If we replaced the batch pointer when growing,
> > + * we'd need to chase down every fence and update it to point to the
> > + * new BO. Otherwise, it would refer to a "batch" that never actually
> > + * gets submitted, and would fail to trigger.
>
> The BLORP case seems really, really hard to test in a reliable way.
> This, however, seems more testable. Do you think a piglit could be
> created that could trigger this?
Possibly...but it's a bit tricky to cause the batch to grow. You have
to issue a draw call that emits a lot of commands at just the right
moment (near the end of the batch). State (like lots of textures)
doesn't help either...you need commands.
That's sort of why I added the debugging code in patch 6 - by hacking
the driver to go through the motions of growing on every batch, we turn
every program into a stress test. Hacking the driver isn't ideal
though...
Overflowing the state buffer is easier, we could probably do that with
a test that draws using a maximal number of surfaces. I think I've seen
the CTS hit this already, depending how exactly it was sharded.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part.
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20171128/2b09dc6a/attachment.sig>
More information about the mesa-dev
mailing list