[Mesa-dev] [PATCH 0/7] i965: Stop hanging on Haswell

Jason Ekstrand jason at jlekstrand.net
Thu Jun 15 15:58:13 UTC 2017


On Thu, Jun 15, 2017 at 4:15 AM, Chris Wilson <chris at chris-wilson.co.uk> wrote:

> Quoting Kenneth Graunke (2017-06-14 21:44:45)
> > On Tuesday, June 13, 2017 2:53:20 PM PDT Jason Ekstrand wrote:
> > > As I've been working on converting more things in the GL driver over to
> > > blorp, I've been highly annoyed by all of the hangs on Haswell.  About one
> > > in 3-5 Jenkins runs would hang somewhere.  After looking at about a
> > > half-dozen error states, I noticed that all of the hangs seemed to be on
> > > fast-clear operations (clear or resolve) that happen at the start of a
> > > batch, right after STATE_BASE_ADDRESS.
> > >
> > > Haswell seems to be a bit more picky than other hardware about having
> > > fast-clear operations in flight at the same time as regular rendering and
> > > hangs if the two ever overlap.  (Other hardware can get rendering
> > > corruption but not usually hangs.)  Also, Haswell doesn't fully stall if
> > > you just do a RT flush and a CS stall.  The hardware docs refer to
> > > something they call an "end of pipe sync" which is a CS stall with a write
> > > to the workaround BO.  On Haswell, you also need to read from that same
> > > address to create a memory dependency and make sure the system is fully
> > > stalled.
> > >
> > > When you call brw_blorp_resolve_color it calls brw_emit_pipe_control_flush
> > > and does the correct flushes and then calls into core blorp to do the
> > > actual resolve operation.  If the batch doesn't have enough space left in
> > > it for the fast-clear operation, the batch will get split and the
> > > fast-clear will happen in the next batch.  I believe what is happening is
> > > that while we're building the second batch that actually contains the
> > > fast-clear, some other process completes a batch and inserts it between our
> > > PIPE_CONTROL to do the stall and the actual fast-clear.  We then end up
> > > with more stuff in flight than we can handle and the GPU explodes.
> > >
> > > I'm not 100% convinced of this explanation because it seems a bit fishy
> > > that a context switch wouldn't be enough to fully flush out the GPU.
> > > However, what I do know is that, without these patches I get a hang in one
> > > out of three to five Jenkins runs on my wip/i965-blorp-ds branch.  With the
> > > patches (or an older variant that did the same thing), I have done almost 20
> > > Jenkins runs and have yet to see a hang.  I'd call that success.
> > >
> > > Jason Ekstrand (6):
> > >   i965: Flush around state base address
> > >   i965: Take a uint64_t immediate in emit_pipe_control_write
> > >   i965: Unify the two emit_pipe_control functions
> > >   i965: Do an end-of-pipe sync prior to STATE_BASE_ADDRESS
> > >   i965/blorp: Do an end-of-pipe sync around CCS ops
> > >   i965: Do an end-of-pipe sync after flushes
> > >
> > > Topi Pohjolainen (1):
> > >   i965: Add an end-of-pipe sync helper
> > >
> > >  src/mesa/drivers/dri/i965/brw_blorp.c        |  16 +-
> > >  src/mesa/drivers/dri/i965/brw_context.h      |   3 +-
> > >  src/mesa/drivers/dri/i965/brw_misc_state.c   |  38 +++++
> > >  src/mesa/drivers/dri/i965/brw_pipe_control.c | 243 ++++++++++++++++++---------
> > >  src/mesa/drivers/dri/i965/brw_queryobj.c     |   5 +-
> > >  src/mesa/drivers/dri/i965/gen6_queryobj.c    |   2 +-
> > >  src/mesa/drivers/dri/i965/genX_blorp_exec.c  |   2 +-
> > >  7 files changed, 211 insertions(+), 98 deletions(-)
> > >
> > >
> >
> > The series is:
> > Reviewed-by: Kenneth Graunke <kenneth at whitecape.org>
> >
> > If Chris is right, and what we're really seeing is that MI_SET_CONTEXT
> > needs additional flushing, it probably makes sense to fix the kernel.
> > If it's really fast clear related, then we should do it in Mesa.
>
> If I'm right, it's more of a userspace problem because you have to
> insert a pipeline stall before STATE_BASE_ADDRESS when switching between
> blorp/normal and back again, in the same batch. That the MI_SET_CONTEXT
> may be restoring the dirty GPU state from the previous batch just means that
> you have to think of batches as being one long continuous batch.
> -Chris
>

Given that, I doubt your explanation is correct.  Right now, we should
already be correct under the "long continuous batch" assumption and we're
still hanging.  So I think that either MI_SET_CONTEXT doesn't stall hard
enough or we're conflicting with another process somehow.
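
For anyone following along, here is a rough sketch of the "end of pipe sync"
ordering described in the cover letter: a CS stall with a write to the
workaround BO, followed on Haswell by a read from the same address.  It is a
standalone illustration, not code from the series.  Every sketch_* name, the
flag values, and the "load register from memory" command standing in for the
read-back are assumptions made up for the example; it only records commands
into a mock batch so the ordering can be inspected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command kinds; these are NOT real i965 packet encodings. */
enum sketch_cmd_kind {
   SKETCH_PIPE_CONTROL_WRITE, /* PIPE_CONTROL with a post-sync write */
   SKETCH_LOAD_REGISTER_MEM,  /* read memory back into a scratch register */
};

/* Hypothetical flag bits, chosen only for this sketch. */
#define SKETCH_FLAG_CS_STALL        (1u << 0)
#define SKETCH_FLAG_WRITE_IMMEDIATE (1u << 1)
#define SKETCH_FLAG_RT_FLUSH        (1u << 2)

struct sketch_cmd {
   enum sketch_cmd_kind kind;
   uint32_t flags;
   uint64_t address;
};

#define SKETCH_MAX_CMDS 16

/* Mock batch: a fixed-size list of recorded commands. */
struct sketch_batch {
   struct sketch_cmd cmds[SKETCH_MAX_CMDS];
   size_t count;
};

/* Append one command; returns false if the mock batch is full. */
static bool
sketch_emit(struct sketch_batch *batch, struct sketch_cmd cmd)
{
   assert(batch != NULL);
   if (batch->count >= SKETCH_MAX_CMDS)
      return false;
   batch->cmds[batch->count++] = cmd;
   return true;
}

/* Record an end-of-pipe sync: the caller's flushes plus a CS stall and a
 * write to the workaround address.  On Haswell, also read the same address
 * back so there is a memory dependency on the write. */
static bool
sketch_end_of_pipe_sync(struct sketch_batch *batch, uint32_t flush_flags,
                        uint64_t workaround_addr, bool is_haswell)
{
   struct sketch_cmd stall = {
      .kind = SKETCH_PIPE_CONTROL_WRITE,
      .flags = flush_flags | SKETCH_FLAG_CS_STALL |
               SKETCH_FLAG_WRITE_IMMEDIATE,
      .address = workaround_addr,
   };
   if (!sketch_emit(batch, stall))
      return false;

   if (is_haswell) {
      struct sketch_cmd readback = {
         .kind = SKETCH_LOAD_REGISTER_MEM,
         .flags = 0,
         .address = workaround_addr,
      };
      if (!sketch_emit(batch, readback))
         return false;
   }
   return true;
}
```

Calling sketch_end_of_pipe_sync() with is_haswell set records two commands
against the same address (the stalling write, then the read-back); without it
only the stalling write is recorded.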