[Mesa-dev] [PATCH 0/7] i965: Stop hanging on Haswell

Jason Ekstrand jason at jlekstrand.net
Tue Jun 13 21:53:20 UTC 2017

As I've been working on converting more things in the GL driver over to
blorp, I've been highly annoyed by all of the hangs on Haswell.  About one
in 3-5 Jenkins runs would hang somewhere.  After looking at about a
half-dozen error states, I noticed that all of the hangs seemed to be on
fast-clear operations (clear or resolve) that happen at the start of a
batch, right after STATE_BASE_ADDRESS.

Haswell seems to be a bit more picky than other hardware about having
fast-clear operations in flight at the same time as regular rendering and
hangs if the two ever overlap.  (Other hardware can get rendering
corruption but not usually hangs.)  Also, Haswell doesn't fully stall if
you just do an RT flush and a CS stall.  The hardware docs refer to
something they call an "end of pipe sync" which is a CS stall with a write
to the workaround BO.  On Haswell, you also need to read from that same
address to create a memory dependency and make sure the system is fully
stalled.
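
To make the sequence concrete, here is a minimal, self-contained sketch of
the ordering described above.  It only models the command order; the enum
values, the cmd_stream type and the emit helpers are hypothetical names
made up for illustration and are not the functions added by this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical command kinds: they model the ordering, not real packets. */
enum cmd_kind {
   CMD_PIPE_CONTROL_CS_STALL_WRITE, /* CS stall + write to the workaround BO */
   CMD_LOAD_REGISTER_MEM,           /* read back from the same BO address */
};

struct cmd {
   enum cmd_kind kind;
   unsigned bo_offset;              /* offset into the workaround BO */
};

/* Hypothetical stand-in for a batch buffer. */
struct cmd_stream {
   struct cmd cmds[8];
   size_t count;
};

static void
emit(struct cmd_stream *s, enum cmd_kind kind, unsigned bo_offset)
{
   assert(s->count < sizeof(s->cmds) / sizeof(s->cmds[0]));
   s->cmds[s->count++] = (struct cmd){ kind, bo_offset };
}

/* Sketch of an end-of-pipe sync: a CS stall with a write to the workaround
 * BO, followed on Haswell by a read from the same address so that there is
 * a memory dependency on the write.
 */
static void
emit_end_of_pipe_sync(struct cmd_stream *s, bool is_haswell)
{
   const unsigned workaround_offset = 0; /* assumed offset for the sketch */

   emit(s, CMD_PIPE_CONTROL_CS_STALL_WRITE, workaround_offset);
   if (is_haswell)
      emit(s, CMD_LOAD_REGISTER_MEM, workaround_offset);
}
```

In this model a Haswell stream ends up with two commands that target the
same offset, while every other platform gets only the stalling write.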

When you call brw_blorp_resolve_color it calls brw_emit_pipe_control_flush
and does the correct flushes and then calls into core blorp to do the
actual resolve operation.  If the batch doesn't have enough space left in
it for the fast-clear operation, the batch will get split and the
fast-clear will happen in the next batch.  I believe what is happening is
that while we're building the second batch that actually contains the
fast-clear, some other process completes a batch and inserts it between our
PIPE_CONTROL to do the stall and the actual fast-clear.  We then end up
with more stuff in flight than we can handle and the GPU explodes.
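
The suspected ordering problem can be written down as a toy model.  This is
only a sketch of the hypothesis above, not of how the kernel or the GPU
actually schedules work; the batch struct, its fields and the helper are
all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one submitted batch, in the order the GPU executes them. */
struct batch {
   int ctx;                 /* context that submitted the batch */
   bool ends_with_stall;    /* last thing in the batch is the flush + stall */
   bool starts_with_clear;  /* fast-clear is the first work in the batch */
   bool sync_before_clear;  /* batch does its own sync before the clear */
};

/* Returns true if, under this model, nothing else can be in flight when the
 * fast-clear at the start of queue[i] runs.  A stall at the end of the
 * previous batch only helps if that batch came from the same context; a
 * sync inside the batch itself always helps.
 */
static bool
fast_clear_is_guarded(const struct batch *queue, size_t i)
{
   assert(queue[i].starts_with_clear);

   if (queue[i].sync_before_clear)
      return true;
   if (i == 0)
      return false; /* nothing known about earlier work; be conservative */

   return queue[i - 1].ctx == queue[i].ctx && queue[i - 1].ends_with_stall;
}
```

With a queue of { ctx 1 ending in a stall, ctx 2, ctx 1 starting with a
fast-clear }, the helper reports the fast-clear as unguarded, which is the
case described above.  Putting the sync in the same batch as the fast-clear
makes the result independent of what was interleaved.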

I'm not 100% convinced of this explanation because it seems a bit fishy
that a context switch wouldn't be enough to fully flush out the GPU.
However, what I do know is that, without these patches, I get a hang in one
out of three to five Jenkins runs on my wip/i965-blorp-ds branch.  With the
patches (or an older variant that did the same thing), I have done almost 20
Jenkins runs and have yet to see a hang.  I'd call that success.

Jason Ekstrand (6):
  i965: Flush around state base address
  i965: Take a uint64_t immediate in emit_pipe_control_write
  i965: Unify the two emit_pipe_control functions
  i965: Do an end-of-pipe sync prior to STATE_BASE_ADDRESS
  i965/blorp: Do an end-of-pipe sync around CCS ops
  i965: Do an end-of-pipe sync after flushes

Topi Pohjolainen (1):
  i965: Add an end-of-pipe sync helper

 src/mesa/drivers/dri/i965/brw_blorp.c        |  16 +-
 src/mesa/drivers/dri/i965/brw_context.h      |   3 +-
 src/mesa/drivers/dri/i965/brw_misc_state.c   |  38 +++++
 src/mesa/drivers/dri/i965/brw_pipe_control.c | 243 ++++++++++++++++++---------
 src/mesa/drivers/dri/i965/brw_queryobj.c     |   5 +-
 src/mesa/drivers/dri/i965/gen6_queryobj.c    |   2 +-
 src/mesa/drivers/dri/i965/genX_blorp_exec.c  |   2 +-
 7 files changed, 211 insertions(+), 98 deletions(-)


