[Mesa-dev] [PATCH 0/7] i965: Stop hanging on Haswell

Kenneth Graunke kenneth at whitecape.org
Wed Jun 14 20:44:45 UTC 2017


On Tuesday, June 13, 2017 2:53:20 PM PDT Jason Ekstrand wrote:
> As I've been working on converting more things in the GL driver over to
> blorp, I've been highly annoyed by all of the hangs on Haswell.  About one
> in 3-5 Jenkins runs would hang somewhere.  After looking at about a
> half-dozen error states, I noticed that all of the hangs seemed to be on
> fast-clear operations (clear or resolve) that happen at the start of a
> batch, right after STATE_BASE_ADDRESS.
> 
> Haswell seems to be a bit more picky than other hardware about having
> fast-clear operations in flight at the same time as regular rendering and
> hangs if the two ever overlap.  (Other hardware can get rendering
> corruption but not usually hangs.)  Also, Haswell doesn't fully stall if
> you just do an RT flush and a CS stall.  The hardware docs refer to
> something they call an "end of pipe sync" which is a CS stall with a write
> to the workaround BO.  On Haswell, you also need to read from that same
> address to create a memory dependency and make sure the system is fully
> stalled.
> 
> When you call brw_blorp_resolve_color it calls brw_emit_pipe_control_flush
> and does the correct flushes and then calls into core blorp to do the
> actual resolve operation.  If the batch doesn't have enough space left in
> it for the fast-clear operation, the batch will get split and the
> fast-clear will happen in the next batch.  I believe what is happening is
> that while we're building the second batch that actually contains the
> fast-clear, some other process completes a batch and inserts it between our
> PIPE_CONTROL to do the stall and the actual fast-clear.  We then end up
> with more stuff in flight than we can handle and the GPU explodes.
> 
> I'm not 100% convinced of this explanation because it seems a bit fishy
> that a context switch wouldn't be enough to fully flush out the GPU.
> However, what I do know is that, without these patches I get a hang in one
> out of three to five Jenkins runs on my wip/i965-blorp-ds branch.  With the
> patches (or an older variant that did the same thing), I have done almost 20
> Jenkins runs and have yet to see a hang.  I'd call that success.
> 
> Jason Ekstrand (6):
>   i965: Flush around state base address
>   i965: Take a uint64_t immediate in emit_pipe_control_write
>   i965: Unify the two emit_pipe_control functions
>   i965: Do an end-of-pipe sync prior to STATE_BASE_ADDRESS
>   i965/blorp: Do an end-of-pipe sync around CCS ops
>   i965: Do an end-of-pipe sync after flushes
> 
> Topi Pohjolainen (1):
>   i965: Add an end-of-pipe sync helper
> 
>  src/mesa/drivers/dri/i965/brw_blorp.c        |  16 +-
>  src/mesa/drivers/dri/i965/brw_context.h      |   3 +-
>  src/mesa/drivers/dri/i965/brw_misc_state.c   |  38 +++++
>  src/mesa/drivers/dri/i965/brw_pipe_control.c | 243 ++++++++++++++++++---------
>  src/mesa/drivers/dri/i965/brw_queryobj.c     |   5 +-
>  src/mesa/drivers/dri/i965/gen6_queryobj.c    |   2 +-
>  src/mesa/drivers/dri/i965/genX_blorp_exec.c  |   2 +-
>  7 files changed, 211 insertions(+), 98 deletions(-)
> 
> 

The series is:
Reviewed-by: Kenneth Graunke <kenneth at whitecape.org>

If Chris is right, and what we're really seeing is that MI_SET_CONTEXT
needs additional flushing, it probably makes sense to fix the kernel.
If it's really fast clear related, then we should do it in Mesa.

I'm not sure we'll ever be able to properly determine that.

Even if we go the kernel route, we should land patches 1-3.
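
For readers following along, here is a minimal sketch of the "end of pipe
sync" ordering described in the cover letter: a CS stall with a write to a
workaround address, followed on Haswell by a read from that same address.
This is not the code from the series. The types and names (struct batch,
batch_emit, emit_end_of_pipe_sync, the CMD_* kinds) are hypothetical and
only record the order of commands in a mock batch. Checking for space up
front is an assumption of this sketch, made so the write and the read
cannot land in different batches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command kinds for illustration; not Mesa or hardware names. */
enum cmd_kind {
    CMD_CS_STALL_WITH_WRITE, /* PIPE_CONTROL: CS stall plus a post-sync write */
    CMD_READ_BACK            /* read from the address that was just written */
};

struct cmd {
    enum cmd_kind kind;
    uint64_t address;
};

#define BATCH_MAX_CMDS 16

/* Mock batch buffer: a fixed-size list of recorded commands. */
struct batch {
    struct cmd cmds[BATCH_MAX_CMDS];
    size_t count;
};

/* Append one command; returns false if the batch is full. */
static bool batch_emit(struct batch *b, enum cmd_kind kind, uint64_t address)
{
    assert(b != NULL);
    if (b->count >= BATCH_MAX_CMDS)
        return false;
    b->cmds[b->count].kind = kind;
    b->cmds[b->count].address = address;
    b->count++;
    return true;
}

/*
 * Record an end-of-pipe sync: a CS stall with a write to workaround_addr
 * and, when is_haswell is set, a read from the same address to create the
 * memory dependency. Returns false and emits nothing if the whole sequence
 * does not fit in the batch.
 */
static bool emit_end_of_pipe_sync(struct batch *b, uint64_t workaround_addr,
                                  bool is_haswell)
{
    size_t needed = is_haswell ? 2 : 1;

    assert(b != NULL);
    if (b->count + needed > BATCH_MAX_CMDS)
        return false;

    batch_emit(b, CMD_CS_STALL_WITH_WRITE, workaround_addr);
    if (is_haswell)
        batch_emit(b, CMD_READ_BACK, workaround_addr);
    return true;
}
```

A caller would invoke emit_end_of_pipe_sync() immediately before and after
the fast-clear or resolve, and start a new batch first if it returns false.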