[Mesa-dev] [PATCH 0/7] i965: Stop hanging on Haswell
Chris Wilson
chris at chris-wilson.co.uk
Wed Jun 14 09:00:37 UTC 2017
Quoting Jason Ekstrand (2017-06-13 22:53:20)
> As I've been working on converting more things in the GL driver over to
> blorp, I've been highly annoyed by all of the hangs on Haswell. About one
> in 3-5 Jenkins runs would hang somewhere. After looking at about a
> half-dozen error states, I noticed that all of the hangs seemed to be on
> fast-clear operations (clear or resolve) that happen at the start of a
> batch, right after STATE_BASE_ADDRESS.
>
> Haswell seems to be a bit more picky than other hardware about having
> fast-clear operations in flight at the same time as regular rendering and
> hangs if the two ever overlap. (Other hardware can get rendering
> corruption but not usually hangs.) Also, Haswell doesn't fully stall if
> you just do a RT flush and a CS stall. The hardware docs refer to
> something they call an "end of pipe sync" which is a CS stall with a write
> to the workaround BO. On Haswell, you also need to read from that same
> address to create a memory dependency and make sure the system is fully
> stalled.
>
> When you call brw_blorp_resolve_color it calls brw_emit_pipe_control_flush
> and does the correct flushes and then calls into core blorp to do the
> actual resolve operation. If the batch doesn't have enough space left in
> it for the fast-clear operation, the batch will get split and the
> fast-clear will happen in the next batch. I believe what is happening is
> that while we're building the second batch that actually contains the
> fast-clear, some other process completes a batch and inserts it between our
> PIPE_CONTROL to do the stall and the actual fast-clear. We then end up
> with more stuff in flight than we can handle and the GPU explodes.
>
> I'm not 100% convinced of this explanation because it seems a bit fishy
> that a context switch wouldn't be enough to fully flush out the GPU.
> However, what I do know is that, without these patches I get a hang in one
> out of three to five Jenkins runs on my wip/i965-blorp-ds branch. With the
> patches (or an older variant that did the same thing), I have done almost 20
> Jenkins runs and have yet to see a hang. I'd call that success.
Note that a context switch is itself just a batch that restores the registers
and GPU state.
The kernel does:

    PIPE_CONTROLs for invalidate-caches
    MI_SET_CONTEXT
    MI_BB_START
    PIPE_CONTROLs for flush-caches
    MI_STORE_DWORD (seqno)
    MI_USER_INTERRUPT

What I believe you are seeing is that MI_SET_CONTEXT leaves the GPU in
an active state, so a pipeline barrier is required before any state is
adjusted. That makes it the equivalent of switching between GL and blorp
in the middle of a batch.
The question I have is whether we should apply the fix in the kernel,
i.e. do a full end of pipe sync after every MI_SET_CONTEXT. Userspace has
the advantage of knowing if/when such a hammer is required, but equally we
have to learn where it is needed by trial and error, and if a second
context user ever manifests, they will have to be taught the same lessons.
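
To make the kernel-side option concrete, here is a minimal sketch of the
ordering only. It models the sequence listed above as a list of tags and
optionally slots an end of pipe sync in right after MI_SET_CONTEXT. All of
the names (enum cmd, struct cmd_seq, emit, build_request) are hypothetical
and made up for illustration; this is not i915 code and contains no real
command encodings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tags named after the steps in the list above.
 * Illustrative only: these are not real opcodes or i915 identifiers. */
enum cmd {
    CMD_PIPE_CONTROL_INVALIDATE,  /* PIPE_CONTROLs for invalidate-caches */
    CMD_MI_SET_CONTEXT,
    CMD_END_OF_PIPE_SYNC,         /* the proposed extra step */
    CMD_MI_BB_START,
    CMD_PIPE_CONTROL_FLUSH,       /* PIPE_CONTROLs for flush-caches */
    CMD_MI_STORE_DWORD_SEQNO,
    CMD_MI_USER_INTERRUPT,
};

#define MAX_CMDS 8

struct cmd_seq {
    enum cmd cmds[MAX_CMDS];
    size_t len;
};

/* Append one tag to the sequence. */
static void emit(struct cmd_seq *seq, enum cmd c)
{
    assert(seq->len < MAX_CMDS);
    seq->cmds[seq->len++] = c;
}

/* Build the per-request sequence. When sync_after_set_context is true,
 * an end of pipe sync is placed between MI_SET_CONTEXT and MI_BB_START,
 * which is the kernel-side fix being asked about. */
static void build_request(struct cmd_seq *seq, bool sync_after_set_context)
{
    seq->len = 0;
    emit(seq, CMD_PIPE_CONTROL_INVALIDATE);
    emit(seq, CMD_MI_SET_CONTEXT);
    if (sync_after_set_context)
        emit(seq, CMD_END_OF_PIPE_SYNC);
    emit(seq, CMD_MI_BB_START);
    emit(seq, CMD_PIPE_CONTROL_FLUSH);
    emit(seq, CMD_MI_STORE_DWORD_SEQNO);
    emit(seq, CMD_MI_USER_INTERRUPT);
}
```

With the flag set, every user batch would start behind a full stall no
matter what the previous context left in flight. With it clear, the
ordering is the one listed above and the stall stays the responsibility
of userspace.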
-Chris