[Mesa-dev] [PATCH 0/7] i965: Stop hanging on Haswell

Jason Ekstrand jason at jlekstrand.net
Wed Jun 14 15:55:08 UTC 2017


On Wed, Jun 14, 2017 at 2:00 AM, Chris Wilson <chris at chris-wilson.co.uk>
wrote:

> Quoting Jason Ekstrand (2017-06-13 22:53:20)
> > As I've been working on converting more things in the GL driver over
> > to blorp, I've been highly annoyed by all of the hangs on Haswell.
> > About one in 3-5 Jenkins runs would hang somewhere.  After looking at
> > about a half-dozen error states, I noticed that all of the hangs
> > seemed to be on fast-clear operations (clear or resolve) that happen
> > at the start of a batch, right after STATE_BASE_ADDRESS.
> >
> > Haswell seems to be a bit more picky than other hardware about having
> > fast-clear operations in flight at the same time as regular rendering
> > and hangs if the two ever overlap.  (Other hardware can get rendering
> > corruption but not usually hangs.)  Also, Haswell doesn't fully stall
> > if you just do an RT flush and a CS stall.  The hardware docs refer
> > to something they call an "end of pipe sync" which is a CS stall with
> > a write to the workaround BO.  On Haswell, you also need to read from
> > that same address to create a memory dependency and make sure the
> > system is fully stalled.
> >
> > When you call brw_blorp_resolve_color, it calls
> > brw_emit_pipe_control_flush and does the correct flushes, and then
> > calls into core blorp to do the actual resolve operation.  If the
> > batch doesn't have enough space left in it for the fast-clear
> > operation, the batch will get split and the fast-clear will happen in
> > the next batch.  I believe what is happening is that while we're
> > building the second batch that actually contains the fast-clear, some
> > other process completes a batch and inserts it between our
> > PIPE_CONTROL to do the stall and the actual fast-clear.  We then end
> > up with more stuff in flight than we can handle and the GPU explodes.
> >
> > I'm not 100% convinced of this explanation because it seems a bit
> > fishy that a context switch wouldn't be enough to fully flush out the
> > GPU.  However, what I do know is that, without these patches, I get a
> > hang in one out of three to five Jenkins runs on my wip/i965-blorp-ds
> > branch.  With the patches (or an older variant that did the same
> > thing), I have done almost 20 Jenkins runs and have yet to see a
> > hang.  I'd call that success.
>

For the record, I *think* this also improves Sky Lake.  I believe I saw
hangs (less often, maybe 1 in 10) without this and have seen none with it.
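
For reference, the end-of-pipe sync described above boils down to
something like the following sketch.  This is a minimal sketch assuming
the existing i965 helpers brw_emit_pipe_control_write() and
brw_load_register_mem(); the exact shape in the series may differ, and
GEN7_3DPRIM_START_INSTANCE is used purely as a scratch register:

   /* A CS stall alone isn't sufficient; pair it with a post-sync
    * write to the workaround BO so there is something to synchronize
    * against.
    */
   brw_emit_pipe_control_write(brw,
                               flags | PIPE_CONTROL_CS_STALL |
                               PIPE_CONTROL_WRITE_IMMEDIATE,
                               brw->workaround_bo, 0, 0);

   if (brw->is_haswell) {
      /* On Haswell, also read back the address we just wrote to
       * create the memory dependency that makes the stall actually
       * take effect.  The destination register is just scratch.
       */
      brw_load_register_mem(brw, GEN7_3DPRIM_START_INSTANCE,
                            brw->workaround_bo, 0);
   }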


> Note that a context switch is itself just a batch that restores the
> registers and GPU state.
>
> The kernel does
>
>         PIPE_CONTROLs for invalidate-caches
>         MI_SET_CONTEXT
>         MI_BB_START
>         PIPE_CONTROLs for flush-caches
>         MI_STORE_DWORD (seqno)
>         MI_USER_INTERRUPT
>
> What I believe you are seeing is that MI_SET_CONTEXT is leaving the GPU
> in an active state requiring a pipeline barrier before adjusting. It
> will be the equivalent of switching between GL and blorp in the middle of
> a batch.
>

That's also a reasonable theory (maybe even a better one).  However, the
work-around is the same either way.


> The question I have is whether we apply the fix in the kernel, i.e. do a
> full end of pipe sync after every MI_SET_CONTEXT. Userspace has the
> advantage of knowing if/when such a hammer is required, but equally we
> have to learn where by trial-and-error and if a second context user ever
> manifests, they will have to be taught the same lessons.
>

Right.

Here are arguments for doing it in the kernel:

 1) It's the "right" place to do it because it appears to be a
cross-context issue.
 2) The kernel knows whether or not you're getting an actual context switch
and can insert the end-of-pipe sync when an actual context switch happens
rather than on every batch.

Here are arguments for doing it in userspace:

 1) Userspace knows whether or not we're doing an actual fast-clear
operation and can flush only for fast-clears at the beginning of the
batch.  (See the sketch below.)
 2) The kernel isn't flushing today, so unless we do it in userspace,
we'll get hangs until people update their kernels.
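
To make (1) concrete, the userspace placement can be as narrow as
something like this (a sketch: batch_is_empty() is a hypothetical
stand-in for whatever check we'd actually use, and I'm assuming the
series' helper ends up named brw_emit_end_of_pipe_sync()):

   /* Only pay for the full end-of-pipe sync when a fast-clear or
    * resolve is about to be the first rendering in a fresh batch,
    * rather than after every MI_SET_CONTEXT.  batch_is_empty() is a
    * stand-in, not a real i965 helper.
    */
   if (batch_is_empty(brw))
      brw_emit_end_of_pipe_sync(brw, PIPE_CONTROL_RENDER_TARGET_FLUSH);

   /* ...then emit the fast-clear or resolve via blorp as usual. */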

My gut says userspace but that's because I tend to have a mild distrust of
the kernel.  There are some things that are the kernel's job (dealing with
context switches, for instance) but I'm a big fan of putting anything in
userspace that can reasonably go there.

Here's some more data.  Knowing this was a big giant hammer, I ran a full
suite of benchmarks overnight on my Haswell GT3 and this is what I found:

Test                   0-master     1-i965-end-of-pipe     diff
bench_manhattan        4442.510     4430.870               -11.640
bench_manhattanoff     4683.300     4663.000               -20.300
bench_OglBatch0        773.523      771.027                -2.496
bench_OglBatch1        775.858      771.802                -4.056
bench_OglBatch4        747.629      745.522                -2.107
bench_OglPSBump2       513.528      514.944                1.416

So the only statistically significant differences were in manhattan and
some of the batch tests, all by around 0.5% or less (the worst being
bench_OglBatch1 at about -0.52%), which may easily have been noise
(though ministat seems to think it's significant).  So, if the big
hammer is hurting us, it's not hurting us badly.

