[Intel-gfx] [PATCH 5/8] drm/i915: Double check hangcheck.seqno after reset

Mon Oct 3 14:01:40 UTC 2016

On Mon, Oct 03, 2016 at 04:14:39PM +0300, Mika Kuoppala wrote:
> Chris Wilson <chris at chris-wilson.co.uk> writes:
> 
> > Check that there was not a late recovery between us declaring the GPU
> > hung and processing the reset. If the GPU did recover by itself, let the
> > request remain on the active list and see if it hangs again!
> >
> 
> Did you see this in action? Makes sense to recheck
> after reset. I don't remember how TDR will deal with multiple
> resets on the same engine, but we should start tracking the seqno
> that causes it and make sure we don't get stuck replaying the same one.

Not observed. All my testing is with true hangs, not slow batches at
~hangcheck interval, i.e. nothing that would unstick between us
detecting the hang and sending the reset request.
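
For illustration only, the recheck boils down to comparing the seqno
sampled when the hang was declared against what the engine reports once
we get around to processing the reset. A minimal sketch, assuming a
simplified stand-in struct and helper names (engine_sketch,
seqno_passed and still_hung are all hypothetical, not the i915 code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for per-engine state; not the i915 structs. */
struct engine_sketch {
	uint32_t hw_seqno;        /* last breadcrumb the GPU has written */
	uint32_t hangcheck_seqno; /* seqno sampled when the hang was declared */
};

/* Wrap-safe test: has seqno a reached or passed seqno b? */
static inline bool seqno_passed(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) >= 0;
}

/*
 * True if the engine has made no progress since the hang was declared,
 * i.e. the request should still be treated as hung. If the breadcrumb
 * has moved on, the GPU recovered late and the request can stay on the
 * active list.
 */
static inline bool still_hung(const struct engine_sketch *e)
{
	return e->hw_seqno == e->hangcheck_seqno;
}
```

If still_hung() returns false at reset time, the request is left alone
and hangcheck gets another chance to catch it should it stall again.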

> Do we check the banning on resubmission and/or do we trust that
> the breadcrumb update always succeeds?

Breadcrumb update has to succeed as we are not the only observers.
Issuing it from the GPU is the only way we can do the reset without
shooting ourselves in the foot with nasty races (and general explosions
from use-after-free).

> I envision that if we get multiple resets on the same seqno, we
> just write the breadcrumbs through the CPU and move on. But let's
> hope we don't need to and the GPU breadcrumbs are always enough.

Can't do. The only time we can take over with the CPU is if we declare
the machine wedged and drain everything and all observers (including
external). That is quite impractical. :|
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre