[Intel-gfx] [RFC 00/11] TDR/watchdog timeout support for gen8

Chris Wilson chris at chris-wilson.co.uk
Sat Jul 11 11:22:28 PDT 2015


On Fri, Jul 10, 2015 at 04:24:33PM +0100, Tomas Elf wrote:
> On 09/07/2015 19:47, Chris Wilson wrote:
> >On Mon, Jun 08, 2015 at 06:03:18PM +0100, Tomas Elf wrote:
> >>This patch series introduces the following features:
> >>
> >>* Feature 1: TDR (Timeout Detection and Recovery) for gen8 execlist mode.
> >>* Feature 2: Watchdog Timeout (a.k.a "media engine reset") for gen8.
> >>* Feature 3. Context Submission Status Consistency checking
> >
> >The high level design is reasonable and conceptually extends the current
> >system in fairly obvious ways.
> >
> >In terms of discussing the implementation, I think this can be phased
> >as:
> >
> >0. Move to a per-engine hangcheck
> >
> >1. Add fake-interrupt recovery for CSSC
> >
> >   I think this can be done without changing the hangcheck heuristics at
> >   all - we just need to think carefully about recovery (which is a nice
> >   precursor to per-engine reset). I may be wrong, and so would like to
> >   be told so early on! If the fake interrupt recovery is done as part of
> >   the reset handler, we should have (one) fewer concurrency issues to
> >   worry about.
> 
> Some points about moving the CSSC out of the hang checker and into
> the reset handler:
> 
> 1. If we deal with consistency rectification in the reset handler
> the turnaround time becomes REALLY long:
> 		
> 	a. First you have the time to detect the hang, call
> i915_handle_error() that then raises the reset in progress flag,
> preventing further submissions to the driver.
> 
> 	b. Then go all the way to the per-engine recovery path, only to
> discover that we've got an inconsistency that has not been handled,
> fall back immediately with -EAGAIN and lower the reset in progress
> flag, let the system continue running and defer to the next hang
> check (another hang detection period)

Urm, there is no per-engine recovery at this phase. And even if there were,
why would you do that if you could detect the inconsistency and just
issue the fake interrupt? This is why I think phasing it in this way
makes it more obvious about the choices we can make during reset
handling.

[snip]

> 2. Secondly, and more importantly, if a watchdog timeout is detected
> and we end up in the per-engine hang recovery path and have to fall
> back due to an inconsistent context submission state at that point
> and the hang checker is turned off then we're irrecoverably hung.
> Watchdog timeout is supposed to work without the periodic hang
> checker but it won't if CSSC is not ensured at all times. Which is
> why I chose to override the i915.enable_hangcheck flag to make sure
> that the hang checker always runs consistency pre-checking and
> reschedules itself if there is more work pending to make sure that
> as long as work is pending we do consistency checking asynchronously
> regardless of everything else so that if a watchdog timeout hits we
> have a consistent state once the watchdog timeout ends up in
> per-engine recovery.

Again, what? If the watchdog mechanism is just a faster way to detect
hangs and queue the reset task, what the reset does is then independent
of how we detect the hang.
 
> Granted, if a watchdog timeout hits after we've first detected the
> inconsistency but not yet had time to rectify it it doesn't work if
> the hang checker is turned off and we cannot rely on periodic hang
> checking to schedule hang recovery in this case - so in that case
> we're still irrecoverably stuck. We could make change here and do a
> one-time i915.enable_hangcheck override and schedule hang recovery
> following this point. If you think it's worth it.
> 	
> Bottom line: The consistency checking must happen at all times and
> cannot be done as a consequence of a scheduled reset if hang
> checking is turned off at any point.

I strongly disagree. I think of this as two unique tasks, (a) detect
hang and (b) do something about it. At the moment (b) isn't smart
enough to work out the minimum it has to do in order to recover.

Which is why I suggest you think about adding the simpler fake interrupt
into the reset path first, before you even think about doing TDR.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


More information about the Intel-gfx mailing list