[Intel-gfx] [RFC 00/11] TDR/watchdog timeout support for gen8
Chris Wilson
chris at chris-wilson.co.uk
Thu Jul 9 11:47:35 PDT 2015
On Mon, Jun 08, 2015 at 06:03:18PM +0100, Tomas Elf wrote:
> This patch series introduces the following features:
>
> * Feature 1: TDR (Timeout Detection and Recovery) for gen8 execlist mode.
> * Feature 2: Watchdog Timeout (a.k.a "media engine reset") for gen8.
> * Feature 3. Context Submission Status Consistency checking
The high level design is reasonable and conceptually extends the current
system in fairly obvious ways.
In terms of discussing the implementation, I think this can be phased
as:
0. Move to a per-engine hangcheck
1. Add fake-interrupt recovery for CSSC
I think this can be done without changing the hangcheck heuristics at
all - we just need to think carefully about recovery (which is a nice
precursor to per-engine reset). I may be wrong, and so would like to
be told so early on! If the fake interrupt recovery is done as part of
the reset handler, we should have (one) fewer concurrency issues to
worry about. (In a similar vein, I think we should move the missed
interrupt handler for legacy out of hangcheck, partly to simplify some
very confusing code and partly so that we have fewer deviations
between legacy/execlists paths.) It also gets us thinking about the
consistency detection and when it is viable to do a fake-interrupt and
when we must do a full-reset (for example, we only want to
fake-interrupt if the consistency check says the GPU is idle, and we
definitely want to reset everything if the GPU is executing an alien
context.)
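
To illustrate the decision I'm thinking of, a rough sketch (the state
tracking and the recovery actions here are placeholders for
illustration, not the existing i915 code):

/*
 * Minimal sketch of the consistency-check decision: fake an interrupt
 * only when the hardware reports idle but software still has pending
 * work; fall back to a full reset if the hardware is executing a
 * context we do not recognise.
 */
enum cssc_action {
	CSSC_NONE,		/* HW and SW agree, nothing to do */
	CSSC_FAKE_INTERRUPT,	/* HW idle, SW pending: lost interrupt */
	CSSC_FULL_RESET,	/* HW executing an alien context */
};

struct engine_state {
	unsigned int hw_ctx_id;	/* from EXECLIST_STATUS (assumed source) */
	int hw_idle;		/* derived from the same register */
	unsigned int sw_ctx_id;	/* context we believe we submitted */
	int sw_pending;		/* submissions not yet signalled */
};

static enum cssc_action cssc_check(const struct engine_state *e)
{
	if (e->hw_idle && e->sw_pending)
		return CSSC_FAKE_INTERRUPT; /* replay the lost completion */

	if (!e->hw_idle && e->hw_ctx_id != e->sw_ctx_id)
		return CSSC_FULL_RESET;	/* alien context: reset everything */

	return CSSC_NONE;
}
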
A test here would be to suspend the execlists irq and wait for the
recovery. Cheekily we could punch the irq eir by root mmio and check
hangcheck automagically recovers.
Whilst it would be nice to add the watchdog next, since it is
conceptually quite simple and basically just a new hangcheck source with
fancy setup - fast hangcheck without soft reset makes for an easy DoS.
2. TDR
Given that we have a consistency check and have begun to extend the reset
path, we can implement a soft reset that only skips the hung request.
(The devil is in the details, whilst the design here looked solid, I
think the LRC recovery code could be simplified - I didn't feel
another requeue path was required given that we need only pretend a
request completed (fixing up the context image as required) and then
use the normal unqueue.) There is also quite a bit of LRC cleanup on
the lists which would be useful here.
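
Roughly what I mean, as a sketch with made-up types rather than the
real request/context structures:

/*
 * Sketch only: skip the hung request by pretending it completed and
 * then reuse the ordinary unqueue path. The structures and the fixup
 * of the context image are illustrative, not the i915 types.
 */
struct fake_request {
	unsigned int head;	/* ring offset where the request starts */
	unsigned int tail;	/* ring offset just past its commands */
	int completed;
};

struct fake_context_image {
	unsigned int ring_head;	/* RING_HEAD saved in the context image */
	unsigned int ring_tail;	/* RING_TAIL saved in the context image */
};

static void skip_hung_request(struct fake_context_image *ctx,
			      struct fake_request *rq)
{
	/* Advance the saved head past the hung commands... */
	ctx->ring_head = rq->tail;

	/* ...and pretend the request signalled completion. */
	rq->completed = 1;

	/*
	 * From here the normal unqueue/resubmit path can run; nothing
	 * special is left queued, so no extra requeue machinery is
	 * needed.
	 */
}
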
Lots of tests for concurrent engine utilisation, multiple contexts,
etc and ensuring that a hang in one does not affect independent work
(engines, batches, contexts).
3. Watchdog
A new fast hangcheck source. The concerns here are concurrency,
promotion (relative scoring between watchdog / hangcheck) and the
issue of not programming the ring correctly (beware the interrupt
firing after performing a dispatch and programming the tail; this
needs either reserved space or inlining into the dispatch, etc.).
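
As a sketch of the inlining idea, arming and disarming the watchdog
within the same reserved block of ring space as the dispatch itself
(the register offsets and the cancel value are guesses for
illustration, not taken from the bspec):

#define HYPOTHETICAL_WDT_THRESH(base)	((base) + 0x17c)	/* assumed */
#define HYPOTHETICAL_WDT_CNTR(base)	((base) + 0x178)	/* assumed */
#define MI_LOAD_REGISTER_IMM_1		((0x22 << 23) | 1)
#define MI_BATCH_BUFFER_START_GEN8	((0x31 << 23) | 1)
#define MI_NOOP				0

/* cs points into ring space the caller has already reserved. */
static void emit_batch_with_watchdog(unsigned int *cs,
				     unsigned int mmio_base,
				     unsigned long long batch_addr,
				     unsigned int threshold)
{
	/* Arm the watchdog just before jumping to the batch. */
	*cs++ = MI_LOAD_REGISTER_IMM_1;
	*cs++ = HYPOTHETICAL_WDT_THRESH(mmio_base);
	*cs++ = threshold;

	*cs++ = MI_BATCH_BUFFER_START_GEN8;
	*cs++ = (unsigned int)batch_addr;
	*cs++ = (unsigned int)(batch_addr >> 32);

	/* Disarm as soon as the batch returns, before the breadcrumb. */
	*cs++ = MI_LOAD_REGISTER_IMM_1;
	*cs++ = HYPOTHETICAL_WDT_CNTR(mmio_base);
	*cs++ = ~0u;				/* assumed "cancel" value */
	*cs++ = MI_NOOP;
}
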
The only thing of note here is the uapi. It is a server feature
(interactive clients are less likely to tolerate data being thrown
away). Do we want an execbuf bit or context param? An execbuf bit
allows switching between two watchdog timers (or on/off), but we
have many more context params available than bits. Do we want to
expose context params to set the fast/slow timeouts?
We haven't found any GL spec that describes controllable watchdogs, so
the ultimate uapi requirements are unknown. My feeling is that we want
to do the initial uapi through context params, and start with a single
fast watchdog timeout value. This is sufficient for testing and
probably for most use cases. This can easily be extended by adding an
execbuf bit to switch between two values and a context param to set
the second value. (If the second value isn't set, the execbuf bit
doesn't do anything; the watchdog is always programmed to the user
value. If the second value is set (maybe infinite), then the execbuf
bit is used to select which timeout to use for that batch.) But given
that this is a server-esque feature, it is likely to be a setting the
userspace driver imposes upon its clients, and there is unlikely to be
the need to switch timeouts within any one client.
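
Something along these lines from the userspace side, using the
existing context-param ioctl; I915_CONTEXT_PARAM_WATCHDOG is a made-up
name (and placeholder value) for a param that does not exist yet, and
the units of the value are still to be decided:

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

#ifndef I915_CONTEXT_PARAM_WATCHDOG
#define I915_CONTEXT_PARAM_WATCHDOG 0x42	/* placeholder */
#endif

/* Set a single fast watchdog timeout on a context. */
static int set_watchdog_timeout(int drm_fd, uint32_t ctx_id,
				uint64_t timeout)
{
	struct drm_i915_gem_context_param p = {
		.ctx_id = ctx_id,
		.param = I915_CONTEXT_PARAM_WATCHDOG,
		.value = timeout,
	};

	return ioctl(drm_fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
}
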
The tests here would focus on the uapi and ensuring that if the client
asks for a 60ms hang detection, then all work packets take at most
60ms to complete. Hmm, add wait_ioctl to the list of uapi that should
report -EIO.
> drivers/gpu/drm/i915/i915_debugfs.c | 146 +++++-
> drivers/gpu/drm/i915/i915_drv.c | 201 ++++++++
> drivers/gpu/drm/i915/intel_lrc.c | 858 ++++++++++++++++++++++++++++++-
The balance here feels wrong ;-)
-Chris
--
Chris Wilson, Intel Open Source Technology Centre