[Intel-gfx] [RFC 00/11] TDR/watchdog timeout support for gen8
Chris Wilson
chris at chris-wilson.co.uk
Thu Jul 9 11:47:35 PDT 2015
On Mon, Jun 08, 2015 at 06:03:18PM +0100, Tomas Elf wrote:
> This patch series introduces the following features:
>
> * Feature 1: TDR (Timeout Detection and Recovery) for gen8 execlist mode.
> * Feature 2: Watchdog Timeout (a.k.a "media engine reset") for gen8.
> * Feature 3. Context Submission Status Consistency checking
The high level design is reasonable and conceptually extends the current
system in fairly obvious ways.
In terms of discussing the implementation, I think this can be phased
as:
0. Move to a per-engine hangcheck
1. Add fake-interrupt recovery for CSSC
I think this can be done without changing the hangcheck heuristics at
all - we just need to think carefully about recovery (which is a nice
precursor to per-engine reset). I may be wrong, and so would like to
be told so early on! If the fake interrupt recovery is done as part of
the reset handler, we should have (one) fewer concurrency issues to
worry about. (In a similar vein, I think we should move the missed
interrupt handler for legacy out of hangcheck, partly to simplify some
very confusing code and partly so that we have fewer deviations
between legacy/execlists paths.) It also gets us thinking about the
consistency detection and when it is viable to do a fake-interrupt and
when we must do a full-reset (for example, we only want to
fake-interrupt if the consistency check says the GPU is idle, and we
definitely want to reset everything if the GPU is executing an alien
context.)
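
To illustrate the decision I'm thinking of, a rough sketch (the state
tracking and the recovery actions here are placeholders for
illustration, not the existing i915 code):

/*
 * Minimal sketch of the consistency-check decision: fake an interrupt
 * only when the hardware reports idle but software still has pending
 * work; fall back to a full reset if the hardware is executing a
 * context we do not recognise.
 */
enum cssc_action {
	CSSC_NONE,		/* HW and SW agree, nothing to do */
	CSSC_FAKE_INTERRUPT,	/* HW idle, SW pending: lost interrupt */
	CSSC_FULL_RESET,	/* HW executing an alien context */
};

struct engine_state {
	unsigned int hw_ctx_id;	/* from EXECLIST_STATUS (assumed source) */
	int hw_idle;		/* derived from the same register */
	unsigned int sw_ctx_id;	/* context we believe we submitted */
	int sw_pending;		/* submissions not yet signalled */
};

static enum cssc_action cssc_check(const struct engine_state *e)
{
	if (e->hw_idle && e->sw_pending)
		return CSSC_FAKE_INTERRUPT; /* replay the lost completion */

	if (!e->hw_idle && e->hw_ctx_id != e->sw_ctx_id)
		return CSSC_FULL_RESET;	/* alien context: reset everything */

	return CSSC_NONE;
}
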
A test here would be to suspend the execlists irq and wait for the
recovery. Cheekily we could punch the irq eir by root mmio and check
hangcheck automagically recovers.
Whilst it would be nice to add the watchdog next, since it is
conceptually quite simple and basically just a new hangcheck source with
fancy setup - fast hangcheck without soft reset makes for an easy DoS.
2. TDR
Given that we have a consistency check and have begun to extend the reset
path, we can implement a soft reset that only skips the hung request.
(The devil is in the details, whilst the design here looked solid, I
think the LRC recovery code could be simplified - I didn't feel
another requeue path was required given that we need only pretend a
request completed (fixing up the context image as required) and then
use the normal unqueue.) There is also quite a bit of LRC cleanup on
the lists which would be useful here.
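
Roughly what I mean, as a sketch with made-up types rather than the
real request/context structures:

/*
 * Sketch only: skip the hung request by pretending it completed and
 * then reuse the ordinary unqueue path. The structures and the fixup
 * of the context image are illustrative, not the i915 types.
 */
struct fake_request {
	unsigned int head;	/* ring offset where the request starts */
	unsigned int tail;	/* ring offset just past its commands */
	int completed;
};

struct fake_context_image {
	unsigned int ring_head;	/* RING_HEAD saved in the context image */
	unsigned int ring_tail;	/* RING_TAIL saved in the context image */
};

static void skip_hung_request(struct fake_context_image *ctx,
			      struct fake_request *rq)
{
	/* Advance the saved head past the hung commands... */
	ctx->ring_head = rq->tail;

	/* ...and pretend the request signalled completion. */
	rq->completed = 1;

	/*
	 * From here the normal unqueue/resubmit path can run; nothing
	 * special is left queued, so no extra requeue machinery is
	 * needed.
	 */
}
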
Lots of tests for concurrent engine utilisation, multiple contexts,
etc and ensuring that a hang in one does not affect independent work
(engines, batches, contexts).
3. Watchdog
A new fast hangcheck source. The concerns here are concurrency,
promotion (relative scoring between watchdog / hangcheck) and the
issue of not programming the ring correctly (beware the interrupt
firing after performing a dispatch and programming the tail; this
needs either reserved space or inlining into the dispatch, etc.).
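
As a sketch of the inlining idea, arming and disarming the watchdog
within the same reserved block of ring space as the dispatch itself
(the register offsets and the cancel value are guesses for
illustration, not taken from the bspec):

#define HYPOTHETICAL_WDT_THRESH(base)	((base) + 0x17c)	/* assumed */
#define HYPOTHETICAL_WDT_CNTR(base)	((base) + 0x178)	/* assumed */
#define MI_LOAD_REGISTER_IMM_1		((0x22 << 23) | 1)
#define MI_BATCH_BUFFER_START_GEN8	((0x31 << 23) | 1)
#define MI_NOOP				0

/* cs points into ring space the caller has already reserved. */
static void emit_batch_with_watchdog(unsigned int *cs,
				     unsigned int mmio_base,
				     unsigned long long batch_addr,
				     unsigned int threshold)
{
	/* Arm the watchdog just before jumping to the batch. */
	*cs++ = MI_LOAD_REGISTER_IMM_1;
	*cs++ = HYPOTHETICAL_WDT_THRESH(mmio_base);
	*cs++ = threshold;

	*cs++ = MI_BATCH_BUFFER_START_GEN8;
	*cs++ = (unsigned int)batch_addr;
	*cs++ = (unsigned int)(batch_addr >> 32);

	/* Disarm as soon as the batch returns, before the breadcrumb. */
	*cs++ = MI_LOAD_REGISTER_IMM_1;
	*cs++ = HYPOTHETICAL_WDT_CNTR(mmio_base);
	*cs++ = ~0u;				/* assumed "cancel" value */
	*cs++ = MI_NOOP;
}
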
The only thing of note here is the uapi. It is a server feature
(interactive clients are less likely to tolerate data being thrown
away). Do we want an execbuf bit or context param? An execbuf bit
allows switching between two watchdog timers (or on/off), but we
have many more context params available than bits. Do we want to
expose context params to set the fast/slow timeouts?
We haven't found any GL spec that describes controllable watchdogs, so
the ultimate uapi requirements are unknown. My feeling is that we want
to do the initial uapi through context params, and start with a single
fast watchdog timeout value. This is sufficient for testing and
probably for most use cases. This can easily be extended by adding an
execbuf bit to switch between two values and a context param to set
the second value. (If the second value isn't set, the execbuf bit
doesn't do anything; the watchdog is always programmed to the user
value. If the second value is set (maybe infinite), then the execbuf
bit is used to select which timeout to use for that batch.) But given
that this is a server-esque feature, it is likely to be a setting the
userspace driver imposes upon its clients, and there is unlikely to be
the need to switch timeouts within any one client.
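
Something along these lines from the userspace side, using the
existing context-param ioctl; I915_CONTEXT_PARAM_WATCHDOG is a made-up
name (and placeholder value) for a param that does not exist yet, and
the units of the value are still to be decided:

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

#ifndef I915_CONTEXT_PARAM_WATCHDOG
#define I915_CONTEXT_PARAM_WATCHDOG 0x42	/* placeholder */
#endif

/* Set a single fast watchdog timeout on a context. */
static int set_watchdog_timeout(int drm_fd, uint32_t ctx_id,
				uint64_t timeout)
{
	struct drm_i915_gem_context_param p = {
		.ctx_id = ctx_id,
		.param = I915_CONTEXT_PARAM_WATCHDOG,
		.value = timeout,
	};

	return ioctl(drm_fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
}
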
The tests here would focus on the uapi and ensuring that if the client
asks for a 60ms hang detection, then all work packets take at most
60ms to complete. Hmm, add wait_ioctl to the list of uapi that should
report -EIO.
> drivers/gpu/drm/i915/i915_debugfs.c | 146 +++++-
> drivers/gpu/drm/i915/i915_drv.c | 201 ++++++++
> drivers/gpu/drm/i915/intel_lrc.c | 858 ++++++++++++++++++++++++++++++-
The balance here feels wrong ;-)
-Chris
--
Chris Wilson, Intel Open Source Technology Centre