[Intel-gfx] [PATCH 00/20] TDR/watchdog support for gen8

Thu Oct 22 18:32:22 PDT 2015

This patch series introduces the following features:

* Feature 1: TDR (Timeout Detection and Recovery) for gen8 execlist mode.

TDR is an umbrella term for anything that goes into detecting and recovering
from GPU hangs and is a term more widely used outside of the i915 driver.

This feature introduces an extensible framework that currently supports gen8
but that can be easily extended to support gen7 as well (which is already the
case in Android but unfortunately in a not quite upstreamable form). The code
contained in this submission represents the essentials of what is currently in
Android merged with what is currently in upstream (as of the time when this
work commenced a few months back).

A new hang recovery path has been added alongside the legacy GPU reset path,
which takes care of engine recovery only. Aside from adding support for
per-engine recovery this feature also introduces rules for how to promote a
potential per-engine reset to a legacy, full GPU reset.

The hang checker now integrates with the error handler in a slightly different
way in that it allows hang recovery on multiple engines at the same time by
passing an engine flag mask to the error handler where flags representing all
of the hung engines are set. This allows us to schedule hang recovery once for
all currently hung engines instead of one hang recovery per detected engine
hang. Previously, when only full GPU reset was supported, this made no
difference: whether one or four engines were hung at any given point, the
outcome was the same - the GPU getting reset. Now the behaviour depends on
which engines are hung, since each engine is reset independently of all the
others. We therefore have to think about this in terms of scheduling cost and
recovery latency (see the open questions section below).
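
As a rough illustration of the engine flag mask idea, here is a minimal
user-space sketch. The engine enum, the struct names and both functions are
hypothetical stand-ins and are not taken from the patches; the only point is
that one flag per hung engine is collected and recovery is scheduled once.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical engine identifiers; the driver's own enum may differ. */
enum engine_id { ENGINE_RCS, ENGINE_VCS, ENGINE_BCS, ENGINE_VECS, ENGINE_COUNT };

/* Hypothetical per-engine hang checker verdict. */
struct engine_hangcheck {
	int hung; /* non-zero if the hang checker considers the engine hung */
};

/* Hypothetical error handler state. */
struct error_handler {
	uint32_t pending_mask; /* engines waiting for recovery */
	int times_scheduled;   /* how often recovery was scheduled */
};

/* Build a mask with one flag set per hung engine. */
static uint32_t collect_hung_engines(const struct engine_hangcheck *hc, int count)
{
	uint32_t mask = 0;
	int i;

	assert(count >= 0 && count <= 32);
	for (i = 0; i < count; i++)
		if (hc[i].hung)
			mask |= UINT32_C(1) << i;
	return mask;
}

/* Schedule recovery once for every engine in the mask. */
static void schedule_recovery(struct error_handler *eh, uint32_t mask)
{
	if (!mask)
		return; /* nothing hung, nothing to schedule */
	eh->pending_mask |= mask;
	eh->times_scheduled++;
}
```

With two of four engines hung, the error handler is scheduled a single time
and sees both flags in its pending mask.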

* Feature 2: Watchdog Timeout (a.k.a. "media engine reset") for gen8.

This feature allows userland applications to control whether or not individual
batch buffers should have a first-level, fine-grained, hardware-based hang
detection mechanism on top of the ordinary, software-based periodic hang
checker that is already in the driver. The advantage over relying solely on the
current software-based hang checker is that the watchdog timeout mechanism is
about 1000x quicker and more precise. Since it is not a full driver-level hang
detection mechanism but only targets one individual batch buffer at a time,
it can afford to be that quick without risking an increase in false positive
hang detections.

This feature includes the following changes:

a) IRQ service routine for handling watchdog timeout interrupts and connecting
these directly to the per-engine hang recovery path via the driver error
handler.

b) Injection of watchdog timer enablement/cancellation instructions
before/after the batch buffer start instruction in the ring buffer so that
watchdog timeout is connected to the submission of an individual batch buffer.

c) Extension of the DRM batch buffer interface, exposing the watchdog timeout
feature to userland.
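
To make item b) more concrete, the following user-space sketch shows the
bracketing pattern only. The command tokens are placeholders and NOT real gen8
opcodes, and the ring structure and function names are invented for
illustration; the actual emission code lives in the ring buffer patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder command tokens, NOT real gen8 opcodes. */
#define CMD_WATCHDOG_START  0x1001u
#define CMD_BB_START        0x2002u
#define CMD_WATCHDOG_CANCEL 0x3003u

/* Hypothetical, tiny stand-in for a ring buffer. */
struct ring {
	uint32_t cmds[16];
	size_t tail;
};

static void ring_emit(struct ring *r, uint32_t dword)
{
	assert(r->tail < sizeof(r->cmds) / sizeof(r->cmds[0]));
	r->cmds[r->tail++] = dword;
}

/*
 * Emit one batch buffer start. If the submission asked for watchdog
 * protection, start the timer right before the batch and cancel it right
 * after, so the timeout covers exactly this one batch buffer.
 */
static void emit_batch(struct ring *r, uint32_t batch_addr,
		       int use_watchdog, uint32_t threshold)
{
	if (use_watchdog) {
		ring_emit(r, CMD_WATCHDOG_START);
		ring_emit(r, threshold);
	}

	ring_emit(r, CMD_BB_START);
	ring_emit(r, batch_addr);

	if (use_watchdog)
		ring_emit(r, CMD_WATCHDOG_CANCEL);
}
```

A batch submitted without the watchdog flag produces only the batch buffer
start, so existing userland is unaffected in this model.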

There is currently full watchdog timeout support for gen7 in Android and it is
quite similar to the gen8 implementation so there is nothing obvious that
prevents us from upstreaming that code along with the gen8 code. However,
watchdog timeout is fully dependent on the per-engine hang recovery path and
that is not part of this code submission for gen7. Therefore watchdog timeout
support for gen7 has been excluded until per-engine hang recovery support for
gen7 has landed upstream.

As part of this submission we've had to reinstate the work queue that was
previously in place between the error handler and the hang recovery path. The
reason for this is that the per-engine recovery path is called directly from
the interrupt handler in the case of watchdog timeout. In that situation
there's no way of grabbing the struct_mutex, which is a requirement for the
hang recovery path. Therefore, by reinstating the work queue we provide a
unified execution context for the hang recovery code that allows the hang
recovery code to grab whatever locks it needs without sacrificing interrupt
latency too much or sleeping indefinitely in hard interrupt context.
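
The split between interrupt context and work queue context can be modelled
roughly as follows. This is a single-threaded user-space sketch with invented
names; the mutex_held flag merely stands in for struct_mutex, and the real
driver would use the kernel work queue API rather than a flag.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical shared state between the IRQ handler and the worker. */
struct recovery_ctx {
	uint32_t pending_engines;   /* engines flagged from "IRQ" context */
	int work_queued;            /* non-zero once the worker is scheduled */
	int mutex_held;             /* stands in for struct_mutex */
	uint32_t recovered_engines; /* engines the worker has recovered */
};

/* "IRQ" context: must not sleep, so only record the engine and queue work. */
static void watchdog_irq(struct recovery_ctx *c, int engine)
{
	assert(engine >= 0 && engine < 32);
	c->pending_engines |= UINT32_C(1) << engine;
	c->work_queued = 1;
}

/* "Process" context: free to take whatever locks recovery needs. */
static void recovery_work(struct recovery_ctx *c)
{
	uint32_t mask;

	if (!c->work_queued)
		return;
	c->work_queued = 0;

	/* Take a snapshot of the pending engines and clear them. */
	mask = c->pending_engines;
	c->pending_engines = 0;

	assert(!c->mutex_held);
	c->mutex_held = 1;            /* lock */
	c->recovered_engines |= mask; /* per-engine recovery would run here */
	c->mutex_held = 0;            /* unlock */
}
```

Two watchdog interrupts arriving before the worker runs are coalesced into a
single recovery pass in this model.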

* Feature 3: Context Submission Status Consistency (CSSC) checking

Something that becomes apparent when you run long-duration operations tests
with concurrent rendering processes and intermittently injected hangs is that
the GPU seems to fail to send context event interrupts to the driver under some
circumstances. This applies in particular to context completion and element
switch events, which would normally lead to the context event IRQ handler
removing completed contexts from the execlist queue and submitting new contexts
to the hardware, thus driving work submission forward. What this means is that
the driver sometimes gets stuck on a context that never seems to finish, all
the while the hardware has completed it and is waiting for more work.

Normally such hangs would be dealt with by the hang detection / hang recovery
functionality in the driver. The problem with this in a per-engine hang
recovery situation is that the per-engine hang recovery path relies on context
resubmission in order to kick off the hardware again following an engine reset.
This can only be done safely if the hardware and driver share the same opinion
about the current state. Therefore, unless steps are taken the per-engine hang
recovery path is ineffective in this type of situation.

One step that can be taken to deal with this kind of state inconsistency is to,
after having detected such an inconsistency at the start of the per-engine hang
recovery path, fake the presumed lost context event interrupt by manually
calling the context event IRQ handler from the per-engine hang recovery path.
The important thing to note here is that even if the hardware fails to send an
IRQ to the driver, that does not mean it also fails to store the proper context
state transition events in the CSB (context status buffer). These events would
still be there, waiting to be processed, in the event of a lost context event
IRQ. By faking the IRQ these events would be found and acted upon, and
hopefully TDR would then find the unprocessed context completion event that
causes the blocking context to be removed from the execlist queue, thus
allowing work submission to resume.
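
The check-and-fake step might be modelled along these lines. This is a
simplified user-space sketch and not the code in the patches: the structure
layout, the read/write counters and all names are assumptions, and
inconsistency is detected here simply as "the hardware has written CSB events
that the driver has not consumed".

```c
#include <assert.h>
#include <stdint.h>

#define CSB_ENTRIES 6 /* illustrative ring size */

struct csb_event {
	uint32_t ctx_id;
	int complete; /* non-zero for a context completion event */
};

/* Hypothetical model of one engine as seen by driver and hardware. */
struct engine_model {
	struct csb_event csb[CSB_ENTRIES];
	unsigned int hw_write;   /* events written by the hardware so far */
	unsigned int drv_read;   /* events consumed by the driver so far */
	uint32_t queue_head_ctx; /* context the driver thinks is running, 0 if none */
};

enum cssc_result { CSSC_CONSISTENT, CSSC_FIXED, CSSC_UNRESOLVED };

/* Stand-in for the context event IRQ handler: drain unprocessed events. */
static void context_event_handler(struct engine_model *e)
{
	while (e->drv_read != e->hw_write) {
		const struct csb_event *ev = &e->csb[e->drv_read % CSB_ENTRIES];

		if (ev->complete && ev->ctx_id == e->queue_head_ctx)
			e->queue_head_ctx = 0; /* retire the blocking context */
		e->drv_read++;
	}
}

/* Run at the start of per-engine recovery. */
static enum cssc_result cssc_check_and_fix(struct engine_model *e)
{
	assert(e->hw_write - e->drv_read <= CSB_ENTRIES);

	if (e->drv_read == e->hw_write)
		return CSSC_CONSISTENT; /* nothing pending: treat as a real hang */

	context_event_handler(e); /* fake the presumed lost IRQ */
	return e->queue_head_ctx == 0 ? CSSC_FIXED : CSSC_UNRESOLVED;
}
```

In this model, a lost completion event for the head context is picked up by
the faked handler call and the queue is unblocked; if the head context is
still present afterwards, the caller would fall back to ordinary recovery.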

In order to facilitate validation of this feature, test infrastructure embedded
in the context event IRQ handler has been included for inducing state
inconsistencies by faking the loss of IRQs. By deliberately dropping IRQs the
hardware goes into one state while the driver is held back waiting for the
state transition to happen. This then causes the context submission state
consistency checker to kick in, detect the inconsistency and attempt to rectify
it. Failing to do so promotes the hang to a legacy full GPU reset, like any
other per-engine hang recovery failure.
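
The one promotion rule stated above (a failed per-engine recovery falls back
to full GPU reset) can be sketched as follows. The GPU model, the result enum
and the function names are hypothetical; the full rule set in the patches may
contain further conditions that are not modelled here.

```c
#include <assert.h>
#include <stdint.h>

enum recovery_result { RECOVERED_PER_ENGINE, PROMOTED_TO_FULL_RESET };

/* Hypothetical GPU model used only to exercise the fallback logic. */
struct gpu_model {
	uint32_t unresettable; /* engines whose per-engine reset fails */
	int engine_resets;     /* successful per-engine resets */
	int full_resets;       /* full GPU resets performed */
};

/* Returns 0 on success, non-zero if the engine could not be recovered. */
static int engine_reset(struct gpu_model *g, int engine)
{
	if (g->unresettable & (UINT32_C(1) << engine))
		return -1;
	g->engine_resets++;
	return 0;
}

/* Try per-engine recovery for each hung engine; promote on first failure. */
static enum recovery_result recover_engines(struct gpu_model *g, uint32_t hung_mask)
{
	int i;

	for (i = 0; i < 32; i++) {
		if (!(hung_mask & (UINT32_C(1) << i)))
			continue;
		if (engine_reset(g, i) != 0) {
			g->full_resets++; /* legacy full GPU reset path */
			return PROMOTED_TO_FULL_RESET;
		}
	}
	return RECOVERED_PER_ENGINE;
}
```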

OPEN QUESTIONS:

	1. For the future, do we want to investigate the possibility of
	concurrent per-engine hang detection? In the current upstream driver
	there is only one work queue that handles the hang checker and
	everything from initial hang detection to final hang recovery runs in
	this thread. This makes sense if you're only supporting one form of
	hang recovery - using full GPU reset and nothing tied to any particular
	engine. However, as part of this patch series we're changing that by
	introducing per-engine hang recovery. It could make sense to introduce
	multiple work queues - one per engine - to run multiple hang checking
	threads in parallel.

	This would potentially allow savings in terms of recovery latency since
	we don't have to scan all engines every time the hang checker is
	scheduled and the error handler does not have to scan all engines every
	time it is scheduled. Instead, we could implement one work queue per
	engine that would invoke the hang checker that only checks _that_
	particular engine and then the error handler is invoked for _that_
	particular engine. If one engine has hung the latency for getting to
	the hang recovery path for that particular engine would be (Time For
	Hang Checking One Engine) + (Time For Error Handling One Engine) rather
	than the time it takes to do hang checking for all engines + the time
	it takes to do error handling for all engines that have been detected
	as hung (which in the worst case would be all engines). There would
	potentially be as many hang checker and error handling threads going on
	concurrently as there are engines in the hardware but they would all be
	running in parallel without any significant locking. The first time
	where any thread needs exclusive access to the driver is at the point
	of the actual hang recovery but the time it takes to get there would
	theoretically be lower and the time it actually takes to do per-engine
	hang recovery is quite a lot lower than the time it takes to actually
	detect a hang reliably.

	How much we would save by such a change still needs to be analysed and
	compared against the current single-thread model but it makes sense
	from a theoretical design point of view.
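
	The comparison in question 1 can be written down as a small cost
	model. This is a sketch under a strong simplifying assumption that
	is not in the text above: checking and error handling are taken to
	cost the same for every engine. The function names and any numbers
	fed into them are made up and are not measurements.

```c
#include <assert.h>

/*
 * Single work queue: the hang checker scans every engine, then the error
 * handler deals with every engine that was detected as hung.
 */
static unsigned int latency_single_queue_us(unsigned int n_engines,
					    unsigned int n_hung,
					    unsigned int t_check_us,
					    unsigned int t_handle_us)
{
	assert(n_hung <= n_engines);
	return n_engines * t_check_us + n_hung * t_handle_us;
}

/*
 * One work queue per engine: only the hung engine's own check and error
 * handling sit on the path to its recovery.
 */
static unsigned int latency_per_engine_queue_us(unsigned int t_check_us,
						unsigned int t_handle_us)
{
	return t_check_us + t_handle_us;
}
```

	In the worst case of this model, with all engines hung, the single
	queue cost grows linearly with the engine count while the per-engine
	cost stays constant. Scheduling and locking overhead is ignored.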

	2. In Android it has become apparent that if hang recovery takes too
	long there is a distinct risk that userland will fail in such a way
	that the display might freeze until reboot (SurfaceFlinger/HWC fails to
	flip the display for a long enough time while the hang is being
	handled, which leads to it giving up any further attempts). In such
	time-constrained scenarios it might be a problem to wait until the
	start of the engine recovery path to rectify any context submission
	state inconsistencies since it would typically take around 3 seconds to
	get there assuming a default hang check sample period of 1 second. In
	Android it turned out that we needed rectification to happen after
	about 2 seconds, which is considerably less time than what it normally
	takes to recover from hangs. This might become a problem if this TDR
	implementation were ever to be ported to Android or other similar
	time-constrained environments. Issues like this have yet to show up in X
	or elsewhere and it is an open question how this TDR implementation
	would behave were it ever to be ported back to Android.

	Worth keeping in mind for the future!

GITHUB REPOSITORY:
https://github.com/telf/TDR_patch_series_1.git

Tim Gore (1):
  drm/i915: drm/i915 changes to simulated hangs

Tomas Elf (19):
  drm/i915: Make i915_gem_reset_ring_status() public
  drm/i915: Generalise common GPU engine reset request/unrequest code
  drm/i915: TDR / per-engine hang recovery support for gen8.
  drm/i915: TDR / per-engine hang detection
  drm/i915: Extending i915_gem_check_wedge to check engine reset in
    progress
  drm/i915: Reinstate hang recovery work queue.
  drm/i915: Watchdog timeout: Hang detection integration into error
    handler
  drm/i915: Watchdog timeout: IRQ handler for gen8
  drm/i915: Watchdog timeout: Ringbuffer command emission for gen8
  drm/i915: Watchdog timeout: DRM kernel interface enablement
  drm/i915: Fake lost context event interrupts through forced CSB
    checking.
  drm/i915: Debugfs interface for per-engine hang recovery.
  drm/i915: Test infrastructure for context state inconsistency
    simulation
  drm/i915: TDR/watchdog trace points.
  drm/i915: Port of Added scheduler support to __wait_request() calls
  drm/i915: Fix __i915_wait_request() behaviour during hang detection.
  drm/i915: Extended error state with TDR count, watchdog count and
    engine reset count
  drm/i915: TDR / per-engine hang recovery kernel docs
  drm/i915: Enable TDR / per-engine hang recovery

 Documentation/DocBook/gpu.tmpl          | 476 ++++++++++++++++++
 drivers/gpu/drm/i915/i915_debugfs.c     | 163 +++++-
 drivers/gpu/drm/i915/i915_dma.c         |  80 +++
 drivers/gpu/drm/i915/i915_drv.c         | 328 ++++++++++++
 drivers/gpu/drm/i915/i915_drv.h         |  92 +++-
 drivers/gpu/drm/i915/i915_gem.c         | 152 +++++-
 drivers/gpu/drm/i915/i915_gpu_error.c   |   8 +-
 drivers/gpu/drm/i915/i915_irq.c         | 286 +++++++++--
 drivers/gpu/drm/i915/i915_params.c      |  19 +
 drivers/gpu/drm/i915/i915_reg.h         |   9 +
 drivers/gpu/drm/i915/i915_trace.h       | 354 ++++++++++++-
 drivers/gpu/drm/i915/intel_display.c    |   3 +-
 drivers/gpu/drm/i915/intel_lrc.c        | 857 +++++++++++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_lrc.h        |  16 +-
 drivers/gpu/drm/i915/intel_lrc_tdr.h    |  39 ++
 drivers/gpu/drm/i915/intel_ringbuffer.c |  90 +++-
 drivers/gpu/drm/i915/intel_ringbuffer.h |  95 ++++
 drivers/gpu/drm/i915/intel_uncore.c     | 197 +++++++-
 include/uapi/drm/i915_drm.h             |   5 +-
 19 files changed, 3144 insertions(+), 125 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/intel_lrc_tdr.h

-- 
1.9.1