[Intel-gfx] [RFC 1/3] drm/i915: Watchdog timeout: IRQ handler for gen8+
Chris Wilson
chris at chris-wilson.co.uk
Thu Feb 23 20:57:54 UTC 2017
On Thu, Feb 23, 2017 at 11:44:17AM -0800, Michel Thierry wrote:
> *** General ***
>
> Watchdog timeout (or "media engine reset") is a feature that allows
> userland applications to enable hang detection on individual batch buffers.
> The detection mechanism itself is mostly bound to the hardware and the only
> thing that the driver needs to do to support this form of hang detection
> is to implement the interrupt handling support as well as watchdog command
> emission before and after the emitted batch buffer start instruction in the
> ring buffer.
>
> The principle of the hang detection mechanism is as follows:
>
> 1. Once the decision has been made to enable watchdog timeout for a
> particular batch buffer and the driver is in the process of emitting the
> batch buffer start instruction into the ring buffer it also emits a
> watchdog timer start instruction before and a watchdog timer cancellation
> instruction after the batch buffer start instruction in the ring buffer.
>
> 2. Once the GPU execution reaches the watchdog timer start instruction
> the hardware watchdog counter is started by the hardware. The counter
> keeps counting until either reaching a previously configured threshold
> value or the timer cancellation instruction is executed.
>
> 2a. If the counter reaches the threshold value the hardware fires a
> watchdog interrupt that is picked up by the watchdog interrupt handler.
> This means that a hang has been detected and the driver needs to deal with
> it the same way it would deal with an engine hang detected by the periodic
> hang checker. The only difference between the two is that we already blamed
> the active request (to ensure an engine reset).
>
> 2b. If the batch buffer completes and the execution reaches the watchdog
> cancellation instruction before the watchdog counter reaches its
> threshold value the watchdog is cancelled and nothing more comes of it.
> No hang is detected.
>
> Note about future interaction with preemption: Preemption could happen
> in a command sequence prior to watchdog counter getting disabled,
> resulting in watchdog being triggered following preemption. The driver will
> need to explicitly disable the watchdog counter as part of the
> preemption sequence.
>
> *** This patch introduces: ***
>
> 1. IRQ handler code for watchdog timeout allowing direct hang recovery
> based on hardware-driven hang detection, which then integrates directly
> with the hang recovery path. This is independent of having per-engine reset
> or just full gpu reset.
>
> 2. Watchdog specific register information.
>
> Currently the render engine and all available media engines support
> watchdog timeout (VECS is only supported in GEN9). The specifications allude
> to the BCS engine being supported but that is currently not supported by
> this commit.
>
> Note that the value to stop the counter is different between render and
> non-render engines.
>
> Signed-off-by: Tomas Elf <tomas.elf at intel.com>
> Signed-off-by: Ian Lister <ian.lister at intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery at linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry at intel.com>
> ---
> drivers/gpu/drm/i915/i915_drv.h | 4 ++++
> drivers/gpu/drm/i915/i915_irq.c | 31 ++++++++++++++++++++++++++++++-
> drivers/gpu/drm/i915/i915_reg.h | 6 ++++++
> drivers/gpu/drm/i915/intel_hangcheck.c | 13 +++++++++----
> drivers/gpu/drm/i915/intel_lrc.c | 16 ++++++++++++++++
> 5 files changed, 65 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index eed9ead1b592..0e4f4cc3c6de 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1568,6 +1568,9 @@ struct i915_gpu_error {
> * recovery. All waiters on the reset_queue will be woken when
> * that happens.
> *
> + * When hw detects a hang before us, we can use I915_RESET_WATCHDOG to
> + * report the hang detection cause accurately.
> + *
> * This counter is used by the wait_seqno code to notice that reset
> * event happened and it needs to restart the entire ioctl (since most
> * likely the seqno it waited for won't ever signal anytime soon).
> @@ -1580,6 +1583,7 @@ struct i915_gpu_error {
>
> unsigned long flags;
> #define I915_RESET_IN_PROGRESS 0
> +#define I915_RESET_WATCHDOG 2 /* looking at the future */
> #define I915_WEDGED (BITS_PER_LONG - 1)
>
> /**
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index bc70e2c451b2..4ef73363bbe9 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1352,6 +1352,28 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
> set_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
> tasklet_hi_schedule(&engine->irq_tasklet);
> }
> +
> + if (iir & (GT_GEN8_WATCHDOG_INTERRUPT << test_shift)) {
> + struct drm_i915_private *dev_priv = engine->i915;
> + u32 watchdog_disable;
> +
> + if (engine->id == RCS)
> + watchdog_disable = GEN8_RCS_WATCHDOG_DISABLE;
> + else
> + watchdog_disable = GEN8_XCS_WATCHDOG_DISABLE;
> +
> + /* Stop the counter to prevent further timeout interrupts */
> + I915_WRITE_FW(RING_CNTR(engine->mmio_base), watchdog_disable);
There's no guarantee you hold forcewake here, so you need to use
I915_WRITE. Better yet would be to avoid having to wait for forcewake
within the hardirq handler.
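One way to read that suggestion, as a rough userspace sketch and not driver
code: the hardirq handler only records that the counter needs stopping, and
the register write happens later from a context that is allowed to wait for
forcewake. Every name below (model_engine, model_irq_handler,
model_deferred_work) is hypothetical, and deferring the write at all is an
assumption about how the wait could be avoided.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names are i915 symbols. */
struct model_engine {
	bool forcewake_held;        /* models holding the forcewake reference */
	bool watchdog_stop_pending; /* set in "hardirq", consumed later */
	uint32_t cntr_reg;          /* models the watchdog counter control register */
};

/* Models the hardirq path: no register access, so no forcewake wait. */
static void model_irq_handler(struct model_engine *e)
{
	e->watchdog_stop_pending = true;
}

/* Models a deferred context that is allowed to wait for forcewake. */
static void model_deferred_work(struct model_engine *e, uint32_t disable_value)
{
	if (!e->watchdog_stop_pending)
		return;

	e->forcewake_held = true;     /* acquire; may sleep or spin in real code */
	assert(e->forcewake_held);    /* the write is only valid with forcewake */
	e->cntr_reg = disable_value;  /* stop the counter */
	e->forcewake_held = false;    /* release */

	e->watchdog_stop_pending = false;
}
```

The obvious cost of deferring is that the counter is not stopped
immediately, so a real implementation would have to tolerate or mask
further timeout interrupts in the meantime.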
> +
> + /* Make sure the active request will be marked as guilty */
> + engine->hangcheck.stalled = true;
> + engine->hangcheck.seqno = intel_engine_get_seqno(engine);
Just set a flag, e.g. engine->hangcheck.watchdog = true. Don't
confuse us: engine->hangcheck.seqno does not give the guilty seqno!
Also, there is no guarantee here that this seqno is the guilty party.
That's a nasty bug. The interrupt is serviced in parallel with the GPU,
which may complete the request before we read the HWS.
Please tell me we can use a PID along with the watchdog timer...
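To make the race concrete, here is a small userspace model, not driver
code. It assumes the HWS value is simply "the last seqno the GPU completed"
and that the GPU can advance it at any point while the handler runs. All
names (wd_model, gpu_complete_request, irq_sample_seqno, irq_set_flag) are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names are i915 symbols. */
struct wd_model {
	uint32_t hws_seqno;  /* last seqno the GPU reported as completed */
	bool watchdog_fired; /* models the suggested hangcheck.watchdog flag */
};

/* The GPU keeps running while the interrupt is being serviced. */
static void gpu_complete_request(struct wd_model *m)
{
	m->hws_seqno++;
}

/* Patch-style handler: samples the HWS seqno at interrupt time. */
static uint32_t irq_sample_seqno(const struct wd_model *m)
{
	return m->hws_seqno;
}

/* Flag-only handler: records that the watchdog fired, nothing else. */
static void irq_set_flag(struct wd_model *m)
{
	m->watchdog_fired = true;
}
```

If the request completes between the watchdog firing and the handler
reading the HWS, irq_sample_seqno() returns a different value than it would
have a moment earlier, so the sampled seqno cannot identify the request that
tripped the timer. The flag-only handler does not have that problem, but it
also does not say which request was responsible, hence the question about
an identifier delivered with the watchdog.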
-Chris
--
Chris Wilson, Intel Open Source Technology Centre