[Intel-gfx] [RFC 00/13] TDR and Watchdog Reset

Lister, Ian ian.lister at intel.com
Mon Dec 16 17:02:20 CET 2013


This patchset contains the TDR and Watchdog Reset work, based on a 3.13
drm-intel-nightly tree that is now about two weeks old.

I have re-worked the TDR and Watchdog Reset features to integrate
them more closely with the existing TDR and scoring mechanism.

This is still a work in progress, and I am currently debugging a couple
of issues, but I would like to get some early feedback.

Thanks,
Ian.

From f71a7de85e9d81be3aa3962c8fe2557235ff21c1 Mon Sep 17 00:00:00 2001
Message-Id: <cover.1387201899.git.ian.lister at intel.com>
From: ian-lister <ian.lister at intel.com>
Date: Mon, 16 Dec 2013 13:51:39 +0000
Subject: [RFC 00/13] TDR and Watchdog Reset

This patchset adds support for per-engine timeout detection and recovery
and adds batch specific watchdog reset.

Per-ring TDR
The detection logic has been modified to detect hangs on individual
engines and pass this information through to the recovery handler.
Rather than a global reset it will attempt a per-engine reset. The
registers associated with the ring are saved and restored so that
when the ring restarts it continues from the next instruction in the
ring. For example, if it was executing an MI_BATCH_BUFFER_START command
it will advance to the next instruction which is likely to be the
mailbox updates and user interrupt. This means that no extra effort
is required to deal with synchronisation. From the perspective of the
driver it looks like the batch buffer completed normally, as all the
normal signalling will take place; however, the context stats will
have been updated to flag the guilty context.
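
The save/reset/restore sequence described above can be sketched in
plain user-space C. This is only an illustration under stated
assumptions, not the code from the patches: struct ring_state,
ring_advance() and engine_reset_and_resume() are hypothetical names,
the ring is modelled as byte offsets into a power-of-two sized buffer,
and the length of the hung instruction is passed in by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of one engine's ring registers. */
struct ring_state {
	uint32_t head;         /* offset of the instruction being executed */
	uint32_t tail;         /* offset one past the last queued instruction */
	uint32_t size;         /* ring size in bytes, must be a power of two */
	bool hung;             /* set by hang detection */
	unsigned guilty_count; /* stand-in for the per-context hang stats */
};

/* Advance an offset by len bytes, wrapping at the end of the ring. */
static uint32_t ring_advance(const struct ring_state *r,
			     uint32_t offset, uint32_t len)
{
	assert(r->size != 0 && (r->size & (r->size - 1)) == 0);
	return (offset + len) & (r->size - 1);
}

/*
 * Model of a per-engine reset: save the ring registers, let the
 * (simulated) reset clear them, then restore them with the head moved
 * past the hung instruction so execution resumes at the next one.
 */
static void engine_reset_and_resume(struct ring_state *r,
				    uint32_t hung_cmd_len)
{
	struct ring_state saved = *r;   /* save */

	/* Simulated engine reset: register state is lost. */
	r->head = 0;
	r->tail = 0;
	r->hung = false;

	/* Restore, skipping the instruction that hung. */
	r->head = ring_advance(&saved, saved.head, hung_cmd_len);
	r->tail = saved.tail;

	/* Record the hang so the guilty context can be flagged. */
	r->guilty_count = saved.guilty_count + 1;
}
```

In this model, whatever follows the hung instruction in the ring (the
mailbox updates and user interrupt in the example above) still runs
after the reset, which is why no extra synchronisation step appears.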

Watchdog Reset
This is requested via flags to the batch buffer submission IOCTL.
It is currently only supported for the render and video rings.
The batch buffer command is surrounded by a hardware timer start
command and stop command. If the batch completes before the timer
expires, the timer is cancelled and no interrupt is generated,
so everything continues normally. However, if the batch hangs, the
timer expires and generates an interrupt, which triggers an engine
reset. This feature requires per-ring TDR to do the recovery work.
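
As a rough illustration of the bracketing described above, the sketch
below emits a timer start, the batch start and a timer stop into a
small command array, then models whether the timer would fire. It is
a user-space model only, not the code from the patches: the command
names, emit_batch() and watchdog_fires() are hypothetical, and batch
run time and the timer threshold are expressed in arbitrary ticks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command tokens; these are not real hardware opcodes. */
enum cmd {
	CMD_WATCHDOG_START,
	CMD_BATCH_START,
	CMD_WATCHDOG_STOP,
};

/*
 * Write the commands for one batch into out[]. When want_watchdog is
 * set (standing in for the submission flag), the batch start is
 * surrounded by timer start and stop commands. Returns the number of
 * commands written, or 0 if out[] is too small.
 */
static size_t emit_batch(enum cmd *out, size_t cap, bool want_watchdog)
{
	size_t need = want_watchdog ? 3 : 1;
	size_t n = 0;

	if (cap < need)
		return 0;

	if (want_watchdog)
		out[n++] = CMD_WATCHDOG_START;
	out[n++] = CMD_BATCH_START;
	if (want_watchdog)
		out[n++] = CMD_WATCHDOG_STOP;

	return n;
}

/*
 * Walk the command stream and report whether the timer would expire.
 * The batch is assumed to run for batch_ticks; a hung batch can be
 * modelled with UINT32_MAX. The timer only counts while armed.
 */
static bool watchdog_fires(const enum cmd *cmds, size_t n,
			   uint32_t batch_ticks, uint32_t threshold)
{
	bool armed = false;

	for (size_t i = 0; i < n; i++) {
		switch (cmds[i]) {
		case CMD_WATCHDOG_START:
			armed = true;
			break;
		case CMD_BATCH_START:
			if (armed && batch_ticks >= threshold)
				return true;   /* interrupt, engine reset */
			break;
		case CMD_WATCHDOG_STOP:
			armed = false;         /* timer cancelled */
			break;
		}
	}
	return false;
}
```

A batch submitted without the flag never arms the timer in this model,
so a hang there would be left to the periodic hang detection instead.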

ian-lister (13):
  drm/i915: Periodic sampling for hang detection
  drm/i915: Improved hang detection logic
  drm/i915: Additional ring operations for TDR
  drm/i915: Force wake restore for TDR
  drm/i915: Per-engine recovery
  drm/i915: Communicating reset requests
  drm/i915: Additional debug for TDR
  drm/i915: TDR loose ends
  drm/i915: Watchdog timer support functions
  drm/i915: MI_LOAD_REGISTER_IMM fix
  drm/i915: Added watchdog interrupt handling
  drm/i915: Enabled watchdog timer interrupts
  drm/i915: Exec buffer inserts watchdog commands

 drivers/gpu/drm/i915/i915_debugfs.c        |  67 ++++
 drivers/gpu/drm/i915/i915_dma.c            |   3 +
 drivers/gpu/drm/i915/i915_drv.c            |  46 +++
 drivers/gpu/drm/i915/i915_drv.h            |  41 ++-
 drivers/gpu/drm/i915/i915_gem.c            |  26 +-
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  30 +-
 drivers/gpu/drm/i915/i915_irq.c            | 531 ++++++++++++++++++---------
 drivers/gpu/drm/i915/i915_reg.h            |  21 ++
 drivers/gpu/drm/i915/intel_display.c       |  30 +-
 drivers/gpu/drm/i915/intel_ringbuffer.c    | 557 +++++++++++++++++++++++++++--
 drivers/gpu/drm/i915/intel_ringbuffer.h    |  53 +++
 drivers/gpu/drm/i915/intel_uncore.c        | 373 ++++++++++++++++++-
 include/drm/drmP.h                         |   7 +
 13 files changed, 1575 insertions(+), 210 deletions(-)

-- 
1.8.5.1



