[Intel-gfx] [PATCH 0/8] Stability improvements to error state capture

Thu Oct 8 11:31:32 PDT 2015

In preparation for the upcoming enablement of TDR per-engine hang recovery, the
stability of the error state capture code needs to be addressed. The main
reason is that testing TDR requires a long-duration test that runs for several
hours, during which a large number of hangs are handled together with the
associated error state captures. In its current state the i915 driver
experiences various kernel panics and other fatal errors within the first
hour(s) of hang testing. The patches in this series have been tested with
long-duration hang testing clocking in at 12+ hours and should suffice as an
initial improvement.

The underlying issue of capturing the driver state without synchronization
remains to be fixed. One way of at least alleviating it, suggested by John
Harrison, is to retry mutex_trylock() on the struct_mutex for a while (give it
a second or so) before going into the error state capture from
i915_handle_error(). Then, if nobody is holding the struct_mutex, the error
state capture is considerably safer from sudden state changes. If some thread
has hung while holding the struct_mutex, one could at least hope that the hung
state prevents sudden state changes during error state capture (unless the
thread is caught in a livelock, or is not stuck at all but is simply running
for a very long time; some improvement might still be expected here).
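
To make the suggestion above concrete, here is a minimal userspace sketch of
"try the lock for a bounded time, then capture regardless". It is not driver
code and not part of this series: it uses pthreads as a stand-in for the
kernel mutex API, and the names trylock_for() and capture_error_state() as
well as the 1 ms polling interval are assumptions made purely for
illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical helper: poll pthread_mutex_trylock() until it succeeds or
 * timeout_ms has elapsed. Returns true if the lock was taken.
 */
static bool trylock_for(pthread_mutex_t *lock, long timeout_ms)
{
	const struct timespec pause = { .tv_sec = 0, .tv_nsec = 1000000 }; /* 1 ms */
	struct timespec start, now;

	assert(lock && timeout_ms >= 0);
	clock_gettime(CLOCK_MONOTONIC, &start);

	for (;;) {
		/* pthread convention: 0 means the lock was acquired */
		if (pthread_mutex_trylock(lock) == 0)
			return true;

		clock_gettime(CLOCK_MONOTONIC, &now);
		long elapsed_ms = (now.tv_sec - start.tv_sec) * 1000 +
				  (now.tv_nsec - start.tv_nsec) / 1000000;
		if (elapsed_ms >= timeout_ms)
			return false;

		nanosleep(&pause, NULL);
	}
}

/*
 * Stand-in for the capture path: the capture runs whether or not the lock
 * was obtained; holding the lock only makes it safer. Returns whether the
 * capture ran with the lock held.
 */
static bool capture_error_state(pthread_mutex_t *struct_mutex, long timeout_ms)
{
	bool locked = trylock_for(struct_mutex, timeout_ms);

	/* ... the actual state capture would go here ... */

	if (locked)
		pthread_mutex_unlock(struct_mutex);
	return locked;
}
```

Note that the return conventions differ: the kernel's mutex_trylock() returns
1 on success, whereas pthread_mutex_trylock() returns 0. Whether polling with
a sleep is acceptable in the context i915_handle_error() runs in would have to
be checked separately.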

One fix that has been omitted from this patch series concerns the
broken ring space calculation following a full GPU reset. Two independent
patches to solve this are: "[PATCH] drm/i915: Update ring space correctly on
lrc context reset" by Mika Kuoppala and "[51/70] drm/i915: Record the position
of the start of the request" by Chris Wilson. Since the solution is currently
in review I'll simply mention it here as a prerequisite for long-duration
operations stability testing. Without a fix for this problem the ring space is
terminally depleted within the first iterations of the hang test, because the
ring space is miscalculated after every GPU hang recovery and traversal of the
GEM init hw path, gradually leading to a terminally hung state.
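
As a rough illustration of why a miscalculation like this depletes the ring,
the sketch below uses the usual circular-buffer free-space formula. It is not
taken from either of the patches mentioned above, and the exact nature of the
bug is not described here: the function name ring_space(), the 64-byte reserve
and the example values are all assumptions.

```c
#include <assert.h>

/* Assumed reserve kept free so the tail never catches up with the head. */
#define RING_FREE_SPACE 64

/*
 * Free space in a circular ring of 'size' bytes, where 'head' is the
 * consumer (GPU) offset and 'tail' is the producer (driver) offset.
 */
static int ring_space(int head, int tail, int size)
{
	int space = head - tail;

	assert(size > 0);
	if (space <= 0)
		space += size;	/* empty ring or wrapped around */
	return space - RING_FREE_SPACE;
}
```

With a 4096-byte ring, consistent offsets after a reset (head == tail == 0)
give ring_space(0, 0, 4096) == 4032. If one offset is reset while the other
keeps a stale pre-reset value, e.g. ring_space(0, 4000, 4096), only 32 bytes
appear to be free, and repeated resets can leave the ring looking permanently
full.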

Tomas Elf (8):
  drm/i915: Early exit from semaphore_waits_for for execlist mode.
  drm/i915: Migrate to safe iterators in error state capture
  drm/i915: Cope with request list state change during error state
    capture
  drm/i915: NULL checking when capturing buffer objects during error
    state capture
  drm/i915: vma NULL pointer check
  drm/i915: Use safe list iterators
  drm/i915: Grab execlist spinlock to avoid post-reset concurrency
    issues.
  drm/i915: NULL check of unpin_work

 drivers/gpu/drm/i915/i915_gem.c       | 18 ++++++++---
 drivers/gpu/drm/i915/i915_gpu_error.c | 61 +++++++++++++++++++++++------------
 drivers/gpu/drm/i915/i915_irq.c       | 20 ++++++++++++
 drivers/gpu/drm/i915/intel_display.c  |  5 +++
 4 files changed, 80 insertions(+), 24 deletions(-)

-- 
1.9.1