[Intel-gfx] [PATCH 0/5] robustify reset state transitions

Mon Nov 12 23:07:48 CET 2012

Hi all,

So I've noticed again that the hangman test was failing on some machines here,
and tracked it down to the new lockless wait code. Closer inspection showed that
we've relied on the single dev->struct_mutex ordering things correctly between
waiters and the reset code. But with that lock grabbing gone, the entire reset
could happen before the waiter wakes up and hence the waiter never sees a
non-zero wedged value. Which means it'll go right back to sleep, waiting for a
seqno which just got cleared out by the reset code.
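
To illustrate the race described above, here is a minimal userspace sketch in
C, not code from the patches. It assumes a hypothetical struct gpu_error with a
wedged flag plus a reset counter; the counter scheme and all function names are
made up for illustration and may well differ from what the series does. The
point is only that a flag which is set and cleared again can be missed by a
late waiter, while a counter sampled before sleeping cannot.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver's error state, NOT the real i915 struct. */
struct gpu_error {
	atomic_int wedged;          /* non-zero only while a reset is pending */
	atomic_uint reset_counter;  /* assumption: bumped at reset start and end */
};

/* Reset side: mark the GPU as wedged and bump the counter (now odd). */
static inline void reset_begin(struct gpu_error *e)
{
	atomic_store(&e->wedged, 1);
	atomic_fetch_add(&e->reset_counter, 1);
}

/* Reset side: the reset is done, clear the flag and bump again (now even). */
static inline void reset_end(struct gpu_error *e)
{
	atomic_store(&e->wedged, 0);
	atomic_fetch_add(&e->reset_counter, 1);
}

/* Racy check: a waiter that wakes up after reset_end() sees wedged == 0. */
static inline bool flag_sees_reset(struct gpu_error *e)
{
	return atomic_load(&e->wedged) != 0;
}

/* Waiter side: sample the counter before going to sleep. */
static inline unsigned int wait_prepare(struct gpu_error *e)
{
	return atomic_load(&e->reset_counter);
}

/*
 * Waiter side, after wakeup: the seqno can no longer be trusted if a reset
 * was already running when we sampled (odd value) or if the counter moved
 * since then, no matter whether the reset has completed in the meantime.
 */
static inline bool wait_must_abort(struct gpu_error *e, unsigned int sampled)
{
	return (sampled & 1) || atomic_load(&e->reset_counter) != sampled;
}
```

A waiter would call wait_prepare() before sleeping and wait_must_abort() on
every wakeup, bailing out instead of waiting for the old seqno again.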

Looking at the code I've declared the entire thing too ad-hoc and revamped it,
adding comments explaining what's going on all over the place and auditing for
tiny races everywhere. Hopefully I've caught them all; at least the machines
that previously hung after reset are now happily going through a few hundred
reset cycles!

Comments, flames and especially review highly welcome.

For fun (hey, let me have it!) I've thrown in some "let's move stuff around a
bit" patches at the beginning ;-)

Cheers, Daniel

Daniel Vetter (5):
  drm/i915: move dev_priv->mm out of line
  drm/i915: extract hangcheck/reset/error_state state into substruct
  drm/i915: move wedged to the other gpu error handling stuff
  drm/i915: clear up wedged transitions
  drm/i915: create a race-free reset detection

 drivers/gpu/drm/i915/i915_debugfs.c     |  12 +-
 drivers/gpu/drm/i915/i915_dma.c         |   9 +-
 drivers/gpu/drm/i915/i915_drv.c         |   8 +-
 drivers/gpu/drm/i915/i915_drv.h         | 274 ++++++++++++++++++--------------
 drivers/gpu/drm/i915/i915_gem.c         | 110 +++++++------
 drivers/gpu/drm/i915/i915_irq.c         |  89 +++++++----
 drivers/gpu/drm/i915/intel_display.c    |   4 +-
 drivers/gpu/drm/i915/intel_ringbuffer.c |   8 +-
 8 files changed, 297 insertions(+), 217 deletions(-)

-- 
1.7.11.4