[Intel-gfx] [PATCH] drm/i915: fix wait_for_pending_flips vs gpu hang deadlock
Daniel Vetter
daniel.vetter at ffwll.ch
Sun Sep 8 21:57:13 CEST 2013
My g33 here seems to be shockingly good at hitting them all. This time
around kms_flip/flip-vs-panning-vs-hang blows up:
intel_crtc_wait_for_pending_flips correctly checks for gpu hangs and
if a gpu hang is pending aborts the wait for outstanding flips so that
the setcrtc call will succeed and release the crtc mutex. And the gpu
hang handler needs that lock in intel_display_handle_reset to be able
to complete outstanding flips.
The problem is that we can race in two ways:
- Waiters on the dev_priv->pending_flip_queue aren't woken up after
we've the reset as pending, but before we actually start the reset
work. This means that the waiter doesn't notice the pending reset
and hence will keep on hogging the locks.
Like with dev->struct_mutex and the ring->irq_queue wait queues we
there need to wake up everyone that potentially holds a lock which
the reset handler needs.
- intel_display_handle_reset was called _after_ we've already
signalled the completion of the reset work. Which means a waiter
could sneak in, grab the lock and never release it (since the
pageflips won't ever get released).
Similar to resetting the gem state all the reset work must complete
before we update the reset counter. Contrary to the gem reset we
don't need to have a second explicit wake up call since that will
have happened already when completing the pageflips. We also don't
have any issues that the completion happens while the reset state is
still pending - wait_for_pending_flips is only there to ensure we
display the right frame. After a gpu hang&reset events such
guarantees are out the window anyway. This is in contrast to the gem
code where too-early wake-up would result in unnecessary restarting
of ioctls.
Also, since we've gotten these various deadlocks and ordering
constraints wrong so often throw copious amounts of comments at the
code.
This deadlock regression has been introduced in the commit which added
the pageflip reset logic to the gpu hang work:
commit 96a02917a0131e52efefde49c2784c0421d6c439
Author: Ville Syrjälä <ville.syrjala at linux.intel.com>
Date: Mon Feb 18 19:08:49 2013 +0200
drm/i915: Finish page flips and update primary planes after a GPU reset
v2:
- Add comments to explain how the wake_up serves as memory barriers
for the atomic_t reset counter.
- Improve the comments a bit as suggested by Chris Wilson.
- Extract the wake_up calls before/after the reset into a little
i915_error_wake_up and unconditionally wake up the
pending_flip_queue waiters, again as suggested by Chris Wilson.
v3: Throw copious amounts of comments at i915_error_wake_up as
suggested by Chris Wilson.
Cc: stable at vger.kernel.org
Cc: Ville Syrjälä <ville.syrjala at linux.intel.com>
Cc: Chris Wilson <chris at chris-wilson.co.uk>
Reviewed-by: Chris Wilson <chris at chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter at ffwll.ch>
---
drivers/gpu/drm/i915/i915_irq.c | 68 ++++++++++++++++++++++++++++++++---------
1 file changed, 54 insertions(+), 14 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 83cce0c..4b91228 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1469,6 +1469,34 @@ static irqreturn_t ironlake_irq_handler(int irq, void *arg)
return ret;
}
+static void i915_error_wake_up(struct drm_i915_private *dev_priv,
+ bool reset_completed)
+{
+ struct intel_ring_buffer *ring;
+ int i;
+
+ /*
+ * Notify all waiters for GPU completion events that reset state has
+ * been changed, and that they need to restart their wait after
+ * checking for potential errors (and bail out to drop locks if there is
+ * a gpu reset pending so that i915_error_work_func can acquire them).
+ */
+
+ /* Wake up __wait_seqno, potentially holding dev->struct_mutex. */
+ for_each_ring(ring, dev_priv, i)
+ wake_up_all(&ring->irq_queue);
+
+ /* Wake up intel_crtc_wait_for_pending_flips, holding crtc->mutex. */
+ wake_up_all(&dev_priv->pending_flip_queue);
+
+ /*
+ * Signal tasks blocked in i915_gem_wait_for_error that the pending
+ * reset state is cleared.
+ */
+ if (reset_completed)
+ wake_up_all(&dev_priv->gpu_error.reset_queue);
+}
+
/**
* i915_error_work_func - do process context error handling work
* @work: work struct
@@ -1483,11 +1511,10 @@ static void i915_error_work_func(struct work_struct *work)
drm_i915_private_t *dev_priv = container_of(error, drm_i915_private_t,
gpu_error);
struct drm_device *dev = dev_priv->dev;
- struct intel_ring_buffer *ring;
char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
- int i, ret;
+ int ret;
kobject_uevent_env(&dev->primary->kdev.kobj, KOBJ_CHANGE, error_event);
@@ -1506,8 +1533,16 @@ static void i915_error_work_func(struct work_struct *work)
kobject_uevent_env(&dev->primary->kdev.kobj, KOBJ_CHANGE,
reset_event);
+ /*
+ * All state reset _must_ be completed before we update the
+ * reset counter, for otherwise waiters might miss the reset
+ * pending state and not properly drop locks, resulting in
+ * deadlocks with the reset work.
+ */
ret = i915_reset(dev);
+ intel_display_handle_reset(dev);
+
if (ret == 0) {
/*
* After all the gem state is reset, increment the reset
@@ -1528,12 +1563,11 @@ static void i915_error_work_func(struct work_struct *work)
atomic_set(&error->reset_counter, I915_WEDGED);
}
- for_each_ring(ring, dev_priv, i)
- wake_up_all(&ring->irq_queue);
-
- intel_display_handle_reset(dev);
-
- wake_up_all(&dev_priv->gpu_error.reset_queue);
+ /*
+ * Note: The wake_up also serves as a memory barrier so that
+ * waiters see the update value of the reset counter atomic_t.
+ */
+ i915_error_wake_up(dev_priv, true);
}
}
@@ -1642,8 +1676,6 @@ static void i915_report_and_clear_eir(struct drm_device *dev)
void i915_handle_error(struct drm_device *dev, bool wedged)
{
struct drm_i915_private *dev_priv = dev->dev_private;
- struct intel_ring_buffer *ring;
- int i;
i915_capture_error_state(dev);
i915_report_and_clear_eir(dev);
@@ -1653,11 +1685,19 @@ void i915_handle_error(struct drm_device *dev, bool wedged)
&dev_priv->gpu_error.reset_counter);
/*
- * Wakeup waiting processes so that the reset work item
- * doesn't deadlock trying to grab various locks.
+ * Wakeup waiting processes so that the reset work function
+ * i915_error_work_func doesn't deadlock trying to grab various
+ * locks. By bumping the reset counter first, the woken
+ * processes will see a reset in progress and back off,
+ * releasing their locks and then wait for the reset completion.
+ * We must do this for _all_ gpu waiters that might hold locks
+ * that the reset work needs to acquire.
+ *
+ * Note: The wake_up serves as the required memory barrier to
+ * ensure that the waiters see the updated value of the reset
+ * counter atomic_t.
*/
- for_each_ring(ring, dev_priv, i)
- wake_up_all(&ring->irq_queue);
+ i915_error_wake_up(dev_priv, false);
}
/*
--
1.8.4.rc3
More information about the Intel-gfx
mailing list