[Intel-gfx] [PATCH] drm/i915: fix wait_for_pending_flips vs gpu hang deadlock
Daniel Vetter
daniel at ffwll.ch
Mon Sep 9 11:26:28 CEST 2013
On Sun, Sep 08, 2013 at 09:57:13PM +0200, Daniel Vetter wrote:
> My g33 here seems to be shockingly good at hitting them all. This time
> around kms_flip/flip-vs-panning-vs-hang blows up:
>
> intel_crtc_wait_for_pending_flips correctly checks for gpu hangs and
> if a gpu hang is pending aborts the wait for outstanding flips so that
> the setcrtc call will succeed and release the crtc mutex. And the gpu
> hang handler needs that lock in intel_display_handle_reset to be able
> to complete outstanding flips.
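>
> For illustration, a minimal sketch of that waiter-side bail-out
> pattern (hand-written for this mail, not part of the patch; the
> example_* name is made up and the helpers are only approximated -
> the point is just the shape of the wait condition):
>
>     static void example_wait_for_pending_flips(struct drm_crtc *crtc)
>     {
>             struct drm_i915_private *dev_priv = crtc->dev->dev_private;
>
>             /*
>              * Give up waiting as soon as a reset is pending so that the
>              * caller can drop crtc->mutex, which the reset work needs
>              * to complete the outstanding flips itself.
>              */
>             wait_event(dev_priv->pending_flip_queue,
>                        !intel_crtc_has_pending_flip(crtc) ||
>                        i915_reset_in_progress(&dev_priv->gpu_error));
>     }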
>
> The problem is that we can race in two ways:
> - Waiters on the dev_priv->pending_flip_queue aren't woken up after
> we've marked the reset as pending, but before we actually start the
> reset work. This means that the waiter doesn't notice the pending
> reset and hence will keep on hogging the locks.
>
> As with dev->struct_mutex and the ring->irq_queue wait queues, we
> therefore need to wake up everyone that potentially holds a lock which
> the reset handler needs.
>
> - intel_display_handle_reset was called _after_ we've already
> signalled the completion of the reset work, which means a waiter
> could sneak in, grab the lock and never release it (since the
> pageflips won't ever get completed).
>
> Similar to resetting the gem state, all the reset work must complete
> before we update the reset counter. Contrary to the gem reset, we
> don't need a second explicit wake-up call since that will have
> happened already when completing the pageflips. We also don't have
> any issue with the completion happening while the reset state is
> still pending - wait_for_pending_flips is only there to ensure we
> display the right frame. After a gpu hang & reset event such
> guarantees are out the window anyway. This is in contrast to the gem
> code, where a too-early wake-up would result in unnecessary
> restarting of ioctls.
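>
> Condensed purely for illustration (the real code is in the diff below,
> error paths and uevents omitted), the resulting ordering across
> i915_handle_error and i915_error_work_func is:
>
>     /* i915_handle_error(), simplified: */
>     /* mark the reset as pending in gpu_error.reset_counter */
>     i915_error_wake_up(dev_priv, false);  /* lock holders notice and bail */
>
>     /* i915_error_work_func(), simplified: */
>     ret = i915_reset(dev);                /* reset the gem state */
>     intel_display_handle_reset(dev);      /* complete flips, needs the crtc locks */
>     if (ret == 0)
>             atomic_inc(&dev_priv->gpu_error.reset_counter); /* reset done */
>     else
>             atomic_set(&dev_priv->gpu_error.reset_counter, I915_WEDGED);
>     i915_error_wake_up(dev_priv, true);   /* wake_up doubles as the barrier */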
>
> Also, since we've gotten these various deadlocks and ordering
> constraints wrong so often, throw copious amounts of comments at the
> code.
>
> This deadlock regression was introduced by the commit which added
> the pageflip reset logic to the gpu hang work:
>
> commit 96a02917a0131e52efefde49c2784c0421d6c439
> Author: Ville Syrjälä <ville.syrjala at linux.intel.com>
> Date: Mon Feb 18 19:08:49 2013 +0200
>
> drm/i915: Finish page flips and update primary planes after a GPU reset
>
> v2:
> - Add comments to explain how the wake_up serves as memory barriers
> for the atomic_t reset counter.
> - Improve the comments a bit as suggested by Chris Wilson.
> - Extract the wake_up calls before/after the reset into a little
> i915_error_wake_up and unconditionally wake up the
> pending_flip_queue waiters, again as suggested by Chris Wilson.
>
> v3: Throw copious amounts of comments at i915_error_wake_up as
> suggested by Chris Wilson.
>
> Cc: stable at vger.kernel.org
> Cc: Ville Syrjälä <ville.syrjala at linux.intel.com>
> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> Reviewed-by: Chris Wilson <chris at chris-wilson.co.uk>
> Signed-off-by: Daniel Vetter <daniel.vetter at ffwll.ch>
Ok, smashed onto -fixes. Thanks a lot for your critical review.
-Daniel
> ---
> drivers/gpu/drm/i915/i915_irq.c | 68 ++++++++++++++++++++++++++++++++---------
> 1 file changed, 54 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 83cce0c..4b91228 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1469,6 +1469,34 @@ static irqreturn_t ironlake_irq_handler(int irq, void *arg)
> return ret;
> }
>
> +static void i915_error_wake_up(struct drm_i915_private *dev_priv,
> + bool reset_completed)
> +{
> + struct intel_ring_buffer *ring;
> + int i;
> +
> + /*
> + * Notify all waiters for GPU completion events that reset state has
> + * been changed, and that they need to restart their wait after
> + * checking for potential errors (and bail out to drop locks if there is
> + * a gpu reset pending so that i915_error_work_func can acquire them).
> + */
> +
> + /* Wake up __wait_seqno, potentially holding dev->struct_mutex. */
> + for_each_ring(ring, dev_priv, i)
> + wake_up_all(&ring->irq_queue);
> +
> + /* Wake up intel_crtc_wait_for_pending_flips, holding crtc->mutex. */
> + wake_up_all(&dev_priv->pending_flip_queue);
> +
> + /*
> + * Signal tasks blocked in i915_gem_wait_for_error that the pending
> + * reset state is cleared.
> + */
> + if (reset_completed)
> + wake_up_all(&dev_priv->gpu_error.reset_queue);
> +}
> +
> /**
> * i915_error_work_func - do process context error handling work
> * @work: work struct
> @@ -1483,11 +1511,10 @@ static void i915_error_work_func(struct work_struct *work)
> drm_i915_private_t *dev_priv = container_of(error, drm_i915_private_t,
> gpu_error);
> struct drm_device *dev = dev_priv->dev;
> - struct intel_ring_buffer *ring;
> char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
> char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
> char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
> - int i, ret;
> + int ret;
>
> kobject_uevent_env(&dev->primary->kdev.kobj, KOBJ_CHANGE, error_event);
>
> @@ -1506,8 +1533,16 @@ static void i915_error_work_func(struct work_struct *work)
> kobject_uevent_env(&dev->primary->kdev.kobj, KOBJ_CHANGE,
> reset_event);
>
> + /*
> + * All state reset _must_ be completed before we update the
> + * reset counter, for otherwise waiters might miss the reset
> + * pending state and not properly drop locks, resulting in
> + * deadlocks with the reset work.
> + */
> ret = i915_reset(dev);
>
> + intel_display_handle_reset(dev);
> +
> if (ret == 0) {
> /*
> * After all the gem state is reset, increment the reset
> @@ -1528,12 +1563,11 @@ static void i915_error_work_func(struct work_struct *work)
> atomic_set(&error->reset_counter, I915_WEDGED);
> }
>
> - for_each_ring(ring, dev_priv, i)
> - wake_up_all(&ring->irq_queue);
> -
> - intel_display_handle_reset(dev);
> -
> - wake_up_all(&dev_priv->gpu_error.reset_queue);
> + /*
> + * Note: The wake_up also serves as a memory barrier so that
> + * waiters see the updated value of the reset counter atomic_t.
> + */
> + i915_error_wake_up(dev_priv, true);
> }
> }
>
> @@ -1642,8 +1676,6 @@ static void i915_report_and_clear_eir(struct drm_device *dev)
> void i915_handle_error(struct drm_device *dev, bool wedged)
> {
> struct drm_i915_private *dev_priv = dev->dev_private;
> - struct intel_ring_buffer *ring;
> - int i;
>
> i915_capture_error_state(dev);
> i915_report_and_clear_eir(dev);
> @@ -1653,11 +1685,19 @@ void i915_handle_error(struct drm_device *dev, bool wedged)
> &dev_priv->gpu_error.reset_counter);
>
> /*
> - * Wakeup waiting processes so that the reset work item
> - * doesn't deadlock trying to grab various locks.
> + * Wakeup waiting processes so that the reset work function
> + * i915_error_work_func doesn't deadlock trying to grab various
> + * locks. By bumping the reset counter first, the woken
> + * processes will see a reset in progress and back off,
> + * releasing their locks and then wait for the reset completion.
> + * We must do this for _all_ gpu waiters that might hold locks
> + * that the reset work needs to acquire.
> + *
> + * Note: The wake_up serves as the required memory barrier to
> + * ensure that the waiters see the updated value of the reset
> + * counter atomic_t.
> */
> - for_each_ring(ring, dev_priv, i)
> - wake_up_all(&ring->irq_queue);
> + i915_error_wake_up(dev_priv, false);
> }
>
> /*
> --
> 1.8.4.rc3
>
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch