[Intel-gfx] [PATCH 3/7] drm/i915: More surgically unbreak the modeset vs reset deadlock

Mon Aug 7 15:39:11 UTC 2017

On Thu, Aug 03, 2017 at 12:35:48PM -0700, Michel Thierry wrote:
> Hi,
> 
> First sorry about the delay...
> 
> On 7/20/2017 10:57 AM, Daniel Vetter wrote:
> > There's no reason to entirely wedge the gpu, for the minimal deadlock
> > bugfix we only need to unbreak/decouple the atomic commit from the gpu
> > reset. The simplest way to fix that is by replacing the
> > unconditional fence wait a the top of commit_tail by a wait which
> > completes either when the fences are done (normal case, or when a
> > reset doesn't need to touch the display state). Or when the gpu reset
> > needs to force-unblock all pending modeset states.
> > 
> > The lesser source of deadlocks is when we try to pin a new framebuffer
> > and run into a stall. There's a bunch of places this can happen, like
> > eviction, changing the caching mode, acquiring a fence on older
> > platforms. And we can't just break the depency loop and keep going,
> > the only way would be to break out and restart. But the problem with
> > that approach is that we must stall for the reset to complete before
> > we grab any locks, and with the atomic infrastructure that's a bit
> > tricky. The only place is the ioctl code, and we don't want to insert
> > code into e.g. the BUSY ioctl. Hence for that problem just create a
> > critical section, and if any code is in there, wedge the GPU. For the
> > steady-state this should never be a problem.
> > 
> > Note that in both cases TDR itself keeps working, so from a userspace
> > pov this trickery isn't observable. Users themselvs might spot a short
> > glitch while the rendering is catching up again, but that's still
> > better than pre-TDR where we've thrown away all the rendering,
> > including innocent batches. Also, this fixes the regression TDR
> > introduced of making gpu resets deadlock-prone when we do need to
> > touch the display.
> > 
> > One thing I noticed is that gpu_error.flags seems to use both our own
> > wait-queue in gpu_error.wait_queue, and the generic wait_on_bit
> > facilities. Not entirely sure why this inconsistency exists, I just
> > picked one style.
> > 
> > A possible future avenue could be to insert the gpu reset in-between
> > ongoing modeset changes, which would avoid the momentary glitch. But
> > that's a lot more work to implement in the atomic commit machinery,
> > and given that we only need this for pre-g4x hw, of questionable
> > utility just for the sake of polishing gpu reset even more on those
> > old boxes. It might be useful for other features though.
> > 
> > v2: Rebase onto 4.13 with a s/wait_queue_t/struct wait_queue_entry/.
> > 
> > v3: Really emabarrassing fixup, I checked the wrong bit and broke the
> > unbreak/wakeup logic.
> > 
> > v4: Also handle deadlocks in pin_to_display.
> > 
> > Cc: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala at intel.com>
> > Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> > Signed-off-by: Daniel Vetter <daniel.vetter at intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_drv.h      |  3 +++
> >   drivers/gpu/drm/i915/intel_display.c | 45 +++++++++++++++++++++++++++++++-----
> >   2 files changed, 42 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> > index 07e98b07c5bc..9b055deca36d 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.h
> > +++ b/drivers/gpu/drm/i915/i915_drv.h
> > @@ -1505,6 +1505,8 @@ struct i915_gpu_error {
> >          /* Protected by the above dev->gpu_error.lock. */
> >          struct i915_gpu_state *first_error;
> > 
> > +       atomic_t pending_fb_pin;
> > +
> >          unsigned long missed_irq_rings;
> > 
> >          /**
> > @@ -1564,6 +1566,7 @@ struct i915_gpu_error {
> >          unsigned long flags;
> >   #define I915_RESET_BACKOFF     0
> >   #define I915_RESET_HANDOFF     1
> > +#define I915_RESET_MODESET     2
> >   #define I915_WEDGED            (BITS_PER_LONG - 1)
> >   #define I915_RESET_ENGINE      (I915_WEDGED - I915_NUM_ENGINES)
> > 
> 
> Since you still need to resend this, could you also update the
> I915_RESET_ENGINE overflow check in i915_handle_error? i.e.:
> 
> ---BUILD_BUG_ON(I915_RESET_HANDOFF >= I915_RESET_ENGINE);
> +++BUILD_BUG_ON(I915_RESET_MODESET >= I915_RESET_ENGINE);
> 
> Not that we would have this problem anytime soon...

Yeah missed that, will fix.

> > diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
> > index f6bd6282d7f7..63ea8d6b2ebd 100644
> > --- a/drivers/gpu/drm/i915/intel_display.c
> > +++ b/drivers/gpu/drm/i915/intel_display.c
> > @@ -2162,6 +2162,8 @@ intel_pin_and_fence_fb_obj(struct drm_framebuffer *fb, unsigned int rotation)
> >           */
> >          intel_runtime_pm_get(dev_priv);
> > 
> > +       atomic_inc(&dev_priv->gpu_error.pending_fb_pin);
> > +
> 
> Do we also need this in intel_overlay_do_put_image? It is also being called
> when the struct_mutex is held.

Indeed I missed that, will fix too and resubmit the pile (once we agree on
a comment for the previous patch).

Thanks, Daniel

> 
> 
> >          vma = i915_gem_object_pin_to_display_plane(obj, alignment, &view);
> >          if (IS_ERR(vma))
> >                  goto err;
> > @@ -2189,6 +2191,8 @@ intel_pin_and_fence_fb_obj(struct drm_framebuffer *fb, unsigned int rotation)
> > 
> >          i915_vma_get(vma);
> >   err:
> > +       atomic_dec(&dev_priv->gpu_error.pending_fb_pin);
> > +
> >          intel_runtime_pm_put(dev_priv);
> >          return vma;
> >   }
> > @@ -3471,11 +3475,14 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv)
> >              !gpu_reset_clobbers_display(dev_priv))
> >                  return;
> > 
> > -       /* We have a modeset vs reset deadlock, defensively unbreak it.
> > -        *
> > -        * FIXME: We can do a _lot_ better, this is just a first iteration.*/
> > -       i915_gem_set_wedged(dev_priv);
> > -       DRM_DEBUG_DRIVER("Wedging GPU to avoid deadlocks with pending modeset updates\n");
> > +       /* We have a modeset vs reset deadlock, defensively unbreak it. */
> > +       set_bit(I915_RESET_MODESET, &dev_priv->gpu_error.flags);
> > +       wake_up_all(&dev_priv->gpu_error.wait_queue);
> > +
> > +       if (atomic_read(&dev_priv->gpu_error.pending_fb_pin)) {
> > +               DRM_DEBUG_KMS("Modeset potentially stuck, unbreaking through wedging\n");
> > +               i915_gem_set_wedged(dev_priv);
> > +       }
> > 
> >          /*
> >           * Need mode_config.mutex so that we don't
> > @@ -3570,6 +3577,8 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
> >          drm_modeset_drop_locks(ctx);
> >          drm_modeset_acquire_fini(ctx);
> >          mutex_unlock(&dev->mode_config.mutex);
> > +
> > +       clear_bit(I915_RESET_MODESET, &dev_priv->gpu_error.flags);
> >   }
> > 
> >   static bool abort_flip_on_reset(struct intel_crtc *crtc)
> 		^^^ this is the only reason of why we need a rebase.
> 
> > @@ -12381,6 +12390,30 @@ static void intel_atomic_helper_free_state_worker(struct work_struct *work)
> >          intel_atomic_helper_free_state(dev_priv);
> >   }
> > 
> > +static void intel_atomic_commit_fence_wait(struct intel_atomic_state *intel_state)
> > +{
> > +       struct wait_queue_entry wait_fence, wait_reset;
> > +       struct drm_i915_private *dev_priv = to_i915(intel_state->base.dev);
> > +
> > +       init_wait_entry(&wait_fence, 0);
> > +       init_wait_entry(&wait_reset, 0);
> > +       for (;;) {
> > +               prepare_to_wait(&intel_state->commit_ready.wait,
> > +                               &wait_fence, TASK_UNINTERRUPTIBLE);
> > +               prepare_to_wait(&dev_priv->gpu_error.wait_queue,
> > +                               &wait_reset, TASK_UNINTERRUPTIBLE);
> > +
> > +
> > +               if (i915_sw_fence_done(&intel_state->commit_ready)
> > +                   || test_bit(I915_RESET_MODESET, &dev_priv->gpu_error.flags))
> > +                       break;
> > +
> > +               schedule();
> > +       }
> > +       finish_wait(&intel_state->commit_ready.wait, &wait_fence);
> > +       finish_wait(&dev_priv->gpu_error.wait_queue, &wait_reset);
> > +}
> > +
> >   static void intel_atomic_commit_tail(struct drm_atomic_state *state)
> >   {
> >          struct drm_device *dev = state->dev;
> > @@ -12394,7 +12427,7 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state)
> >          unsigned crtc_vblank_mask = 0;
> >          int i;
> > 
> > -       i915_sw_fence_wait(&intel_state->commit_ready);
> > +       intel_atomic_commit_fence_wait(intel_state);
> > 
> >          drm_atomic_helper_wait_for_dependencies(state);
> > 
> 
> Looks good to me.
> 
> -Michel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch