[Intel-gfx] [PATCH v2] drm/i915: Drain the device workqueue on unload

Wed Jul 19 12:23:29 UTC 2017

Quoting Mika Kuoppala (2017-07-19 12:51:04)
> Chris Wilson <chris at chris-wilson.co.uk> writes:
> 
> > Quoting Mika Kuoppala (2017-07-19 12:18:47)
> >> Chris Wilson <chris at chris-wilson.co.uk> writes:
> >> 
> >> > Workers on the i915->wq may rearm themselves so for completeness we need
> >> > to replace our flush_workqueue() with a call to drain_workqueue() before
> >> > unloading the device.
> >> >
> >> > v2: Reinforce the drain_workqueue with a preceding rcu_barrier() as a
> >> > few of the tasks that need to be drained may first be armed by RCU.
> >> >
> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=101627
> >> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >> > Cc: Matthew Auld <matthew.auld at intel.com>
> >> > Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> >> > ---
> >> >  drivers/gpu/drm/i915/i915_drv.c                  |  6 ++----
> >> >  drivers/gpu/drm/i915/i915_drv.h                  | 20 ++++++++++++++++++++
> >> >  drivers/gpu/drm/i915/selftests/mock_gem_device.c |  2 +-
> >> >  3 files changed, 23 insertions(+), 5 deletions(-)
> >> >
> >> > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> >> > index 4b62fd012877..41c5b11a7c8f 100644
> >> > --- a/drivers/gpu/drm/i915/i915_drv.c
> >> > +++ b/drivers/gpu/drm/i915/i915_drv.c
> >> > @@ -596,7 +596,8 @@ static const struct vga_switcheroo_client_ops i915_switcheroo_ops = {
> >> >  
> >> >  static void i915_gem_fini(struct drm_i915_private *dev_priv)
> >> >  {
> >> > -     flush_workqueue(dev_priv->wq);
> >> > +     /* Flush any outstanding unpin_work. */
> >> > +     i915_gem_drain_workqueue(dev_priv);
> >> >  
> >> >       mutex_lock(&dev_priv->drm.struct_mutex);
> >> >       intel_uc_fini_hw(dev_priv);
> >> > @@ -1409,9 +1410,6 @@ void i915_driver_unload(struct drm_device *dev)
> >> >       cancel_delayed_work_sync(&dev_priv->gpu_error.hangcheck_work);
> >> >       i915_reset_error_state(dev_priv);
> >> >  
> >> > -     /* Flush any outstanding unpin_work. */
> >> > -     drain_workqueue(dev_priv->wq);
> >> > -
> >> >       i915_gem_fini(dev_priv);
> >> >       intel_uc_fini_fw(dev_priv);
> >> >       intel_fbc_cleanup_cfb(dev_priv);
> >> > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> >> > index 667fb5c44483..e9a4b96dc775 100644
> >> > --- a/drivers/gpu/drm/i915/i915_drv.h
> >> > +++ b/drivers/gpu/drm/i915/i915_drv.h
> >> > @@ -3300,6 +3300,26 @@ static inline void i915_gem_drain_freed_objects(struct drm_i915_private *i915)
> >> >       } while (flush_work(&i915->mm.free_work));
> >> >  }
> >> >  
> >> > +static inline void i915_gem_drain_workqueue(struct drm_i915_private *i915)
> >> > +{
> >> > +     /*
> >> > +      * Similar to objects above (see i915_gem_drain_freed_objects), in
> >> > +      * general we have workers that are armed by RCU and then rearm
> >> > +      * themselves in their callbacks. To be paranoid, we need to
> >> > +      * drain the workqueue a second time after waiting for the RCU
> >> > +      * grace period so that we catch work queued via RCU from the first
> >> > +      * pass. As neither drain_workqueue() nor flush_workqueue() report
> >> > +      * a result, we assume that we don't require more than 2 passes
> >> > +      * to catch all recursive RCU delayed work.
> >> > +      *
> >> > +      */
> >> > +     int pass = 2;
> >> > +     do {
> >> > +             rcu_barrier();
> >> > +             drain_workqueue(i915->wq);
> >> 
> >> I am fine with the paranoia, and it covers the case below. Still if we do:
> >> 
> >> drain_workqueue();
> >> rcu_barrier();
> >> 
> >> With draining in progress, only chain queuing is allowed. I understand
> >> this so that when it returns, all the ctx pointers are now unreferenced
> >> but not freed.
> >> 
> >> Thus the rcu_barrier() after it cleans the trash and we are good to
> >> be unloaded. With one pass.
> >> 
> >> I guess it comes down to how to read the comment, so could you
> >> elaborate on 'we have workers that are armed by RCU and then rearm
> >> themselves'? Going by the drain_workqueue() description, this should
> >> be covered.
> >
> > I'm considering that they may be rearmed via RCU in the general case,
> > e.g. context close frees an object and so goes onto an RCU list that
> > once processed kicks off a new worker and so requires another round of
> > drain_workqueue. We are in module unload so a few extra delays for belt
> > and braces are ok until somebody notices it takes a few minutes to run a
> > reload test ;)
> 
> Ok. Patch is
> Reviewed-by: Mika Kuoppala <mika.kuoppala at intel.com>

Thanks, I'm optimistic this will silence the bug, so marking it as
resolved. Pushed,
-Chris
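
For anyone reading the thread later, the two-pass reasoning discussed above can be illustrated with a small userspace toy model. This is only a sketch under stated assumptions, not the driver code: the struct, the model_* helpers and the chain_left counter are all hypothetical, plain counters stand in for the RCU callback list and for i915->wq, and the `while (--passes)` loop bound is assumed because the quoted hunk is cut off before the end of the loop.

```c
#include <assert.h>

/* Userspace toy model, NOT kernel code: counters stand in for the queues. */
struct model {
	int rcu_pending;  /* callbacks waiting on an RCU grace period */
	int work_pending; /* work items sitting on the workqueue */
	int chain_left;   /* times a worker will still re-arm itself via RCU */
};

/* Stand-in for rcu_barrier(): run each pending callback; each one queues a worker. */
static void model_rcu_barrier(struct model *m)
{
	while (m->rcu_pending > 0) {
		m->rcu_pending--;
		m->work_pending++;
	}
}

/*
 * Stand-in for drain_workqueue(): run workers until the queue is empty.
 * A worker may schedule a new RCU callback (e.g. by freeing an object),
 * which the drain itself cannot see.
 */
static void model_drain_workqueue(struct model *m)
{
	while (m->work_pending > 0) {
		m->work_pending--;
		if (m->chain_left > 0) {
			m->chain_left--;
			m->rcu_pending++;
		}
	}
}

/* Barrier then drain, repeated 'passes' times; returns what is still outstanding. */
static int model_drain_in_passes(struct model *m, int passes)
{
	assert(passes > 0);
	do {
		model_rcu_barrier(m);
		model_drain_workqueue(m);
	} while (--passes); /* assumed loop bound, see note above */

	return m->rcu_pending + m->work_pending;
}
```

In this model, a worker that re-arms itself once via RCU leaves one callback outstanding after a single pass and none after two. A chain that re-arms twice still leaves one item behind after two passes, which is the limit the code comment admits to when it assumes two passes are enough.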