[Intel-xe] [PATCH] drm/xe: Fix an invalid locking wait context bug

Rodrigo Vivi rodrigo.vivi at intel.com
Wed Jul 26 17:24:56 UTC 2023


On Wed, Jul 26, 2023 at 12:38:53PM +0100, Matthew Auld wrote:
> On 25/07/2023 21:17, Rodrigo Vivi wrote:
> > We cannot have spin locks around xe_irq_reset, since it will
> > call the intel_display_power_is_enabled() function, and
> > that needs a mutex lock. Hence causing the undesired
> > "[ BUG: Invalid wait context ]"
> > 
> > We cannot convert i915's power domain lock to spin lock
> > due to the nested dependency of non-atomic context waits.
> > 
> > So, let's move the xe_irq_reset functions from the
> > critical area, while still ensuring that we are protecting
> > the irq.enabled and ensuring the right serialization
> > in the irq handlers.
> > 
> > v2: On the first version, I had missed the fact that
> > irq.enabled is checked on the xe/display glue layer,
> > and that i915 display code is actually using the irq
> > spin lock properly. So, this got changed to a version
> > suggested by Matthew Auld, along with introducing
> > the lockdep_assert_held at the display glue code.
> >

Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/463

> > Suggested-by: Matthew Auld <matthew.auld at intel.com>
> > Signed-off-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > ---
> >   drivers/gpu/drm/xe/display/ext/i915_irq.c |  1 +
> >   drivers/gpu/drm/xe/xe_irq.c               | 32 ++++++++++++++++++-----
> >   2 files changed, 27 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/display/ext/i915_irq.c b/drivers/gpu/drm/xe/display/ext/i915_irq.c
> > index fbb0e99143f6..ac435f7f854b 100644
> > --- a/drivers/gpu/drm/xe/display/ext/i915_irq.c
> > +++ b/drivers/gpu/drm/xe/display/ext/i915_irq.c
> > @@ -107,6 +107,7 @@ void intel_display_irq_uninstall(struct drm_i915_private *dev_priv)
> >   bool intel_irqs_enabled(struct xe_device *xe)
> >   {
> > +	lockdep_assert_held(&xe->irq.lock);
> >   	return xe->irq.enabled;
> >   }
> > diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
> > index eae190cb0969..d6b001f6a3c5 100644
> > --- a/drivers/gpu/drm/xe/xe_irq.c
> > +++ b/drivers/gpu/drm/xe/xe_irq.c
> > @@ -309,6 +309,13 @@ static irqreturn_t xelp_irq_handler(int irq, void *arg)
> >   	unsigned long intr_dw[2];
> >   	u32 identity[32];
> > +	spin_lock_irq(&xe->irq.lock);
> > +	if (!xe->irq.enabled) {
> > +		spin_unlock_irq(&xe->irq.lock);
> > +		return IRQ_NONE;
> > +	}
> > +	spin_unlock_irq(&xe->irq.lock);
> 
> I guess spin_lock() or spin_lock_irq_save() here and below, since this is
> inside hard irq?

I had considered saving and restoring the flags, but exactly because we are
inside it and we sync before disabling, we don't have the risk of blindly
re-enable with the spin_unlock_irq(), no? or what could I be missing on
that way?

But well, maybe the spin_lock() and spin_unlock() are the right one?

I'm doing some experiments here.... so far with the flags I'm getting
some strange warnings...

> 
> > +
> >   	master_ctl = xelp_intr_disable(xe);
> >   	if (!master_ctl) {
> >   		xelp_intr_enable(xe, false);
> > @@ -371,6 +378,13 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
> >   	/* TODO: This really shouldn't be copied+pasted */
> > +	spin_lock_irq(&xe->irq.lock);
> > +	if (!xe->irq.enabled) {
> > +		spin_unlock_irq(&xe->irq.lock);
> > +		return IRQ_NONE;
> > +	}
> > +	spin_unlock_irq(&xe->irq.lock);
> > +
> >   	master_tile_ctl = dg1_intr_disable(xe);
> >   	if (!master_tile_ctl) {
> >   		dg1_intr_enable(xe, false);
> > @@ -574,10 +588,14 @@ void xe_irq_shutdown(struct xe_device *xe)
> >   void xe_irq_suspend(struct xe_device *xe)
> >   {
> > +	int irq = to_pci_dev(xe->drm.dev)->irq;
> > +
> >   	spin_lock_irq(&xe->irq.lock);
> > -	xe->irq.enabled = false;
> > -	xe_irq_reset(xe);
> > +	xe->irq.enabled = false; /* no new irqs */
> >   	spin_unlock_irq(&xe->irq.lock);
> > +
> > +	synchronize_irq(irq); /* flush irqs */
> > +	xe_irq_reset(xe); /* turn irqs off */
> >   }
> >   void xe_irq_resume(struct xe_device *xe)
> > @@ -585,13 +603,15 @@ void xe_irq_resume(struct xe_device *xe)
> >   	struct xe_gt *gt;
> >   	int id;
> > -	spin_lock_irq(&xe->irq.lock);
> > +	/*
> > +	 * lock not needed:
> > +	 * 1. no irq will arrive before the postinstall
> > +	 * 2. display is not yet resumed
> > +	 */
> >   	xe->irq.enabled = true;
> >   	xe_irq_reset(xe);
> > -	xe_irq_postinstall(xe);
> > +	xe_irq_postinstall(xe); /* turn irqs on */
> >   	for_each_gt(gt, xe, id)
> >   		xe_irq_enable_hwe(gt);
> > -
> > -	spin_unlock_irq(&xe->irq.lock);
> >   }


More information about the Intel-xe mailing list