[PATCH 4/9] drm/xe: Move xe_irq runtime suspend and resume out of lockdep

Rodrigo Vivi rodrigo.vivi at intel.com
Tue Mar 5 22:45:44 UTC 2024


On Tue, Mar 05, 2024 at 11:07:37AM +0000, Matthew Auld wrote:
> On 04/03/2024 18:21, Rodrigo Vivi wrote:
> > Now that mem_access xe_pm_runtime_lockdep_map was moved to protect all
> > the sync resume calls, lockdep is saying:
> > 
> >   Possible unsafe locking scenario:
> > 
> >         CPU0                    CPU1
> >         ----                    ----
> >    lock(xe_pm_runtime_lockdep_map);
> >                                 lock(&power_domains->lock);
> >                                 lock(xe_pm_runtime_lockdep_map);
> >    lock(&power_domains->lock);
> > 
> > -> #1 (xe_pm_runtime_lockdep_map){+.+.}-{0:0}:
> >         xe_pm_runtime_resume_and_get+0x6a/0x190 [xe]
> >         release_async_put_domains+0x26/0xa0 [xe]
> >         intel_display_power_put_async_work+0xcb/0x1f0 [xe]
> > 
> > -> #0 (&power_domains->lock){+.+.}-{4:4}:
> >         __lock_acquire+0x3259/0x62c0
> >         lock_acquire+0x19b/0x4c0
> >         __mutex_lock+0x16b/0x1a10
> >         intel_display_power_is_enabled+0x1f/0x40 [xe]
> >         gen11_display_irq_reset+0x1f2/0xcc0 [xe]
> >         xe_irq_reset+0x43d/0x1cb0 [xe]
> >         xe_irq_resume+0x52/0x660 [xe]
> >         xe_pm_runtime_resume+0x7d/0xdc0 [xe]
> > 
> > This is likely a false positive.
> > 
> > This lockdep is created to protect races from the inner callers
> 
> There is no real lock here so it doesn't protect anything AFAIK. It is just
> about mapping the hidden dependencies between locks held when waking up the
> device and locks acquired in the resume and suspend callbacks.

indeed a bad phrase. Would something like
'This lockdep is created to warn us if we are at risk of introducing inner callers'
make it better?
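
Just to spell out the pattern we are talking about, it is roughly this
(simplified sketch, not the actual xe_pm.c code; the sketch_* names are
made up for illustration):

	#include <linux/lockdep.h>
	#include <linux/pm_runtime.h>

	/* in the real driver this lives behind CONFIG_LOCKDEP */
	static struct lockdep_map xe_pm_runtime_lockdep_map = {
		.name = "xe_pm_runtime_lockdep_map",
	};

	/*
	 * Synchronous wake-up path: "acquire" the map while the caller's
	 * locks are held, so lockdep records
	 * caller-lock -> xe_pm_runtime_lockdep_map.
	 */
	static void sketch_runtime_get_sync(struct device *dev)
	{
		lock_map_acquire(&xe_pm_runtime_lockdep_map);
		lock_map_release(&xe_pm_runtime_lockdep_map);
		pm_runtime_get_sync(dev);
	}

	/*
	 * Runtime callback: hold the map across the callback body, so
	 * lockdep records xe_pm_runtime_lockdep_map -> callback-lock for
	 * every lock taken in here. If the same lock shows up on both
	 * sides, it splats like the report above.
	 */
	static int sketch_runtime_resume(struct device *dev)
	{
		lock_map_acquire(&xe_pm_runtime_lockdep_map);
		/* e.g. &power_domains->lock via xe_irq_reset() ends up here */
		lock_map_release(&xe_pm_runtime_lockdep_map);
		return 0;
	}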

> 
> > of get-and-resume-sync that are within holding various memory access locks
> > with the resume and suspend itself that can also be trying to grab these
> > memory access locks.
> > 
> > This is not the case here, for sure. The &power_domains->lock seems to be
> > sufficient to protect any race and there's no counterpart to get deadlocked
> > with.
> 
> What is meant by "race" here? The lockdep splat is saying that one or both
> of the resume or suspend callbacks is grabbing some lock, but that same lock
> is also held when potentially waking up the device. From lockdep POV that is
> a potential deadlock.

The lock is &power_domains->lock only, which could be grabbed at both suspend
and resume. But even if we don't trust that only one of those operations can
happen at a time, what other lock could possibly be held in a way that causes
this theoretical deadlock?

> 
> If we are saying that it is impossible to actually wake up the device in
> this particular case then can we rather make caller use _noresume() or
> ifactive()?

I'm trying to avoid touching the i915-display runtime-pm code. :/

At some point I even thought about making all the i915-display runtime_pm
calls bogus (no-ops) on xe and making the runtime_pm idle callback check
whether a display is connected, but there are so many cases where the code
takes different decisions depending on whether runtime_pm is in use or not
that it would complicate things a bit anyway.

> 
> > 
> > Also worth mentioning that on i915, intel_display_power_put_async_work
> > also does a synchronous get-and-resume, and the runtime pm get/put
> > also resets the irq, and that code was never problematic.
> > 
> > Cc: Matthew Auld <matthew.auld at intel.com>
> > Signed-off-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > ---
> >   drivers/gpu/drm/xe/xe_pm.c | 7 +++++--
> >   1 file changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
> > index b534a194a9ef..919250e38ae0 100644
> > --- a/drivers/gpu/drm/xe/xe_pm.c
> > +++ b/drivers/gpu/drm/xe/xe_pm.c
> > @@ -347,7 +347,10 @@ int xe_pm_runtime_suspend(struct xe_device *xe)
> >   			goto out;
> >   	}
> > +	lock_map_release(&xe_pm_runtime_lockdep_map);
> >   	xe_irq_suspend(xe);
> > +	xe_pm_write_callback_task(xe, NULL);
> > +	return 0;
> >   out:
> >   	lock_map_release(&xe_pm_runtime_lockdep_map);
> >   	xe_pm_write_callback_task(xe, NULL);
> > @@ -369,6 +372,8 @@ int xe_pm_runtime_resume(struct xe_device *xe)
> >   	/* Disable access_ongoing asserts and prevent recursive pm calls */
> >   	xe_pm_write_callback_task(xe, current);
> > +	xe_irq_resume(xe);
> > +
> >   	lock_map_acquire(&xe_pm_runtime_lockdep_map);
> >   	/*
> > @@ -395,8 +400,6 @@ int xe_pm_runtime_resume(struct xe_device *xe)
> >   			goto out;
> >   	}
> > -	xe_irq_resume(xe);
> > -
> >   	for_each_gt(gt, xe, id)
> >   		xe_gt_resume(gt);
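
So with this patch the net ordering becomes roughly the following (sketch
of the resulting flow as I read the diff, error handling and most of the
body elided):

	int xe_pm_runtime_suspend(struct xe_device *xe)
	{
		xe_pm_write_callback_task(xe, current);

		lock_map_acquire(&xe_pm_runtime_lockdep_map);
		/* ... eviction, xe_gt_suspend(), etc. ... */
		lock_map_release(&xe_pm_runtime_lockdep_map);

		/* irq suspend now runs outside the annotated section */
		xe_irq_suspend(xe);

		xe_pm_write_callback_task(xe, NULL);
		return 0;
	}

	int xe_pm_runtime_resume(struct xe_device *xe)
	{
		xe_pm_write_callback_task(xe, current);

		/* irq resume now runs before entering the annotated section */
		xe_irq_resume(xe);

		lock_map_acquire(&xe_pm_runtime_lockdep_map);
		/* ... d3cold handling, xe_gt_resume(), etc. ... */
		lock_map_release(&xe_pm_runtime_lockdep_map);

		xe_pm_write_callback_task(xe, NULL);
		return 0;
	}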

