[PATCH 4/9] drm/xe: Move xe_irq runtime suspend and resume out of lockdep

Rodrigo Vivi rodrigo.vivi at intel.com
Wed Mar 6 20:04:03 UTC 2024


On Wed, Mar 06, 2024 at 06:56:37PM +0000, Matthew Auld wrote:
> On 06/03/2024 17:49, Rodrigo Vivi wrote:
> > On Wed, Mar 06, 2024 at 04:04:45PM +0000, Matthew Auld wrote:
> > > On 05/03/2024 22:45, Rodrigo Vivi wrote:
> > > > On Tue, Mar 05, 2024 at 11:07:37AM +0000, Matthew Auld wrote:
> > > > > On 04/03/2024 18:21, Rodrigo Vivi wrote:
> > > > > > Now that mem_access xe_pm_runtime_lockdep_map was moved to protect all
> > > > > > the sync resume calls lockdep is saying:
> > > > > > 
> > > > > >     Possible unsafe locking scenario:
> > > > > > 
> > > > > >           CPU0                    CPU1
> > > > > >           ----                    ----
> > > > > >      lock(xe_pm_runtime_lockdep_map);
> > > > > >                                   lock(&power_domains->lock);
> > > > > >                                   lock(xe_pm_runtime_lockdep_map);
> > > > > >      lock(&power_domains->lock);
> > > > > > 
> > > > > > -> #1 (xe_pm_runtime_lockdep_map){+.+.}-{0:0}:
> > > > > >           xe_pm_runtime_resume_and_get+0x6a/0x190 [xe]
> > > > > >           release_async_put_domains+0x26/0xa0 [xe]
> > > > > >           intel_display_power_put_async_work+0xcb/0x1f0 [xe]
> > > > > > 
> > > > > > -> #0 (&power_domains->lock){+.+.}-{4:4}:
> > > > > >           __lock_acquire+0x3259/0x62c0
> > > > > >           lock_acquire+0x19b/0x4c0
> > > > > >           __mutex_lock+0x16b/0x1a10
> > > > > >           intel_display_power_is_enabled+0x1f/0x40 [xe]
> > > > > >           gen11_display_irq_reset+0x1f2/0xcc0 [xe]
> > > > > >           xe_irq_reset+0x43d/0x1cb0 [xe]
> > > > > >           xe_irq_resume+0x52/0x660 [xe]
> > > > > >           xe_pm_runtime_resume+0x7d/0xdc0 [xe]
> > > > > > 
> > > > > > This is likely a false positive.
> > > > > > 
> > > > > > This lockdep is created to protect races from the inner callers
> > > > > 
> > > > > There is no real lock here so it doesn't protect anything AFAIK. It is just
> > > > > about mapping the hidden dependencies between locks held when waking up the
> > > > > device and locks acquired in the resume and suspend callbacks.
> > > > 
> > > > indeed a bad phrase. something like
> > > > 'This lockdep is created to warn us if we are at risk of introducing inner callers'
> > > > would make it better?
> > > 
> > > Yeah, or maybe something like:
> > > 
> > > "The lockdep annotations will warn if any lock held when potentially waking
> > > up the device, can also be acquired in either of the resume or suspend pm
> > > callbacks".
> > > 
> > > ?
> > > 
> > > > 
> > > > > 
> > > > > > of get-and-resume-sync that are within holding various memory access locks
> > > > > > with the resume and suspend itself that can also be trying to grab these
> > > > > > memory access locks.
> > > > > > 
> > > > > > This is not the case here, for sure. The &power_domains->lock seems to be
> > > > > > sufficient to protect any race and there's no counter part to get deadlocked
> > > > > > with.
> > > > > 
> > > > > What is meant by "race" here? The lockdep splat is saying that one or both
> > > > > of the resume or suspend callbacks is grabbing some lock, but that same lock
> > > > > is also held when potentially waking up the device. From lockdep POV that is
> > > > > a potential deadlock.
> > > > 
> > > > The only lock here is &power_domains->lock, which could be grabbed at both
> > > > suspend and resume. But even if we don't trust that only one of those
> > > > operations can happen simultaneously, what other lock could possibly be
> > > > held in a way that causes this theoretical deadlock?
> > > 
> > > I don't think there needs to be another lock here to deadlock. Also it
> > > should be completely fine that both the resume and suspend callbacks acquire
> > > that same lock. The issue is only when you are holding that same lock and
> > > then try to wake up the device synchronously. If that can actually happen we
> > > can hit deadlocks.
> > > 
> > > Simplest example would be A -> A deadlock:
> > > 
> > > lock(power->lock)
> > > pm_get_sync()
> > >     -> runtime_resume(xe)
> > >       -> lock(power->lock)
> > 
> > this case is impossible with power_domains->lock because the get_sync
> > is never called from inside a power_domains locked area.
> 
> I agree that you likely can't actually wake up the device here in practice,
> but it is still calling xe_pm_runtime_resume_and_get() while holding the
> domains->lock, which looks like a deadlock from lockdep pov.

dang, you are absolutely right. I'm sorry. That is exactly what lockdep
caught. From where I was looking at the power_domains->lock I was confident
that the device was already awake before reaching the locked areas, but lockdep
has no way of knowing that, and we have a remaining case where, although it is
protected, get_sync is still called.

So I'm dropping this patch and adding this display patch:

--- a/drivers/gpu/drm/i915/display/intel_display_power.c
+++ b/drivers/gpu/drm/i915/display/intel_display_power.c
@@ -646,7 +646,7 @@ release_async_put_domains(struct i915_power_domains *power_domains,
         * power well disabling.
         */
        assert_rpm_raw_wakeref_held(rpm);
-       wakeref = intel_runtime_pm_get(rpm);
+       wakeref = intel_runtime_pm_get_if_in_use(rpm);

Since we are already asserting just before this that a wakeref is held, there is
absolutely no functional change here, but it will ensure that the xe_pm lockdep
annotations won't complain about possible deadlocks.

Fair enough?

> In other places
> we instead just use noresume() or similar to make that clear in the code and
> to lockdep (and reviewers) that it is impossible. It is a trade-off between
> false positives and having a design you can more easily validate and be
> somewhat confident doesn't have deadlocks; every caller of
> xe_pm_runtime_resume_and_get() or xe_pm_runtime_get_sync() etc. is assumed
> to always be able to potentially wake up the device from a lockdep pov. I
> thought that made good sense overall with d3cold on dgpu now wanting to do
> stuff like GPU submission, allocating memory, and grabbing all kinds of
> scary locks etc, but if you think the annotations are getting in the way too
> much, I guess we can just nuke the annotations or don't use them for
> display?

I will have another display case soon when moving towards the d3cold sequence.
On that one I will try to scrutinize it better and we can discuss it separately
and see if we find an easy alternative.

Thank you so much,
Rodrigo.

> 
> > 
> > > 
> > > A more nasty example with A -> B, B -> A deadlock (where did B come
> > > from???):
> > > 
> > >                            rpm-core worker
> > > lock(power->lock)    |
> > >                       | -> status = RPM_SUSPENDING
> > >                       | -> runtime_suspend(xe)
> > >                       |   -> lock(power->lock)
> > > pm_get_sync()        |
> > >    -> wait_for_worker |
> > 
> > again, this case is also impossible because get_sync is never
> > called from inside the power_domains locked area. On both sides it
> > is called only from inner portions of the work, so it would be okay.
> > 
> > As every other current i915-display rpm handling.
> > 
> > > 
> > > The wait_for_worker is the waitqueue dance where it sees that runtime_status
> > > is RPM_SUSPENDING or RPM_RESUMING and then goes to sleep until the status
> > > changes to RPM_SUSPENDED or RPM_RESUMED. Once the worker completes it then
> > > wakes up the sleeper. But that wait_for_worker thing creates a dependency
> > > underneath. But here that wait dependency is hidden from lockdep AFAICT,
> > > since there is no actual lock acquisition happening. There is exactly one
> > > lock and two different contexts acquiring it, seems totally fine, so from
> > > lock acquisition pov there is no issue.
> > > 
> > > And yet it still deadlocks, since the rpm-core worker (or whatever is
> > > running the async suspend) is stuck trying to acquire lock(power->lock), but
> > > the other caller is already holding it and won't release it until the worker
> > > completes. And this is where the xe pm annotations helps by basically
> > > introducing a big-dumb-fake-lock to try to model the dependencies. The above
> > > then actually ends up looking like:
> > > 
> > >                            rpm-core worker
> > > lock(power->lock)    |
> > >                       | -> status = RPM_SUSPENDING
> > >                       | -> runtime_suspend(xe)
> > >                       |    map_acquire(pm)
> > >                       |   -> lock(power->lock)
> > >                       |
> > > map_acquire(pm)      |
> > > map_release(pm)      |
> > > pm_get_sync()        |
> > >   -> wait_for_worker  |
> > > 
> > > So what we actually have and what lockdep will easily see as potentially
> > > possible:
> > > 
> > > lock(power->lock) -> map_acquire(pm)
> > > map_acquire(pm) -> lock(power->lock)
> > > 
> > > Which is now a simple locking inversion with A -> B, B -> A. Admittedly this
> > > is a bit of a silly example with lock(power->lock), since lockdep will see
> > > the simple A -> A, but the point is that there might exist other locks that
> > > are only taken in the suspend callback, for example, and since suspend is
> > > usually always async and done from the rpm core worker I don't think lockdep
> > > will see the potential deadlocks without the xe pm annotations (this comes
> > > back to the hidden wait_for_worker dependency).
> > > 
> > > The other thing worth mentioning is that all pm_get_sync() calls are assumed
> > > to be capable of waking up the device as per the xe pm annotations. The
> > > advantage is that we don't need to hit the full wake-up-the-entire-device
> > > slow path for every caller in a real run (probably not possible in single CI
> > > run). We instead just need to hit both pm callbacks once in a given CI run
> > > (very easy), to record the map_acquire(pm) -> locks-in-pm-callbacks
> > > dependencies. And then so long as we also hit every pm_get_sync() call site,
> > > which now doesn't need to actually wake the device up for real, we also then
> > > get the locks-held-when-waking-up-device -> map_acquire(pm), which should be
> > > enough for lockdep to find any clear locking inversions, like the above.
> > > With that we can have pretty high confidence we don't have any potential
> > > deadlocks. The disadvantage is potential false positives (like maybe in this
> > > patch), but when you consider that nasty A -> B, B -> A example, I think it
> > > is probably worth it IMO.
> > > 
> > > Also a good read here:
> > > https://blog.ffwll.ch/2020/08/lockdep-false-positives.html
> > 
> > yeap, I have read that entire series more than once. I would never challenge
> > the lockdep itself.
> > 
> > My challenge here is with our annotation. If this is such a problematic case,
> > perhaps we should try to convince the linux core kernel folks to put
> > this annotation inside the runtime pm calls themselves?
> > and likely under the power->lock?
> > 
> > Our annotation is very good for the various memory handling cases, where
> > we take the memory locks upfront and inner get_sync calls taking
> > the same locks would certainly cause the deadlocks mentioned above.
> > 
> > > 
> > > > 
> > > > > 
> > > > > If we are saying that it is impossible to actually wake up the device in
> > > > > this particular case then can we rather make caller use _noresume() or
> > > > > ifactive()?
> > > > 
> > > > I'm trying to avoid touching the i915-display runtime-pm code. :/
> > > > 
> > > > At some point I even thought about making all the i915-display runtime-pm
> > > > calls bogus on xe and having the runtime_pm idle check for display connected,
> > > > but there are so many cases where the code takes different decisions depending
> > > > on whether runtime_pm is in-use or not that it would complicate things anyway.
> > > > 
> > > > > 
> > > > > > 
> > > > > > Also worth mentioning that on i915, intel_display_power_put_async_work
> > > > > > also gets and resumes synchronously, the runtime pm get/put
> > > > > > also resets the irq, and that code was never problematic.
> > > > > > 
> > > > > > Cc: Matthew Auld <matthew.auld at intel.com>
> > > > > > Signed-off-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > > > > > ---
> > > > > >     drivers/gpu/drm/xe/xe_pm.c | 7 +++++--
> > > > > >     1 file changed, 5 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
> > > > > > index b534a194a9ef..919250e38ae0 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_pm.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_pm.c
> > > > > > @@ -347,7 +347,10 @@ int xe_pm_runtime_suspend(struct xe_device *xe)
> > > > > >     			goto out;
> > > > > >     	}
> > > > > > +	lock_map_release(&xe_pm_runtime_lockdep_map);
> > > > > >     	xe_irq_suspend(xe);
> > > > > > +	xe_pm_write_callback_task(xe, NULL);
> > > > > > +	return 0;
> > > > > >     out:
> > > > > >     	lock_map_release(&xe_pm_runtime_lockdep_map);
> > > > > >     	xe_pm_write_callback_task(xe, NULL);
> > > > > > @@ -369,6 +372,8 @@ int xe_pm_runtime_resume(struct xe_device *xe)
> > > > > >     	/* Disable access_ongoing asserts and prevent recursive pm calls */
> > > > > >     	xe_pm_write_callback_task(xe, current);
> > > > > > +	xe_irq_resume(xe);
> > > > > > +
> > > > > >     	lock_map_acquire(&xe_pm_runtime_lockdep_map);
> > > > > >     	/*
> > > > > > @@ -395,8 +400,6 @@ int xe_pm_runtime_resume(struct xe_device *xe)
> > > > > >     			goto out;
> > > > > >     	}
> > > > > > -	xe_irq_resume(xe);
> > > > > > -
> > > > > >     	for_each_gt(gt, xe, id)
> > > > > >     		xe_gt_resume(gt);

