[PATCH 4/9] drm/xe: Move xe_irq runtime suspend and resume out of lockdep

Wed Mar 6 18:56:37 UTC 2024

On 06/03/2024 17:49, Rodrigo Vivi wrote:
> On Wed, Mar 06, 2024 at 04:04:45PM +0000, Matthew Auld wrote:
>> On 05/03/2024 22:45, Rodrigo Vivi wrote:
>>> On Tue, Mar 05, 2024 at 11:07:37AM +0000, Matthew Auld wrote:
>>>> On 04/03/2024 18:21, Rodrigo Vivi wrote:
>>>>> Now that mem_access xe_pm_runtime_lockdep_map was moved to protect all
>>>>> the sync resume calls lockdep is saying:
>>>>>
>>>>>     Possible unsafe locking scenario:
>>>>>
>>>>>           CPU0                    CPU1
>>>>>           ----                    ----
>>>>>      lock(xe_pm_runtime_lockdep_map);
>>>>>                                   lock(&power_domains->lock);
>>>>>                                   lock(xe_pm_runtime_lockdep_map);
>>>>>      lock(&power_domains->lock);
>>>>>
>>>>> -> #1 (xe_pm_runtime_lockdep_map){+.+.}-{0:0}:
>>>>>           xe_pm_runtime_resume_and_get+0x6a/0x190 [xe]
>>>>>           release_async_put_domains+0x26/0xa0 [xe]
>>>>>           intel_display_power_put_async_work+0xcb/0x1f0 [xe]
>>>>>
>>>>> -> #0 (&power_domains->lock){+.+.}-{4:4}:
>>>>>           __lock_acquire+0x3259/0x62c0
>>>>>           lock_acquire+0x19b/0x4c0
>>>>>           __mutex_lock+0x16b/0x1a10
>>>>>           intel_display_power_is_enabled+0x1f/0x40 [xe]
>>>>>           gen11_display_irq_reset+0x1f2/0xcc0 [xe]
>>>>>           xe_irq_reset+0x43d/0x1cb0 [xe]
>>>>>           xe_irq_resume+0x52/0x660 [xe]
>>>>>           xe_pm_runtime_resume+0x7d/0xdc0 [xe
>>>>>
>>>>> This is likely a false positive.
>>>>>
>>>>> This lockdep is created to protect races from the inner callers
>>>>
>>>> There is no real lock here so it doesn't protect anything AFAIK. It is just
>>>> about mapping the hidden dependencies between locks held when waking up the
>>>> device and locks acquired in the resume and suspend callbacks.
>>>
>>> indeed a bad phrase. something like
>>> 'This lockdep is created to warn us if we are at risk of introducing inner callers"
>>> would make it better?
>>
>> Yeah, or maybe something like:
>>
>> "The lockdep annotations will warn if any lock held when potentially waking
>> up the device, can also be acquired in either of the resume or suspend pm
>> callbacks".
>>
>> ?
>>
>>>
>>>>
>>>>> of get-and-resume-sync that are within holding various memory access locks
>>>>> with the resume and suspend itself that can also be trying to grab these
>>>>> memory access locks.
>>>>>
>>>>> This is not the case here, for sure. The &power_domains->lock seems to be
>>>>> sufficient to protect any race and there's no counter part to get deadlocked
>>>>> with.
>>>>
>>>> What is meant by "race" here? The lockdep splat is saying that one or both
>>>> of the resume or suspend callbacks is grabbing some lock, but that same lock
>>>> is also held when potentially waking up the device. From lockdep POV that is
>>>> a potential deadlock.
>>>
>>> The lock is &power_domains->lock only, that could be grabbed at both suspend
>>> and resume. But even though we are not trusting that only one of the operations
>>> can help simultaneously, what are the other lock that could be possibly be
>>> hold in a way to cause this theoretical deadlock?
>>
>> I don't think there needs to be another lock here to deadlock. Also it
>> should be completely fine that both the resume and suspend callbacks acquire
>> that same lock. The issue is only when you are holding that same lock and
>> then try to wake up the device synchronously. If that can actually happen we
>> can hit deadlocks.
>>
>> Simplest example would be A -> A deadlock:
>>
>> lock(power->lock)
>> pm_get_sync()
>>     -> runtime_resume(xe)
>>       -> lock(power->lock)
> 
> this case is impossible with power_domains->lock because the get_sync
> is never called from inside a power_domains locked area.

I agree that you likely can't actually wake up the device here in 
practice, but it is still calling xe_pm_runtime_resume_and_get() while 
holding the domains->lock, which looks like a deadlock from lockdep pov. 
In other places we instead just use noresume() or similar to make that 
clear in the code and to lockdep (and reviewers) that it is impossible. 
It is a trade-off between false positives and having a design you can 
more easily validate and be somewhat confident doesn't have deadlocks; 
every caller of xe_pm_runtime_resume_and_get() or 
xe_pm_runtime_get_sync() etc. is assumed to always be able to 
potentially wake up the device from a lockdep pov. I thought that made 
good sense overall with d3cold on dgpu now wanting to do stuff like GPU 
submission, allocating memory, and grabbing all kinds of scary locks 
etc, but if you think the annotations are getting in the way too much, I 
guess we can just nuke the annotations or don't use them for display?

> 
>>
>> A more nasty example with A -> B, B -> A deadlock (where did B come
>> from???):
>>
>>                            rpm-core worker
>> lock(power->lock)    |
>>                       | -> status = RPM_SUSPENDING
>>                       | -> runtime_suspend(xe)
>>                       |   -> lock(power->lock)
>> pm_get_sync()        |
>>    -> wait_for_worker |
> 
> again, this case is also impossible because the get_sync is never
> called from inside the power_domains locked area. On both sides that
> is called just from inner portions of the work and then it would be okay.
> 
> As every other current i915-display rpm handling.
> 
>>
>> The wait_for_worker is the waitqueue dance where it sees that runtime_status
>> is RPM_SUSPENDING or RPM_RESUMING and then goes to sleep until the status
>> changes to RPM_SUSPENDED or RPM_RESUMED. Once the worker completes it then
>> wakes up the sleeper. But that wait_for_worker thing creates a dependency
>> underneath. But here that wait dependency is hidden from lockdep AFAICT,
>> since there is no actual lock acquisition happening. There is exactly one
>> lock and two different contexts acquiring it, seems totally fine, so from
>> lock acquisition pov there is no issue.
>>
>> And yet it still deadlocks, since the rpm-core worker (or whatever is
>> running the async suspend) is stuck trying to acquire lock(power->lock), but
>> the other caller is already holding it and won't release it until the worker
>> completes. And this is where the xe pm annotations helps by basically
>> introducing a big-dumb-fake-lock to try to model the dependencies. The above
>> then actually ends up looking like:
>>
>>                            rpm-core worker
>> lock(power->lock)    |
>>                       | -> status = RPM_SUSPENDING
>>                       | -> runtime_suspend(xe)
>>                       |    map_acquire(pm)
>>                       |   -> lock(power->lock)
>>                       |
>> map_acquire(pm)      |
>> map_release(pm)      |
>> pm_get_sync()        |
>>   -> wait_for_worker  |
>>
>> So what we actually have and what lockdep will easily see as potentially
>> possible:
>>
>> lock(power->lock) -> map_acquire(pm)
>> map_acquire(pm) -> lock(power->lock)
>>
>> Which is now a simple locking inversion with A -> B, B -> A. But this is
>> actually a bit silly example with lock(power->lock) since lockdep will see
>> the simple A -> A, but point is there there might exist other locks that are
>> only taken in the suspend callback, for example, and since suspend is
>> usually always async and done from the rpm core worker I don't think lockdep
>> will see the potential deadlocks without the xe pm annotations (comes back
>> to the hidden wait_for_worker dependency).
>>
>> The other thing worth mentioning is that all pm_get_sync() calls are assumed
>> to be capable of waking up the device as per the xe pm annotations. The
>> advantage is that we don't need to hit the full wake-up-the-entire-device
>> slow path for every caller in a real run (probably not possible in single CI
>> run). We instead just need to hit both pm callbacks once in a given CI run
>> (very easy), to record the map_acquire(pm) -> locks-in-pm-callbacks
>> dependencies. And then so long as we also hit every pm_get_sync() call site,
>> which now doesn't need to actually wake the device up for real, we also then
>> get the locks-held-when-waking-up-device -> map_acquire(pm), which should be
>> enough for lockdep to find any clear locking inversions, like the above.
>> With that we can have pretty high confidence we don't have any potential
>> deadlocks. The disadvantage is potential false positives (like maybe in this
>> patch), but when you consider that nasty A -> B, B -> A example, I think it
>> is probably worth it IMO.
>>
>> Also a good read here:
>> https://blog.ffwll.ch/2020/08/lockdep-false-positives.html
> 
> yeap, I have read that entire series more than once. I would never challenge
> the lockdep itself.
> 
> My challenge here is with our annotation. If that is a so problematic case,
> perhaps we should then try to convince the linux core kernel folks to get
> this annotation inside the runtime pm calls itself?
> and likely under the power->lock?
> 
> Our annotation is very good for the various memory handling cases where
> we get the memory locks upfront and then the inner get_sync calls getting
> the same locks would certainly cause the deadlocks mentioned above.
> 
>>
>>>
>>>>
>>>> If we are saying that it is impossible to actually wake up the device in
>>>> this particular case then can we rather make caller use _noresume() or
>>>> ifactive()?
>>>
>>> I'm trying to avoid touching the i915-display runtime-pm code. :/
>>>
>>> At some point I even thought about making all the i915-display bogus on xe
>>> and making the runtime_pm idle to check for display connected, but there
>>> are so many cases where the code take different decisions if runtime_pm
>>> is in-use vs not that it would complicate things a bit anyway.
>>>
>>>>
>>>>>
>>>>> Also worth to mention that on i915, intel_display_power_put_async_work
>>>>> also gets and resume synchronously and the runtime pm get/put
>>>>> also resets the irq and that code was never problematic.
>>>>>
>>>>> Cc: Matthew Auld <matthew.auld at intel.com>
>>>>> Signed-off-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
>>>>> ---
>>>>>     drivers/gpu/drm/xe/xe_pm.c | 7 +++++--
>>>>>     1 file changed, 5 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
>>>>> index b534a194a9ef..919250e38ae0 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_pm.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_pm.c
>>>>> @@ -347,7 +347,10 @@ int xe_pm_runtime_suspend(struct xe_device *xe)
>>>>>     			goto out;
>>>>>     	}
>>>>> +	lock_map_release(&xe_pm_runtime_lockdep_map);
>>>>>     	xe_irq_suspend(xe);
>>>>> +	xe_pm_write_callback_task(xe, NULL);
>>>>> +	return 0;
>>>>>     out:
>>>>>     	lock_map_release(&xe_pm_runtime_lockdep_map);
>>>>>     	xe_pm_write_callback_task(xe, NULL);
>>>>> @@ -369,6 +372,8 @@ int xe_pm_runtime_resume(struct xe_device *xe)
>>>>>     	/* Disable access_ongoing asserts and prevent recursive pm calls */
>>>>>     	xe_pm_write_callback_task(xe, current);
>>>>> +	xe_irq_resume(xe);
>>>>> +
>>>>>     	lock_map_acquire(&xe_pm_runtime_lockdep_map);
>>>>>     	/*
>>>>> @@ -395,8 +400,6 @@ int xe_pm_runtime_resume(struct xe_device *xe)
>>>>>     			goto out;
>>>>>     	}
>>>>> -	xe_irq_resume(xe);
>>>>> -
>>>>>     	for_each_gt(gt, xe, id)
>>>>>     		xe_gt_resume(gt);