[PATCH] drm/xe: Kill missing outer runtime PM protection warning
Matthew Auld
matthew.auld at intel.com
Wed Sep 4 15:42:14 UTC 2024
On 04/09/2024 16:03, Rodrigo Vivi wrote:
> On Wed, Sep 04, 2024 at 01:49:47PM +0100, Matthew Auld wrote:
>> On 03/09/2024 23:38, Rodrigo Vivi wrote:
>>> This message was very useful to ensure that Xe was taking all
>>> the needed outer runtime pm references. However, at this point
>>> it is only a false positive. So, remove it.
>>>
>>> False positive cases:
>>>
>>> 1:
>>> [184.983389] xe ...: [drm] Missing outer runtime PM protection
>>> [snip]
>>> [184.984096] drm_ioctl+0x2cf/0x580 [drm]
>>> [snip]
>>> [184.984710] xe 0000:00:02.0: Runtime PM usage count underflow!
>>>
>>> In this case the underflow is the problem since we are sure that
>>> the ioctl is protected. But something else is abusing the 'put'
>>> calls.
>>>
>>> 2:
>>> rpm_status: 0000:03:00.0 status=RPM_SUSPENDING
>>> console: xe_bo_evict_all (called from suspend)
>>> xe_sched_job_create: dev=0000:03:00.0, ...
>>> xe_sched_job_exec: dev=0000:03:00.0, ...
>>> xe_pm_runtime_put: dev=0000:03:00.0, ...
>>> xe_sched_job_run: dev=0000:03:00.0, ...
>>> rpm_usage: 0000:03:00.0 flags-0 cnt-2 ...
>>> rpm_usage: 0000:03:00.0 flags-0 cnt-2 ...
>>> rpm_usage: 0000:03:00.0 flags-0 cnt-2 ...
>>> console: xe 0000:03:00.0: [drm] Missing outer runtime
>>> PM protection
>>> console: xe_guc_ct_send+0x15/0x50 [xe]
>>> console: guc_exec_queue_run_job+0x1509/0x3950 [xe]
>>> [snip]
>>> console: drm_sched_run_job_work+0x649/0xc20
>>>
>>> At this point, BOs are getting evicted from VRAM with rpm
>>> usage-counter = 2, but rpm status = SUSPENDING.
>>> The xe->pm_callback_task won't be equal 'current' because this call is
>>> coming from a work queue.
>>>
>>> So, pm_runtime_get_if_active() will be called and return 0 because rpm
>>> status != ACTIVE (but equal SUSPENDING).
>>>
>>> The only way out is to just grab the reference and move on.
>>>
>>> Cc: Matthew Brost <matthew.brost at intel.com>
>>> Cc: Matthew Auld <matthew.auld at intel.com>
>>> Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
>>> Signed-off-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
>>> ---
>>> drivers/gpu/drm/xe/xe_pm.c | 10 ++--------
>>> 1 file changed, 2 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
>>> index da68cd689a96..e1a5e43b0f34 100644
>>> --- a/drivers/gpu/drm/xe/xe_pm.c
>>> +++ b/drivers/gpu/drm/xe/xe_pm.c
>>> @@ -592,20 +592,14 @@ bool xe_pm_runtime_get_if_in_use(struct xe_device *xe)
>>> * xe_pm_runtime_get_noresume - Bump runtime PM usage counter without resuming
>>> * @xe: xe device instance
>>> *
>>> - * This function should be used in inner places where it is surely already
>>> + * This function should *only* be used in inner places where it is surely already
>>> * protected by outer-bound callers of `xe_pm_runtime_get`.
>>> - * It will warn if not protected.
>>> * The reference should be put back after this function regardless, since it
>>> * will always bump the usage counter, regardless.
>>> */
>>> void xe_pm_runtime_get_noresume(struct xe_device *xe)
>>> {
>>> - bool ref;
>>> -
>>> - ref = xe_pm_runtime_get_if_in_use(xe);
>>> -
>>> - if (drm_WARN(&xe->drm, !ref, "Missing outer runtime PM protection\n"))
>>> - pm_runtime_get_noresume(xe->drm.dev);
>>
>> This has proven to find real bugs in the past, right? If so, it seems
>> unfortunate to drop this completely?
>
> You do have a point....
>
>> What about making it slightly more
>> fuzzy with something like:
>>
>> drm_WARN(!pm_read_callback_task() && !pm_runtime_active(), ...
>
> I don't like inspecting the status directly because it can be racy,
> but in this particular case it would for sure avoid the case-2 of
> our false positives. But perhaps something like this:
>
> @@ -600,12 +600,14 @@ bool xe_pm_runtime_get_if_in_use(struct xe_device *xe)
> */
> void xe_pm_runtime_get_noresume(struct xe_device *xe)
> {
> + struct device *dev = xe->drm.dev;
> bool ref;
>
> ref = xe_pm_runtime_get_if_in_use(xe);
>
> - if (drm_WARN(&xe->drm, !ref, "Missing outer runtime PM protection\n"))
> - pm_runtime_get_noresume(xe->drm.dev);
> + if (drm_WARN(&xe->drm, !ref && dev->power.runtime_status != RPM_SUSPENDING,
> + "Missing outer runtime PM protection\n"))
> + pm_runtime_get_noresume(dev);
>
>
>>
>> That should avoid the false positive, at the cost of not finding some real
>> bugs, but at least gives us something?
>
> Well, right, this eliminates the false positive case 2.
> But still at least the false positive case 1.
I see that more as secondary issue due to some ref imbalance or
something. Do you have some logs for 1 or link? I might have an idea.
>
> The big problem that I saw with this warning at this moment, was that
> CI-buglog classified all the "Missing outer runtime PM protection"
> in a single bucket and that was ignored because we had the issue of
> the pending ggtt_node removal case to handle.
>
> Meanwhile we masked all these bugs that I'm hunting now, because
> they all looked the ggtt_node removal.
>
> But well, I guess we could workaround and continue the clean-up
> on the CI-buglog side and workaround false positives instead of
> simply removing this protection and going blindly forever on
> the unprotected potential new inner callers...
>
> thought on the diff above? (&& dev->power.runtime_status != RPM_SUSPENDING)
Yeah seems fine, but maybe also need to check resuming also?
>
> Thanks a lot,
> Rodrigo.
>
>>
>>> + pm_runtime_get_noresume(xe->drm.dev);
>>> }
>>> /**
More information about the Intel-xe
mailing list