[PATCH] drm/xe: Kill missing outer runtime PM protection warning

Wed Sep 4 15:42:14 UTC 2024

On 04/09/2024 16:03, Rodrigo Vivi wrote:
> On Wed, Sep 04, 2024 at 01:49:47PM +0100, Matthew Auld wrote:
>> On 03/09/2024 23:38, Rodrigo Vivi wrote:
>>> This message was very useful to ensure that Xe was taking all
>>> the needed outer runtime pm references. However, at this point
>>> it is only a false positive. So, remove it.
>>>
>>> False positive cases:
>>>
>>> 1:
>>> [184.983389] xe ...: [drm] Missing outer runtime PM protection
>>> [snip]
>>> [184.984096]  drm_ioctl+0x2cf/0x580 [drm]
>>> [snip]
>>> [184.984710] xe 0000:00:02.0: Runtime PM usage count underflow!
>>>
>>> In this case the underflow is the problem since we are sure that
>>> the ioctl is protected. But something else is abusing the 'put'
>>> calls.
>>>
>>> 2:
>>> rpm_status:           0000:03:00.0 status=RPM_SUSPENDING
>>> console:              xe_bo_evict_all (called from suspend)
>>> xe_sched_job_create:  dev=0000:03:00.0, ...
>>> xe_sched_job_exec:    dev=0000:03:00.0, ...
>>> xe_pm_runtime_put:    dev=0000:03:00.0, ...
>>> xe_sched_job_run:     dev=0000:03:00.0, ...
>>> rpm_usage:            0000:03:00.0 flags-0 cnt-2  ...
>>> rpm_usage:            0000:03:00.0 flags-0 cnt-2  ...
>>> rpm_usage:            0000:03:00.0 flags-0 cnt-2  ...
>>> console:              xe 0000:03:00.0: [drm] Missing outer runtime
>>> 		      	 	       	     	     PM protection
>>> console:               xe_guc_ct_send+0x15/0x50 [xe]
>>> console:               guc_exec_queue_run_job+0x1509/0x3950 [xe]
>>> [snip]
>>> console:               drm_sched_run_job_work+0x649/0xc20
>>>
>>> At this point, BOs are getting evicted from VRAM with rpm
>>> usage-counter = 2, but rpm status = SUSPENDING.
>>> The xe->pm_callback_task won't be equal 'current' because this call is
>>> coming from a work queue.
>>>
>>> So, pm_runtime_get_if_active() will be called and return 0 because rpm
>>> status != ACTIVE (but equal SUSPENDING).
>>>
>>> The only way out is to just grab the reference and move on.
>>>
>>> Cc: Matthew Brost <matthew.brost at intel.com>
>>> Cc: Matthew Auld <matthew.auld at intel.com>
>>> Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
>>> Signed-off-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
>>> ---
>>>    drivers/gpu/drm/xe/xe_pm.c | 10 ++--------
>>>    1 file changed, 2 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
>>> index da68cd689a96..e1a5e43b0f34 100644
>>> --- a/drivers/gpu/drm/xe/xe_pm.c
>>> +++ b/drivers/gpu/drm/xe/xe_pm.c
>>> @@ -592,20 +592,14 @@ bool xe_pm_runtime_get_if_in_use(struct xe_device *xe)
>>>     * xe_pm_runtime_get_noresume - Bump runtime PM usage counter without resuming
>>>     * @xe: xe device instance
>>>     *
>>> - * This function should be used in inner places where it is surely already
>>> + * This function should *only* be used in inner places where it is surely already
>>>     * protected by outer-bound callers of `xe_pm_runtime_get`.
>>> - * It will warn if not protected.
>>>     * The reference should be put back after this function regardless, since it
>>>     * will always bump the usage counter, regardless.
>>>     */
>>>    void xe_pm_runtime_get_noresume(struct xe_device *xe)
>>>    {
>>> -	bool ref;
>>> -
>>> -	ref = xe_pm_runtime_get_if_in_use(xe);
>>> -
>>> -	if (drm_WARN(&xe->drm, !ref, "Missing outer runtime PM protection\n"))
>>> -		pm_runtime_get_noresume(xe->drm.dev);
>>
>> This has proven to find real bugs in the past, right? If so, it seems
>> unfortunate to drop this completely?
> 
> You do have a point....
> 
>> What about making it slightly more
>> fuzzy with something like:
>>
>> drm_WARN(!pm_read_callback_task() && !pm_runtime_active(), ...
> 
> I don't like inspecting the status directly because it can be racy,
> but in this particular case it would for sure avoid the case-2 of
> our false positives. But perhaps something like this:
> 
> @@ -600,12 +600,14 @@ bool xe_pm_runtime_get_if_in_use(struct xe_device *xe)
>    */
>   void xe_pm_runtime_get_noresume(struct xe_device *xe)
>   {
> +       struct device *dev = xe->drm.dev;
>          bool ref;
>   
>          ref = xe_pm_runtime_get_if_in_use(xe);
>   
> -       if (drm_WARN(&xe->drm, !ref, "Missing outer runtime PM protection\n"))
> -               pm_runtime_get_noresume(xe->drm.dev);
> +       if (drm_WARN(&xe->drm, !ref && dev->power.runtime_status != RPM_SUSPENDING,
> +                    "Missing outer runtime PM protection\n"))
> +               pm_runtime_get_noresume(dev);
> 
> 
>>
>> That should avoid the false positive, at the cost of not finding some real
>> bugs, but at least gives us something?
> 
> Well, right, this eliminates the false positive case 2.
> But still at least the false positive case 1.

I see that more as secondary issue due to some ref imbalance or 
something. Do you have some logs for 1 or link? I might have an idea.

> 
> The big problem that I saw with this warning at this moment, was that
> CI-buglog classified all the "Missing outer runtime PM protection"
> in a single bucket and that was ignored because we had the issue of
> the pending ggtt_node removal case to handle.
> 
> Meanwhile we masked all these bugs that I'm hunting now, because
> they all looked the ggtt_node removal.
> 
> But well, I guess we could workaround and continue the clean-up
> on the CI-buglog side and workaround false positives instead of
> simply removing this protection and going blindly forever on
> the unprotected potential new inner callers...
> 
> thought on the diff above? (&& dev->power.runtime_status != RPM_SUSPENDING)

Yeah seems fine, but maybe also need to check resuming also?

> 
> Thanks a lot,
> Rodrigo.
> 
>>
>>> +	pm_runtime_get_noresume(xe->drm.dev);
>>>    }
>>>    /**