[PATCH] drm/xe: Kill missing outer runtime PM protection warning

Wed Sep 4 16:06:57 UTC 2024

On Wed, Sep 04, 2024 at 04:42:14PM +0100, Matthew Auld wrote:
> On 04/09/2024 16:03, Rodrigo Vivi wrote:
> > On Wed, Sep 04, 2024 at 01:49:47PM +0100, Matthew Auld wrote:
> > > On 03/09/2024 23:38, Rodrigo Vivi wrote:
> > > > This message was very useful to ensure that Xe was taking all
> > > > the needed outer runtime pm references. However, at this point
> > > > it is only a false positive. So, remove it.
> > > > 
> > > > False positive cases:
> > > > 
> > > > 1:
> > > > [184.983389] xe ...: [drm] Missing outer runtime PM protection
> > > > [snip]
> > > > [184.984096]  drm_ioctl+0x2cf/0x580 [drm]
> > > > [snip]
> > > > [184.984710] xe 0000:00:02.0: Runtime PM usage count underflow!
> > > > 
> > > > In this case the underflow is the problem since we are sure that
> > > > the ioctl is protected. But something else is abusing the 'put'
> > > > calls.
> > > > 
> > > > 2:
> > > > rpm_status:           0000:03:00.0 status=RPM_SUSPENDING
> > > > console:              xe_bo_evict_all (called from suspend)
> > > > xe_sched_job_create:  dev=0000:03:00.0, ...
> > > > xe_sched_job_exec:    dev=0000:03:00.0, ...
> > > > xe_pm_runtime_put:    dev=0000:03:00.0, ...
> > > > xe_sched_job_run:     dev=0000:03:00.0, ...
> > > > rpm_usage:            0000:03:00.0 flags-0 cnt-2  ...
> > > > rpm_usage:            0000:03:00.0 flags-0 cnt-2  ...
> > > > rpm_usage:            0000:03:00.0 flags-0 cnt-2  ...
> > > > console:              xe 0000:03:00.0: [drm] Missing outer runtime
> > > > 		      	 	       	     	     PM protection
> > > > console:               xe_guc_ct_send+0x15/0x50 [xe]
> > > > console:               guc_exec_queue_run_job+0x1509/0x3950 [xe]
> > > > [snip]
> > > > console:               drm_sched_run_job_work+0x649/0xc20
> > > > 
> > > > At this point, BOs are getting evicted from VRAM with rpm
> > > > usage-counter = 2, but rpm status = SUSPENDING.
> > > > The xe->pm_callback_task won't be equal 'current' because this call is
> > > > coming from a work queue.
> > > > 
> > > > So, pm_runtime_get_if_active() will be called and return 0 because rpm
> > > > status != ACTIVE (but equal SUSPENDING).
> > > > 
> > > > The only way out is to just grab the reference and move on.
> > > > 
> > > > Cc: Matthew Brost <matthew.brost at intel.com>
> > > > Cc: Matthew Auld <matthew.auld at intel.com>
> > > > Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
> > > > Signed-off-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > > > ---
> > > >    drivers/gpu/drm/xe/xe_pm.c | 10 ++--------
> > > >    1 file changed, 2 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
> > > > index da68cd689a96..e1a5e43b0f34 100644
> > > > --- a/drivers/gpu/drm/xe/xe_pm.c
> > > > +++ b/drivers/gpu/drm/xe/xe_pm.c
> > > > @@ -592,20 +592,14 @@ bool xe_pm_runtime_get_if_in_use(struct xe_device *xe)
> > > >     * xe_pm_runtime_get_noresume - Bump runtime PM usage counter without resuming
> > > >     * @xe: xe device instance
> > > >     *
> > > > - * This function should be used in inner places where it is surely already
> > > > + * This function should *only* be used in inner places where it is surely already
> > > >     * protected by outer-bound callers of `xe_pm_runtime_get`.
> > > > - * It will warn if not protected.
> > > >     * The reference should be put back after this function regardless, since it
> > > >     * will always bump the usage counter, regardless.
> > > >     */
> > > >    void xe_pm_runtime_get_noresume(struct xe_device *xe)
> > > >    {
> > > > -	bool ref;
> > > > -
> > > > -	ref = xe_pm_runtime_get_if_in_use(xe);
> > > > -
> > > > -	if (drm_WARN(&xe->drm, !ref, "Missing outer runtime PM protection\n"))
> > > > -		pm_runtime_get_noresume(xe->drm.dev);
> > > 
> > > This has proven to find real bugs in the past, right? If so, it seems
> > > unfortunate to drop this completely?
> > 
> > You do have a point....
> > 
> > > What about making it slightly more
> > > fuzzy with something like:
> > > 
> > > drm_WARN(!pm_read_callback_task() && !pm_runtime_active(), ...
> > 
> > I don't like inspecting the status directly because it can be racy,
> > but in this particular case it would for sure avoid the case-2 of
> > our false positives. But perhaps something like this:
> > 
> > @@ -600,12 +600,14 @@ bool xe_pm_runtime_get_if_in_use(struct xe_device *xe)
> >    */
> >   void xe_pm_runtime_get_noresume(struct xe_device *xe)
> >   {
> > +       struct device *dev = xe->drm.dev;
> >          bool ref;
> >          ref = xe_pm_runtime_get_if_in_use(xe);
> > -       if (drm_WARN(&xe->drm, !ref, "Missing outer runtime PM protection\n"))
> > -               pm_runtime_get_noresume(xe->drm.dev);
> > +       if (drm_WARN(&xe->drm, !ref && dev->power.runtime_status != RPM_SUSPENDING,
> > +                    "Missing outer runtime PM protection\n"))
> > +               pm_runtime_get_noresume(dev);
> > 
> > 
> > > 
> > > That should avoid the false positive, at the cost of not finding some real
> > > bugs, but at least gives us something?
> > 
> > Well, right, this eliminates the false positive case 2.
> > But still at least the false positive case 1.
> 
> I see that more as secondary issue due to some ref imbalance or something.
> Do you have some logs for 1 or link? I might have an idea.

https://intel-gfx-ci.01.org/tree/intel-xe/IGT_7986/shard-adlp-9/igt@kms_cursor_crc@cursor-onscreen-128x42.html

> 
> > 
> > The big problem that I saw with this warning at this moment, was that
> > CI-buglog classified all the "Missing outer runtime PM protection"
> > in a single bucket and that was ignored because we had the issue of
> > the pending ggtt_node removal case to handle.
> > 
> > Meanwhile we masked all these bugs that I'm hunting now, because
> > they all looked the ggtt_node removal.
> > 
> > But well, I guess we could workaround and continue the clean-up
> > on the CI-buglog side and workaround false positives instead of
> > simply removing this protection and going blindly forever on
> > the unprotected potential new inner callers...
> > 
> > thought on the diff above? (&& dev->power.runtime_status != RPM_SUSPENDING)
> 
> Yeah seems fine, but maybe also need to check resuming also?

hmm indeed! Thanks

> 
> > 
> > Thanks a lot,
> > Rodrigo.
> > 
> > > 
> > > > +	pm_runtime_get_noresume(xe->drm.dev);
> > > >    }
> > > >    /**