[Intel-xe] [PATCH 1/2] drm/xe/pm: Wrap pm_runtime_suspended with power.lock
Gupta, Anshuman
anshuman.gupta at intel.com
Wed Jun 21 09:40:55 UTC 2023
> -----Original Message-----
> From: Matthew Auld <matthew.william.auld at gmail.com>
> Sent: Wednesday, June 21, 2023 2:25 PM
> To: Gupta, Anshuman <anshuman.gupta at intel.com>
> Cc: Brost, Matthew <matthew.brost at intel.com>; Sundaresan, Sujaritha
> <sujaritha.sundaresan at intel.com>; intel-xe at lists.freedesktop.org; Vivi,
> Rodrigo <rodrigo.vivi at intel.com>
> Subject: Re: [Intel-xe] [PATCH 1/2] drm/xe/pm: Wrap
> pm_runtime_suspended with power.lock
>
> On Wed, 21 Jun 2023 at 08:54, Gupta, Anshuman
> <anshuman.gupta at intel.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Matthew Auld <matthew.william.auld at gmail.com>
> > > Sent: Monday, June 19, 2023 6:03 PM
> > > To: Gupta, Anshuman <anshuman.gupta at intel.com>
> > > Cc: Brost, Matthew <matthew.brost at intel.com>; Sundaresan, Sujaritha
> > > <sujaritha.sundaresan at intel.com>; intel-xe at lists.freedesktop.org;
> > > Vivi, Rodrigo <rodrigo.vivi at intel.com>
> > > Subject: Re: [Intel-xe] [PATCH 1/2] drm/xe/pm: Wrap
> > > pm_runtime_suspended with power.lock
> > >
> > > On Mon, 19 Jun 2023 at 12:18, Gupta, Anshuman
> > > <anshuman.gupta at intel.com> wrote:
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Brost, Matthew <matthew.brost at intel.com>
> > > > > Sent: Saturday, June 17, 2023 1:35 AM
> > > > > To: Gupta, Anshuman <anshuman.gupta at intel.com>
> > > > > Cc: intel-xe at lists.freedesktop.org; Nilawar, Badal
> > > > > <badal.nilawar at intel.com>; Vivi, Rodrigo
> > > > > <rodrigo.vivi at intel.com>; Tauro, Riana <riana.tauro at intel.com>;
> > > > > Sundaresan, Sujaritha <sujaritha.sundaresan at intel.com>
> > > > > Subject: Re: [PATCH 1/2] drm/xe/pm: Wrap pm_runtime_suspended
> > > with
> > > > > power.lock
> > > > >
> > > > > On Fri, Jun 16, 2023 at 07:41:14PM +0530, Anshuman Gupta wrote:
> > > > > > pm_runtime_suspended() doc has requirement to use it with PM
> > > > > > lock of @dev to get the trusted value.
> > > > > > Wrap pm_runtime_suspended within
> xe_pm_runtime_suspended()
> > > with
> > > > > proper
> > > > > > locking.
> > > > > >
> > > > > > Signed-off-by: Anshuman Gupta <anshuman.gupta at intel.com>
> > > > > > ---
> > > > > > drivers/gpu/drm/xe/xe_pm.c | 16 ++++++++++++++--
> > > > > > 1 file changed, 14 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_pm.c
> > > > > > b/drivers/gpu/drm/xe/xe_pm.c index b7b57f10ba25..04c5353ba870
> > > > > > 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_pm.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_pm.c
> > > > > > @@ -210,11 +210,23 @@ int xe_pm_runtime_put(struct xe_device
> > > *xe)
> > > > > > return pm_runtime_put_autosuspend(xe->drm.dev);
> > > > > > }
> > > > > >
> > > > > > +static bool xe_pm_runtime_suspended(struct xe_device *xe) {
> > > > > > + unsigned long flags;
> > > > > > + bool suspended;
> > > > > > +
> > > > > > + spin_lock_irqsave(&xe->drm.dev->power.lock, flags);
> > > > > > + suspended = pm_runtime_suspended(xe->drm.dev);
> > > > > > + spin_unlock_irqrestore(&xe->drm.dev->power.lock, flags);
> > > > >
> > > > > A few things.
> > > > >
> > > > > First, you probably can make this spin_lock_irq /
> > > > > spin_unlock_irq as IRQ contexts are not really used in Xe it is
> > > > > doubtful this function would ever be called from an unknown
> context.
> > > > Thanks for review , I will fix this.
> > > > >
> > > > > Second, I have questions if this patch is actually doing
> > > > > anything useful. Yes, I can see doc which indicates that
> > > > > pm_runtime_suspended can only be trusted if the lock is held but
> > > > > looking at the code paths where xe_pm_runtime_suspended is
> > > > > called unsure how this lock
> > > helps.
> > > > >
> > > > > e.g. How is this better?
> > > > This is helpful in case, when pm_runtime_suspended() returns
> > > > status is RPM_SUSPENDING and it will be going to change to
> > > > RPM_SUPENDED,
> > > > rpm_suspend() changes the runtime status from RPM_ACTIVE ->
> > > > RPM_SUSPENDING -> driver->runtime_suspend() callback ->
> > > RPM_SUSPENDED
> > > > https://elixir.bootlin.com/linux/latest/source/drivers/base/power/
> > > > runt
> > > > ime.c#L558
> > > > rpm_suspend() caller has already grabbed the lock and RPM status
> > > changed to RPM_SUSPENDING, therefore Xe has to wait to grab the lock
> > > to get the trusted value, which is RPM_SUSPENDED.
> > > > Without a proper lock, the xe driver may access the vram while the
> > > > device was runtime suspended, which is a violation of PCIe
> > > > specs.> if (trusted value)
> > > > > now the trusted value can change, do something
> > > > The other way around is if Xe has grabbed the lock first then the
> > > > possible runtime status is either RPM_ACTIVE or RUNTIME_SUPENDED
> > > > for
> > > both Xe is increasing usage ref count explicitly to make sure the
> > > device is in an active state.
> > >
> > > From my reading of the RPM core code, it eventually drops the
> > > power.lock before calling any of the driver pm callbacks (see
> > > __rpm_callback() for example), and then only re-acquires it after. I
> > > think it mostly just protects the RPM state. If something is
> > > transitioning from RPM_ACTIVE -> RPM_SUSPENDING or whatever,
> > > grabbing the power.lock won't synchronise the actual operation,
> > > since the lock is not even being held when our callback is being
> > > run. If we want to actually synchronise an ongoing SUSPENDING or
> > > RESUMING you have to call something like
> > > pm_runtime_get_sync() I think.
> > All rpm_{idle,resume,suspend} PM core calls being called with power.lock
> held.
>
> They hold it yes, but my point is that it is eventually dropped and then re-
> acquired during some or all of those calls, like when calling into the driver
> resume or suspend callbacks. Did you take a look at __rpm_callback()? See
> below.
>
> > See below.
> > https://elixir.bootlin.com/linux/latest/source/drivers/base/power/runt
> > ime.c#L1140
> > https://elixir.bootlin.com/linux/latest/source/drivers/base/power/runt
> > ime.c#L1158
> > https://elixir.bootlin.com/linux/latest/source/drivers/base/power/runt
> > ime.c#L1102
> >
> > So during driver callback as well, power lock will be held.
>
> Here is what xe_pm_runtime_get() looks like on the 0 -> 1 transition with
> SUSPENDED -> RESUMING -> ACTIVE:
>
> xe_pm_runtime_get()
> pm_runtime_get_sync()
> __pm_runtime_resume()
> spin_lock_irqsave(power.lock, flags)
> retval = rpm_resume()
> __update_runtime_status(RPM_RESUMING)
> cb = RPM_GET_CALLBACK(runtime_resume);
> rpm_callback(cb)
> __rpm_callback(cb)
> spin_unlock(power.lock); /* !!! */
> ret = cb(dev) /* run driver callback */
> spin_lock(power.lock)
> ..........
> _update_runtime_status(dev, RPM_ACTIVE);
> spin_unlock_irqrestore(power.lock, flags)
>
> >
> > We can not use pm_runtime_get_sync() in the the runtime suspend call
> path as it will cause a deadlock.
> > See xe_device_mem_access_get() is even get called from runtime supend
> call patch as well.
> > This seems intentional to reuse the code.
>
> Right, and that is the major pain point in the below series. But if the PM code
> was indeed holding power.lock over driver callbacks then you would also see
> deadlocks, since our pm callbacks will call
> xe_pm_runtime_suspended() which also now acquires power.lock as per
> this patch.
My Bad I could not see earlier, thanks for correction, I will drop this patch.
Thanks,
Anshuman Gupta.
>
> > >
> > > Also like Matt said, even if this somehow all worked there is still
> > > the actual race window between checking if it is not suspended and
> > > then proceeding as if that is still going to be the case.
> > >
> > > FWIW there is an attempt to fix such issues here, although it's all quite
> nasty:
> > > https://patchwork.freedesktop.org/series/119213/
> > Thanks for pointing the series, will take look in to it.
>
> Thanks.
>
> >
> > Thanks,
> > Anshuman Gupta.
> >
> > >
> > > >
> > > > Thanks ,
> > > > Anshuman Gupta.
> > > > >
> > > > >
> > > > > Than?
> > > > > if (untrusted value)
> > > > > do something
> > > > > Also if you look all over kernel pm_runtime_suspended is called
> > > > > frequently
> > > > > dev->power.lock.
> > > > >
> > > > > Matt
> > > > >
> > > > > > +
> > > > > > + return suspended;
> > > > > > +}
> > > > > > +
> > > > > > /* Return true if resume operation happened and usage count
> > > > > > was increased */ bool
> > > > > > xe_pm_runtime_resume_if_suspended(struct
> > > > > xe_device
> > > > > > *xe) {
> > > > > > /* In case we are suspended we need to immediately wake up */
> > > > > > - if (pm_runtime_suspended(xe->drm.dev))
> > > > > > + if (xe_pm_runtime_suspended(xe))
> > > > > > return !pm_runtime_resume_and_get(xe->drm.dev);
> > > > > >
> > > > > > return false;
> > > > > > @@ -222,6 +234,6 @@ bool
> > > xe_pm_runtime_resume_if_suspended(struct
> > > > > > xe_device *xe)
> > > > > >
> > > > > > int xe_pm_runtime_get_if_active(struct xe_device *xe) {
> > > > > > - WARN_ON(pm_runtime_suspended(xe->drm.dev));
> > > > > > + WARN_ON(xe_pm_runtime_suspended(xe));
> > > > > > return pm_runtime_get_if_active(xe->drm.dev, true); }
> > > > > > --
> > > > > > 2.38.0
> > > > > >
More information about the Intel-xe
mailing list