[Intel-xe] [PATCH v12 01/13] drm/xe: fix xe_device_mem_access_get() races
Gupta, Anshuman
anshuman.gupta at intel.com
Tue Jul 11 17:56:20 UTC 2023
> -----Original Message-----
> From: Auld, Matthew <matthew.auld at intel.com>
> Sent: Tuesday, July 11, 2023 4:36 PM
> To: Gupta, Anshuman <anshuman.gupta at intel.com>; intel-
> xe at lists.freedesktop.org
> Cc: Vivi, Rodrigo <rodrigo.vivi at intel.com>; Thomas Hellström
> <thomas.hellstrom at linux.intel.com>; Brost, Matthew
> <matthew.brost at intel.com>
> Subject: Re: [PATCH v12 01/13] drm/xe: fix xe_device_mem_access_get()
> races
>
> On 11/07/2023 10:00, Gupta, Anshuman wrote:
> >
> >
> >> -----Original Message-----
> >> From: Auld, Matthew <matthew.auld at intel.com>
> >> Sent: Tuesday, July 4, 2023 9:30 PM
> >> To: Gupta, Anshuman <anshuman.gupta at intel.com>; intel-
> >> xe at lists.freedesktop.org
> >> Cc: Vivi, Rodrigo <rodrigo.vivi at intel.com>; Thomas Hellström
> >> <thomas.hellstrom at linux.intel.com>; Brost, Matthew
> >> <matthew.brost at intel.com>
> >> Subject: Re: [PATCH v12 01/13] drm/xe: fix xe_device_mem_access_get()
> >> races
> >>
> >> On 04/07/2023 16:29, Gupta, Anshuman wrote:
> >>>
> >>>
> >>> On 7/4/2023 4:55 PM, Matthew Auld wrote:
> >>>> On 30/06/2023 16:22, Gupta, Anshuman wrote:
> >>>>>
> >>>>>
> >>>>> On 6/26/2023 4:20 PM, Matthew Auld wrote:
> >>>>>> It looks like there is at least one race here, given that the
> >>>>>> pm_runtime_suspended() check looks to return false if we are in
> >>>>>> the process of suspending the device (RPM_SUSPENDING vs
> >>>>>> RPM_SUSPENDED). We later also do xe_pm_runtime_get_if_active(),
> >>>>>> but since the device is suspending or has now suspended, this
> >>>>>> doesn't do anything either. Following from this we can
> >>>>>> potentially return from xe_device_mem_access_get() with the
> >>>>>> device suspended or about to be, leading to broken behaviour.
> >>>>>>
> >>>>>> Attempt to fix this by always grabbing the runtime ref when our
> >>>>>> internal ref transitions from 0 -> 1. The hard part is then
> >>>>>> dealing with the runtime_pm callbacks also calling
> >>>>>> xe_device_mem_access_get() and deadlocking, which the
> >>>>>> pm_runtime_suspended() check prevented.
> >>>>>>
> >>>>>> v2:
> >>>>>>   - ct->lock looks to be primed with fs_reclaim, so holding that and then
> >>>>>>     allocating memory will cause lockdep to complain. Now that we
> >>>>>>     unconditionally grab the mem_access.lock around mem_access_{get,put}, we
> >>>>>>     need to change the ordering wrt grabbing the ct->lock, since some of
> >>>>>>     the runtime_pm routines can allocate memory (or at least that's what
> >>>>>>     lockdep seems to suggest). Hopefully not a big deal. It might be that
> >>>>>>     there were already issues with this, just that the atomics were
> >>>>>>     "hiding" the potential issues.
> >>>>>> v3:
> >>>>>>   - Use Thomas Hellström's idea with tracking the active task that is
> >>>>>>     executing in the resume or suspend callback, in order to avoid
> >>>>>>     recursive resume/suspend calls deadlocking on itself.
> >>>>>>   - Split the ct->lock change.
> >>>>>> v4:
> >>>>>>   - Add smp_mb() around accessing the pm_callback_task for extra safety.
> >>>>>>     (Thomas Hellström)
> >>>>>> v5:
> >>>>>>   - Clarify the kernel-doc for the mem_access.lock, given that it is quite
> >>>>>>     strange in what it protects (data vs code). The real motivation is to
> >>>>>>     aid lockdep. (Rodrigo Vivi)
> >>>>>> v6:
> >>>>>>   - Split out the lock change. We still want this as a lockdep aid but
> >>>>>>     only for the xe_device_mem_access_get() path. Sticking a lock on the
> >>>>>>     put() looks to be a no-go, also the runtime_put() there is always async.
> >>>>>>   - Now that the lock is gone move to atomics and rely on the pm code
> >>>>>>     serialising multiple callers on the 0 -> 1 transition.
> >>>>>>   - g2h_worker_func() looks to be the next issue, given that
> >>>>>>     suspend-resume callbacks are using CT, so try to handle that.
> >>>>>> v7:
> >>>>>>   - Add xe_device_mem_access_get_if_ongoing(), and use it in
> >>>>>>     g2h_worker_func().
> >>>>>>
> >>>>>> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/258
> >>>>>> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
> >>>>>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> >>>>>> Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
> >>>>>> Cc: Matthew Brost <matthew.brost at intel.com>
> >>>>>> Cc: Anshuman Gupta <anshuman.gupta at intel.com>
> >>>>>> ---
> >>>>>>   drivers/gpu/drm/xe/xe_device.c       | 58 +++++++++++++++++++-----
> >>>>>>   drivers/gpu/drm/xe/xe_device.h       | 12 ++---
> >>>>>>   drivers/gpu/drm/xe/xe_device_types.h |  6 +++
> >>>>>>   drivers/gpu/drm/xe/xe_guc_ct.c       | 34 +++++++++++++-
> >>>>>>   drivers/gpu/drm/xe/xe_pm.c           | 66 +++++++++++++++++++---------
> >>>>>>   drivers/gpu/drm/xe/xe_pm.h           |  3 +-
> >>>>>>   6 files changed, 134 insertions(+), 45 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> >>>>>> index c7985af85a53..1dc552da434f 100644
> >>>>>> --- a/drivers/gpu/drm/xe/xe_device.c
> >>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
> >>>>>> @@ -411,27 +411,61 @@ u32 xe_device_ccs_bytes(struct xe_device *xe, u64 size)
> >>>>>> 		DIV_ROUND_UP(size, NUM_BYTES_PER_CCS_BYTE) : 0;
> >>>>>> }
> >>>>>> +bool xe_device_mem_access_ongoing(struct xe_device *xe)
> >>>>>> +{
> >>>>>> +	if (xe_pm_read_callback_task(xe) != NULL)
> >>>>>> +		return true;
> >>>>>> +
> >>>>>> +	return atomic_read(&xe->mem_access.ref);
> >>>>>> +}
> >>>>>> +
> >>>>>> +void xe_device_assert_mem_access(struct xe_device *xe)
> >>>>>> +{
> >>>>>> +	XE_WARN_ON(!xe_device_mem_access_ongoing(xe));
> >>>>>> +}
> >>>>>> +
> >>>>>> +bool xe_device_mem_access_get_if_ongoing(struct xe_device *xe)
> >>>>>> +{
> >>>>>> +	return atomic_inc_not_zero(&xe->mem_access.ref);
> >>>>>> +}
> >>>>>> +
> >>>>>> void xe_device_mem_access_get(struct xe_device *xe)
> >>>>>> {
> >>>>>> - bool resumed = xe_pm_runtime_resume_if_suspended(xe);
> >>>>>> - int ref = atomic_inc_return(&xe->mem_access.ref);
> >>>>>> +	/*
> >>>>>> +	 * This looks racy, but should be fine since the pm_callback_task
> >>>>>> +	 * only transitions from NULL -> current (and back to NULL again),
> >>>>>> +	 * during the runtime_resume() or runtime_suspend() callbacks, for
> >>>>>> +	 * which there can only be a single one running for our device. We
> >>>>>> +	 * only need to prevent recursively calling the runtime_get or
> >>>>>> +	 * runtime_put from those callbacks, as well as preventing
> >>>>>> +	 * triggering any access_ongoing asserts.
> >>>>>> +	 */
> >>>>> Two runtime_suspend() calls can run in parallel for two different
> >>>>> PCI devices, with the worker threads pooled by the pm_wq
> >>>>> workqueue. Is it guaranteed that the task_struct will be different
> >>>>> for two workers spawned by the same pm_wq?
> >>>>
> >>>> IIUC we only use pm_wq for the async put(). If somehow task_struct
> >>>> can be the same for different workers when using pm_wq (I'm not
> >>>> sure tbh), then I think it's fine since all put() must be balanced
> >>>> with a get(), and all get() are handled directly in the caller (not
> >>>> pm_wq) since it's non-async, and must first wait for any running
> >>>> callback.
> >>> Agree with that.
> >>>>
> >>>>>
> >>>>>> +	if (xe_pm_read_callback_task(xe) == current)
> >>>>>> +		return;
> >>>>>> -	if (ref == 1)
> >>>>>> -		xe->mem_access.hold_rpm = xe_pm_runtime_get_if_active(xe);
> >>>>>> +	if (!atomic_inc_not_zero(&xe->mem_access.ref)) {
> >>>>>> +		bool hold_rpm = xe_pm_runtime_resume_and_get(xe);
> >>>>>> +		int ref;
> >>>>>> -	/* The usage counter increased if device was immediately resumed */
> >>>>>> -	if (resumed)
> >>>>>> -		xe_pm_runtime_put(xe);
> >>>>>> -
> >>>>>> -	XE_WARN_ON(ref == S32_MAX);
> >>>>>> +		ref = atomic_inc_return(&xe->mem_access.ref);
> >>>>>> +		if (ref == 1)
> >>>>>> +			xe->mem_access.hold_rpm = hold_rpm;
> >>>>>> +		else
> >>>>>> +			xe_pm_runtime_put(xe);
> >>> AFAIU the above is not possible?
> >>
> >> Do you mean the xe_pm_runtime_put() part? On the 0 -> 1 transition we
> >> just always call xe_pm_runtime_resume_and_get(). That serialises
> >> multiple callers such that exactly one runs our rpm resume callback
> >> and the rest all block waiting for it to transition into RPM_ACTIVE.
> >> However once we return from
> >> xe_pm_runtime_resume_and_get() there are potentially several
> >> different callers holding an rpm ref. We only need the one rpm ref
> >> for our needs, so we simply drop the rest here keeping the one that first
> incremented the ref.
> > Thanks for explanation.
> > I am still not clear on how we are handling the case of an error
> > returned by xe_pm_runtime_resume_and_get().
> > If one of two simultaneous calls to xe_device_mem_access_get() fails
> > at xe_pm_runtime_resume_and_get(), does it drop the rpm ref of the
> > other call that passed as well?
>
>
> I guess we could do something like:
>
> - if (ref == 1)
> + if (ref == 1 && hold_rpm)
> xe->mem_access.hold_rpm = hold_rpm;
> - else
> + else if (hold_rpm)
> xe_pm_runtime_put(xe);
>
> Or we can just always grab the rpm:
>
> - if (!atomic_inc_not_zero(&xe->mem_access.ref)) {
> - bool hold_rpm = xe_pm_runtime_resume_and_get(xe);
> - int ref;
> -
> - ref = atomic_inc_return(&xe->mem_access.ref);
> - if (ref == 1)
> - xe->mem_access.hold_rpm = hold_rpm;
> - else
> - xe_pm_runtime_put(xe);
> - } else {
> - XE_WARN_ON(atomic_read(&xe->mem_access.ref) == S32_MAX);
> - }
> + xe_pm_runtime_resume_and_get(xe);
> + ref = atomic_inc_return(&xe->mem_access.ref);
> + XE_WARN_ON(ref == S32_MAX);
IMO this is simple and better.
>
> lock_map_release(&xe_device_mem_access_lockdep_map);
> }
> @@ -489,8 +480,7 @@ void xe_device_mem_access_put(struct xe_device *xe)
> 		return;
>
> ref = atomic_dec_return(&xe->mem_access.ref);
> - if (ref == 0 && xe->mem_access.hold_rpm)
> - xe_pm_runtime_put(xe);
> + xe_pm_runtime_put(xe);
>
>
> And then also remove the hold_rpm stuff. But either way there is still
> the big issue of what to do if the rpm get fails. If it fails, as you
> say, the driver is still broken even with the above.
AFAIU we can't bail out from such a state; printing a kernel error will suffice here.
Referring to the runtime PM documentation:
* If the resume callback returns an error code, the PM core regards this as a
fatal error and will refuse to run the helper functions described in Section
4 for the device, until its status is directly set to either 'active', or
'suspended' (by means of special helper functions provided by the PM core
for this purpose).
*https://www.kernel.org/doc/Documentation/power/runtime_pm.txt
>
> > That would eventually provide access to the PCI BAR without RPM
> > protection.
> > We should also check the value returned by
> > xe_pm_runtime_resume_and_get() before calling
> > xe_pm_runtime_put(), as pm_runtime_resume_and_get() also drops the
> > device usage count in the error case.
>
> So do we already know how we want to handle such failures from the
> caller's PoV? Currently there is no return value for
> xe_device_mem_access_get(), so the caller has zero clue, and any device
> access will end badly anyway. But I wasn't really trying to solve that
> in this series.
There is no way to handle a runtime PM error; my comment was just about the
nonessential xe_pm_runtime_put().
BR,
Anshuman Gupta.
>
> Do we just want to return an error if access_get fails? If so, handling
> the above becomes quite simple, with something like the following in
> xe_device_mem_access_get():
>
> + if (!xe_pm_runtime_resume_and_get(xe))
> + return false;
>
> > Br,
> > Anshuman Gupta.
> >>
> >>>>>> +	} else {
> >>>>>> +		XE_WARN_ON(atomic_read(&xe->mem_access.ref) == S32_MAX);
> >>>>>> +	}
> >>>>>> }
> >>>>>> void xe_device_mem_access_put(struct xe_device *xe)
> >>>>>> {
> >>>>>> - bool hold = xe->mem_access.hold_rpm;
> >>>>>> - int ref = atomic_dec_return(&xe->mem_access.ref);
> >>>>>> + int ref;
> >>>>>> - if (!ref && hold)
> >>>>>> + if (xe_pm_read_callback_task(xe) == current)
> >>>>>> + return;
> >>>>>> +
> >>>>>> + ref = atomic_dec_return(&xe->mem_access.ref);
> >>>>>> + if (ref == 0 && xe->mem_access.hold_rpm)
> >>>>>> xe_pm_runtime_put(xe);
> >>>>>> XE_WARN_ON(ref < 0);
> >>>>> /snip
> >>>>>> +
> >>>>>> int xe_pm_runtime_suspend(struct xe_device *xe)
> >>>>>> {
> >>>>>> struct xe_gt *gt;
> >>>>>> u8 id;
> >>>>>> - int err;
> >>>>>> + int err = 0;
> >>>>>> +
> >>>>>> +	if (xe->d3cold_allowed && xe_device_mem_access_ongoing(xe))
> >>>>>> +		return -EBUSY;
> >>>>> Not related to this patch, but we should return -EBUSY for
> >>>>> d3hot as well.
> >>>>
> >>>> Looking at this again, access_ongoing is always going to be false
> >>>> here, right? On the 0 -> 1 transition we always do a full sync pm
> >>>> get before incrementing mem_access.ref, so I'm not sure this check
> >>>> actually does anything.
> >>> I believe this is a paranoid check against any unbalanced put.
> >>> Br,
> >>> Anshuman Gupta.
> >>>>
> >>>>> Br,
> >>>>> Anshuman Gupta
> >>>>>> +
> >>>>>> +	/* Disable access_ongoing asserts and prevent recursive pm calls */
> >>>>>> +	xe_pm_write_callback_task(xe, current);
> >>>>>> if (xe->d3cold_allowed) {
> >>>>>> - if (xe_device_mem_access_ongoing(xe))
> >>>>>> - return -EBUSY;
> >>>>>> -
> >>>>>> err = xe_bo_evict_all(xe);
> >>>>>> if (err)
> >>>>>> - return err;
> >>>>>> + goto out;
> >>>>>> }
> >>>>>> for_each_gt(gt, xe, id) {
> >>>>>> err = xe_gt_suspend(gt);
> >>>>>> if (err)
> >>>>>> - return err;
> >>>>>> + goto out;
> >>>>>> }
> >>>>>> xe_irq_suspend(xe);
> >>>>>> -
> >>>>>> - return 0;
> >>>>>> +out:
> >>>>>> + xe_pm_write_callback_task(xe, NULL);
> >>>>>> + return err;
> >>>>>> }
> >>>>>> int xe_pm_runtime_resume(struct xe_device *xe)
> >>>>>> {
> >>>>>> struct xe_gt *gt;
> >>>>>> u8 id;
> >>>>>> - int err;
> >>>>>> + int err = 0;
> >>>>>> +
> >>>>>> +	/* Disable access_ongoing asserts and prevent recursive pm calls */
> >>>>>> +	xe_pm_write_callback_task(xe, current);
> >>>>>> if (xe->d3cold_allowed) {
> >>>>>> for_each_gt(gt, xe, id) {
> >>>>>> err = xe_pcode_init(gt);
> >>>>>> if (err)
> >>>>>> - return err;
> >>>>>> + goto out;
> >>>>>> }
> >>>>>> /*
> >>>>>> @@ -182,7 +210,7 @@ int xe_pm_runtime_resume(struct xe_device *xe)
> >>>>>> */
> >>>>>> err = xe_bo_restore_kernel(xe);
> >>>>>> if (err)
> >>>>>> - return err;
> >>>>>> + goto out;
> >>>>>> }
> >>>>>> xe_irq_resume(xe);
> >>>>>> @@ -193,10 +221,11 @@ int xe_pm_runtime_resume(struct xe_device *xe)
> >>>>>> if (xe->d3cold_allowed) {
> >>>>>> err = xe_bo_restore_user(xe);
> >>>>>> if (err)
> >>>>>> - return err;
> >>>>>> + goto out;
> >>>>>> }
> >>>>>> -
> >>>>>> - return 0;
> >>>>>> +out:
> >>>>>> + xe_pm_write_callback_task(xe, NULL);
> >>>>>> + return err;
> >>>>>> }
> >>>>>> int xe_pm_runtime_get(struct xe_device *xe)
> >>>>>> @@ -210,14 +239,9 @@ int xe_pm_runtime_put(struct xe_device *xe)
> >>>>>> 	return pm_runtime_put_autosuspend(xe->drm.dev);
> >>>>>> }
> >>>>>> -/* Return true if resume operation happened and usage count was increased */
> >>>>>> -bool xe_pm_runtime_resume_if_suspended(struct xe_device *xe)
> >>>>>> +bool xe_pm_runtime_resume_and_get(struct xe_device *xe)
> >>>>>> {
> >>>>>> -	/* In case we are suspended we need to immediately wake up */
> >>>>>> -	if (pm_runtime_suspended(xe->drm.dev))
> >>>>>> -		return !pm_runtime_resume_and_get(xe->drm.dev);
> >>>>>> -
> >>>>>> -	return false;
> >>>>>> +	return !pm_runtime_resume_and_get(xe->drm.dev);
> >>>>>> }
> >>>>>> int xe_pm_runtime_get_if_active(struct xe_device *xe)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/xe/xe_pm.h b/drivers/gpu/drm/xe/xe_pm.h
> >>>>>> index 6a885585f653..e92c508d44b9 100644
> >>>>>> --- a/drivers/gpu/drm/xe/xe_pm.h
> >>>>>> +++ b/drivers/gpu/drm/xe/xe_pm.h
> >>>>>> @@ -19,7 +19,8 @@ int xe_pm_runtime_suspend(struct xe_device *xe);
> >>>>>> int xe_pm_runtime_resume(struct xe_device *xe);
> >>>>>> int xe_pm_runtime_get(struct xe_device *xe);
> >>>>>> int xe_pm_runtime_put(struct xe_device *xe);
> >>>>>> -bool xe_pm_runtime_resume_if_suspended(struct xe_device *xe);
> >>>>>> +bool xe_pm_runtime_resume_and_get(struct xe_device *xe);
> >>>>>> int xe_pm_runtime_get_if_active(struct xe_device *xe);
> >>>>>> +struct task_struct *xe_pm_read_callback_task(struct xe_device *xe);
> >>>>>> #endif