[Intel-xe] [PATCH v12 01/13] drm/xe: fix xe_device_mem_access_get() races

Gupta, Anshuman anshuman.gupta at intel.com
Tue Jul 11 09:00:55 UTC 2023



> -----Original Message-----
> From: Auld, Matthew <matthew.auld at intel.com>
> Sent: Tuesday, July 4, 2023 9:30 PM
> To: Gupta, Anshuman <anshuman.gupta at intel.com>; intel-
> xe at lists.freedesktop.org
> Cc: Vivi, Rodrigo <rodrigo.vivi at intel.com>; Thomas Hellström
> <thomas.hellstrom at linux.intel.com>; Brost, Matthew
> <matthew.brost at intel.com>
> Subject: Re: [PATCH v12 01/13] drm/xe: fix xe_device_mem_access_get() races
> 
> On 04/07/2023 16:29, Gupta, Anshuman wrote:
> >
> >
> > On 7/4/2023 4:55 PM, Matthew Auld wrote:
> >> On 30/06/2023 16:22, Gupta, Anshuman wrote:
> >>>
> >>>
> >>> On 6/26/2023 4:20 PM, Matthew Auld wrote:
> >>>> It looks like there is at least one race here, given that the
> >>>> pm_runtime_suspended() check looks to return false if we are in the
> >>>> process of suspending the device (RPM_SUSPENDING vs
> RPM_SUSPENDED).
> >>>> We later also do xe_pm_runtime_get_if_active(), but since the
> >>>> device is suspending or has now suspended, this doesn't do anything either.
> >>>> Following from this we can potentially return from
> >>>> xe_device_mem_access_get() with the device suspended or about to
> >>>> be, leading to broken behaviour.
> >>>>
> >>>> Attempt to fix this by always grabbing the runtime ref when our
> >>>> internal ref transitions from 0 -> 1. The hard part is then dealing
> >>>> with the runtime_pm callbacks also calling
> >>>> xe_device_mem_access_get() and deadlocking, which the
> >>>> pm_runtime_suspended() check prevented.
> >>>>
> >>>> v2:
> >>>>   - ct->lock looks to be primed with fs_reclaim, so holding that and then
> >>>>     allocating memory will cause lockdep to complain. Now that we
> >>>>     unconditionally grab the mem_access.lock around mem_access_{get,put},
> >>>>     we need to change the ordering wrt grabbing the ct->lock, since some
> >>>>     of the runtime_pm routines can allocate memory (or at least that's
> >>>>     what lockdep seems to suggest). Hopefully not a big deal. It might be
> >>>>     that there were already issues with this, just that the atomics were
> >>>>     "hiding" the potential issues.
> >>>> v3:
> >>>>   - Use Thomas Hellström's idea of tracking the active task that is
> >>>>     executing in the resume or suspend callback, in order to avoid
> >>>>     recursive resume/suspend calls deadlocking on itself.
> >>>>   - Split the ct->lock change.
> >>>> v4:
> >>>>   - Add smp_mb() around accessing the pm_callback_task for extra safety.
> >>>>     (Thomas Hellström)
> >>>> v5:
> >>>>   - Clarify the kernel-doc for the mem_access.lock, given that it is
> >>>>     quite strange in what it protects (data vs code). The real motivation
> >>>>     is to aid lockdep. (Rodrigo Vivi)
> >>>> v6:
> >>>>   - Split out the lock change. We still want this as a lockdep aid, but
> >>>>     only for the xe_device_mem_access_get() path. Sticking a lock on the
> >>>>     put() looks to be a no-go; also, the runtime_put() there is always
> >>>>     async.
> >>>>   - Now that the lock is gone, move to atomics and rely on the pm code
> >>>>     serialising multiple callers on the 0 -> 1 transition.
> >>>>   - g2h_worker_func() looks to be the next issue, given that
> >>>>     suspend-resume callbacks are using CT, so try to handle that.
> >>>> v7:
> >>>>   - Add xe_device_mem_access_get_if_ongoing(), and use it in
> >>>>     g2h_worker_func().
> >>>>
> >>>> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/258
> >>>> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
> >>>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> >>>> Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
> >>>> Cc: Matthew Brost <matthew.brost at intel.com>
> >>>> Cc: Anshuman Gupta <anshuman.gupta at intel.com>
> >>>> ---
> >>>>   drivers/gpu/drm/xe/xe_device.c       | 58 +++++++++++++++++++-----
> >>>>   drivers/gpu/drm/xe/xe_device.h       | 12 ++---
> >>>>   drivers/gpu/drm/xe/xe_device_types.h |  6 +++
> >>>>   drivers/gpu/drm/xe/xe_guc_ct.c       | 34 +++++++++++++-
> >>>>   drivers/gpu/drm/xe/xe_pm.c           | 66 +++++++++++++++++++---------
> >>>>   drivers/gpu/drm/xe/xe_pm.h           |  3 +-
> >>>>   6 files changed, 134 insertions(+), 45 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> >>>> index c7985af85a53..1dc552da434f 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_device.c
> >>>> +++ b/drivers/gpu/drm/xe/xe_device.c
> >>>> @@ -411,27 +411,61 @@ u32 xe_device_ccs_bytes(struct xe_device *xe, u64 size)
> >>>>           DIV_ROUND_UP(size, NUM_BYTES_PER_CCS_BYTE) : 0;
> >>>>   }
> >>>> +bool xe_device_mem_access_ongoing(struct xe_device *xe)
> >>>> +{
> >>>> +    if (xe_pm_read_callback_task(xe) != NULL)
> >>>> +        return true;
> >>>> +
> >>>> +    return atomic_read(&xe->mem_access.ref);
> >>>> +}
> >>>> +
> >>>> +void xe_device_assert_mem_access(struct xe_device *xe)
> >>>> +{
> >>>> +    XE_WARN_ON(!xe_device_mem_access_ongoing(xe));
> >>>> +}
> >>>> +
> >>>> +bool xe_device_mem_access_get_if_ongoing(struct xe_device *xe)
> >>>> +{
> >>>> +    return atomic_inc_not_zero(&xe->mem_access.ref);
> >>>> +}
> >>>> +
> >>>>   void xe_device_mem_access_get(struct xe_device *xe)
> >>>>   {
> >>>> -    bool resumed = xe_pm_runtime_resume_if_suspended(xe);
> >>>> -    int ref = atomic_inc_return(&xe->mem_access.ref);
> >>>> +    /*
> >>>> +     * This looks racy, but should be fine since the pm_callback_task only
> >>>> +     * transitions from NULL -> current (and back to NULL again), during the
> >>>> +     * runtime_resume() or runtime_suspend() callbacks, for which there can
> >>>> +     * only be a single one running for our device. We only need to prevent
> >>>> +     * recursively calling the runtime_get or runtime_put from those
> >>>> +     * callbacks, as well as preventing triggering any access_ongoing
> >>>> +     * asserts.
> >>>> +     */
> >>> Two runtime_suspend() calls can run in parallel for two different PCI
> >>> devices, with the worker threads pooled by the pm_wq workqueue. Is it
> >>> guaranteed that the task_struct will be different for two workers
> >>> spawned by the same pm_wq?
> >>
> >> IIUC we only use pm_wq for the async put(). If somehow the task_struct
> >> can be the same for different workers when using pm_wq (I'm not sure
> >> tbh), then I think it's fine, since every put() must be balanced with a
> >> get(), and all get() calls are handled directly in the caller (not
> >> pm_wq) since the get is non-async and must first wait for any running
> >> callback.
> > Agree with that.
> >>
> >>>
> >>>> +    if (xe_pm_read_callback_task(xe) == current)
> >>>> +        return;
> >>>>
> >>>> -    if (ref == 1)
> >>>> -        xe->mem_access.hold_rpm = xe_pm_runtime_get_if_active(xe);
> >>>> +    if (!atomic_inc_not_zero(&xe->mem_access.ref)) {
> >>>> +        bool hold_rpm = xe_pm_runtime_resume_and_get(xe);
> >>>> +        int ref;
> >>>>
> >>>> -    /* The usage counter increased if device was immediately resumed */
> >>>> -    if (resumed)
> >>>> -        xe_pm_runtime_put(xe);
> >>>> -
> >>>> -    XE_WARN_ON(ref == S32_MAX);
> >>>> +        ref = atomic_inc_return(&xe->mem_access.ref);
> >>>> +        if (ref == 1)
> >>>> +            xe->mem_access.hold_rpm = hold_rpm;
> >>>> +        else
> >>>> +            xe_pm_runtime_put(xe);
> > AFAIU above is not possible ?
> 
> Do you mean the xe_pm_runtime_put() part? On the 0 -> 1 transition we just
> always call xe_pm_runtime_resume_and_get(). That serialises multiple callers
> such that exactly one runs our rpm resume callback and the rest all block waiting
> for it to transition into RPM_ACTIVE. However once we return from
> xe_pm_runtime_resume_and_get() there are potentially several different callers
> holding an rpm ref. We only need the one rpm ref for our needs, so we simply
> drop the rest here keeping the one that first incremented the ref.
Thanks for the explanation.
I am still not clear on how we handle the case where xe_pm_runtime_resume_and_get() returns an error.
If one of two simultaneous calls to xe_device_mem_access_get() fails in xe_pm_runtime_resume_and_get(),
does it then drop the rpm ref taken by the other, successful call as well?
That would eventually allow access to the PCI BAR without runtime PM protection.
We should also check the value returned by xe_pm_runtime_resume_and_get() before calling
xe_pm_runtime_put(), since pm_runtime_resume_and_get() already drops the device usage count in the
error case.
Br,
Anshuman Gupta.
> 
> >>>> +    } else {
> >>>> +        XE_WARN_ON(atomic_read(&xe->mem_access.ref) == S32_MAX);
> >>>> +    }
> >>>>   }
> >>>>   void xe_device_mem_access_put(struct xe_device *xe)
> >>>>   {
> >>>> -    bool hold = xe->mem_access.hold_rpm;
> >>>> -    int ref = atomic_dec_return(&xe->mem_access.ref);
> >>>> +    int ref;
> >>>>
> >>>> -    if (!ref && hold)
> >>>> +    if (xe_pm_read_callback_task(xe) == current)
> >>>> +        return;
> >>>> +
> >>>> +    ref = atomic_dec_return(&xe->mem_access.ref);
> >>>> +    if (ref == 0 && xe->mem_access.hold_rpm)
> >>>>           xe_pm_runtime_put(xe);
> >>>>
> >>>>       XE_WARN_ON(ref < 0);
> >>> /snip
> >>>> +
> >>>>   int xe_pm_runtime_suspend(struct xe_device *xe)
> >>>>   {
> >>>>       struct xe_gt *gt;
> >>>>       u8 id;
> >>>> -    int err;
> >>>> +    int err = 0;
> >>>> +
> >>>> +    if (xe->d3cold_allowed && xe_device_mem_access_ongoing(xe))
> >>>> +        return -EBUSY;
> >>> Not related to this patch, but we should return -EBUSY even for d3hot
> >>> as well.
> >>
> >> Looking at this again, access_ongoing is always going to be false
> >> here, right? On the 0 -> 1 transition we always do a full sync pm get
> >> before incrementing mem_access.ref, so I'm not sure this check
> >> actually does anything.
> > I believe this is a paranoia check against any unbalanced put.
> > Br,
> > Anshuman Gupta.
> >>
> >>> Br,
> >>> Anshuman Gupta
> >>>> +
> >>>> +    /* Disable access_ongoing asserts and prevent recursive pm calls */
> >>>> +    xe_pm_write_callback_task(xe, current);
> >>>>       if (xe->d3cold_allowed) {
> >>>> -        if (xe_device_mem_access_ongoing(xe))
> >>>> -            return -EBUSY;
> >>>> -
> >>>>           err = xe_bo_evict_all(xe);
> >>>>           if (err)
> >>>> -            return err;
> >>>> +            goto out;
> >>>>       }
> >>>>       for_each_gt(gt, xe, id) {
> >>>>           err = xe_gt_suspend(gt);
> >>>>           if (err)
> >>>> -            return err;
> >>>> +            goto out;
> >>>>       }
> >>>>       xe_irq_suspend(xe);
> >>>> -
> >>>> -    return 0;
> >>>> +out:
> >>>> +    xe_pm_write_callback_task(xe, NULL);
> >>>> +    return err;
> >>>>   }
> >>>>   int xe_pm_runtime_resume(struct xe_device *xe)
> >>>>   {
> >>>>       struct xe_gt *gt;
> >>>>       u8 id;
> >>>> -    int err;
> >>>> +    int err = 0;
> >>>> +
> >>>> +    /* Disable access_ongoing asserts and prevent recursive pm calls */
> >>>> +    xe_pm_write_callback_task(xe, current);
> >>>>       if (xe->d3cold_allowed) {
> >>>>           for_each_gt(gt, xe, id) {
> >>>>               err = xe_pcode_init(gt);
> >>>>               if (err)
> >>>> -                return err;
> >>>> +                goto out;
> >>>>           }
> >>>>           /*
> >>>> @@ -182,7 +210,7 @@ int xe_pm_runtime_resume(struct xe_device *xe)
> >>>>            */
> >>>>           err = xe_bo_restore_kernel(xe);
> >>>>           if (err)
> >>>> -            return err;
> >>>> +            goto out;
> >>>>       }
> >>>>       xe_irq_resume(xe);
> >>>> @@ -193,10 +221,11 @@ int xe_pm_runtime_resume(struct xe_device *xe)
> >>>>       if (xe->d3cold_allowed) {
> >>>>           err = xe_bo_restore_user(xe);
> >>>>           if (err)
> >>>> -            return err;
> >>>> +            goto out;
> >>>>       }
> >>>> -
> >>>> -    return 0;
> >>>> +out:
> >>>> +    xe_pm_write_callback_task(xe, NULL);
> >>>> +    return err;
> >>>>   }
> >>>>   int xe_pm_runtime_get(struct xe_device *xe)
> >>>> @@ -210,14 +239,9 @@ int xe_pm_runtime_put(struct xe_device *xe)
> >>>>       return pm_runtime_put_autosuspend(xe->drm.dev);
> >>>>   }
> >>>> -/* Return true if resume operation happened and usage count was increased */
> >>>> -bool xe_pm_runtime_resume_if_suspended(struct xe_device *xe)
> >>>> +bool xe_pm_runtime_resume_and_get(struct xe_device *xe)
> >>>>   {
> >>>> -    /* In case we are suspended we need to immediately wake up */
> >>>> -    if (pm_runtime_suspended(xe->drm.dev))
> >>>> -        return !pm_runtime_resume_and_get(xe->drm.dev);
> >>>> -
> >>>> -    return false;
> >>>> +    return !pm_runtime_resume_and_get(xe->drm.dev);
> >>>>   }
> >>>>   int xe_pm_runtime_get_if_active(struct xe_device *xe)
> >>>> diff --git a/drivers/gpu/drm/xe/xe_pm.h b/drivers/gpu/drm/xe/xe_pm.h
> >>>> index 6a885585f653..e92c508d44b9 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_pm.h
> >>>> +++ b/drivers/gpu/drm/xe/xe_pm.h
> >>>> @@ -19,7 +19,8 @@ int xe_pm_runtime_suspend(struct xe_device *xe);
> >>>>   int xe_pm_runtime_resume(struct xe_device *xe);
> >>>>   int xe_pm_runtime_get(struct xe_device *xe);
> >>>>   int xe_pm_runtime_put(struct xe_device *xe);
> >>>> -bool xe_pm_runtime_resume_if_suspended(struct xe_device *xe);
> >>>> +bool xe_pm_runtime_resume_and_get(struct xe_device *xe);
> >>>>   int xe_pm_runtime_get_if_active(struct xe_device *xe);
> >>>> +struct task_struct *xe_pm_read_callback_task(struct xe_device *xe);
> >>>>   #endif

