[Intel-xe] [PATCH v12 01/13] drm/xe: fix xe_device_mem_access_get() races

Gupta, Anshuman anshuman.gupta at intel.com
Tue Jul 11 17:56:20 UTC 2023



> -----Original Message-----
> From: Auld, Matthew <matthew.auld at intel.com>
> Sent: Tuesday, July 11, 2023 4:36 PM
> To: Gupta, Anshuman <anshuman.gupta at intel.com>; intel-
> xe at lists.freedesktop.org
> Cc: Vivi, Rodrigo <rodrigo.vivi at intel.com>; Thomas Hellström
> <thomas.hellstrom at linux.intel.com>; Brost, Matthew
> <matthew.brost at intel.com>
> Subject: Re: [PATCH v12 01/13] drm/xe: fix xe_device_mem_access_get()
> races
> 
> On 11/07/2023 10:00, Gupta, Anshuman wrote:
> >
> >
> >> -----Original Message-----
> >> From: Auld, Matthew <matthew.auld at intel.com>
> >> Sent: Tuesday, July 4, 2023 9:30 PM
> >> To: Gupta, Anshuman <anshuman.gupta at intel.com>; intel-
> >> xe at lists.freedesktop.org
> >> Cc: Vivi, Rodrigo <rodrigo.vivi at intel.com>; Thomas Hellström
> >> <thomas.hellstrom at linux.intel.com>; Brost, Matthew
> >> <matthew.brost at intel.com>
> >> Subject: Re: [PATCH v12 01/13] drm/xe: fix xe_device_mem_access_get()
> >> races
> >>
> >> On 04/07/2023 16:29, Gupta, Anshuman wrote:
> >>>
> >>>
> >>> On 7/4/2023 4:55 PM, Matthew Auld wrote:
> >>>> On 30/06/2023 16:22, Gupta, Anshuman wrote:
> >>>>>
> >>>>>
> >>>>> On 6/26/2023 4:20 PM, Matthew Auld wrote:
> >>>>>> It looks like there is at least one race here, given that the
> >>>>>> pm_runtime_suspended() check looks to return false if we are in
> >>>>>> the process of suspending the device (RPM_SUSPENDING vs
> >>>>>> RPM_SUSPENDED). We later also do xe_pm_runtime_get_if_active(),
> >>>>>> but since the device is suspending or has now suspended, this
> >>>>>> doesn't do anything either. Following from this we can
> >>>>>> potentially return from xe_device_mem_access_get() with the
> >>>>>> device suspended or about to be, leading to broken behaviour.
> >>>>>>
> >>>>>> Attempt to fix this by always grabbing the runtime ref when our
> >>>>>> internal ref transitions from 0 -> 1. The hard part is then
> >>>>>> dealing with the runtime_pm callbacks also calling
> >>>>>> xe_device_mem_access_get() and deadlocking, which the
> >>>>>> pm_runtime_suspended() check prevented.
> >>>>>>
> >>>>>> v2:
> >>>>>>   - ct->lock looks to be primed with fs_reclaim, so holding that
> >>>>>>     and then allocating memory will cause lockdep to complain. Now
> >>>>>>     that we unconditionally grab the mem_access.lock around
> >>>>>>     mem_access_{get,put}, we need to change the ordering wrt
> >>>>>>     grabbing the ct->lock, since some of the runtime_pm routines
> >>>>>>     can allocate memory (or at least that's what lockdep seems to
> >>>>>>     suggest). Hopefully not a big deal. It might be that there were
> >>>>>>     already issues with this, just that the atomics were "hiding"
> >>>>>>     the potential issues.
> >>>>>> v3:
> >>>>>>   - Use Thomas Hellström's idea of tracking the active task that is
> >>>>>>     executing in the resume or suspend callback, in order to avoid
> >>>>>>     recursive resume/suspend calls deadlocking on itself.
> >>>>>>   - Split out the ct->lock change.
> >>>>>> v4:
> >>>>>>   - Add smp_mb() around accessing the pm_callback_task for extra
> >>>>>>     safety. (Thomas Hellström)
> >>>>>> v5:
> >>>>>>   - Clarify the kernel-doc for the mem_access.lock, given that it
> >>>>>>     is quite strange in what it protects (data vs code). The real
> >>>>>>     motivation is to aid lockdep. (Rodrigo Vivi)
> >>>>>> v6:
> >>>>>>   - Split out the lock change. We still want this as a lockdep aid
> >>>>>>     but only for the xe_device_mem_access_get() path. Sticking a
> >>>>>>     lock on the put() looks to be a no-go, also the runtime_put()
> >>>>>>     there is always async.
> >>>>>>   - Now that the lock is gone, move to atomics and rely on the pm
> >>>>>>     code serialising multiple callers on the 0 -> 1 transition.
> >>>>>>   - g2h_worker_func() looks to be the next issue, given that
> >>>>>>     suspend-resume callbacks are using CT, so try to handle that.
> >>>>>> v7:
> >>>>>>   - Add xe_device_mem_access_get_if_ongoing(), and use it in
> >>>>>>     g2h_worker_func().
> >>>>>>
> >>>>>> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/258
> >>>>>> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
> >>>>>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> >>>>>> Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
> >>>>>> Cc: Matthew Brost <matthew.brost at intel.com>
> >>>>>> Cc: Anshuman Gupta <anshuman.gupta at intel.com>
> >>>>>> ---
> >>>>>>    drivers/gpu/drm/xe/xe_device.c       | 58 +++++++++++++++++++-----
> >>>>>>    drivers/gpu/drm/xe/xe_device.h       | 12 ++---
> >>>>>>    drivers/gpu/drm/xe/xe_device_types.h |  6 +++
> >>>>>>    drivers/gpu/drm/xe/xe_guc_ct.c       | 34 +++++++++++++-
> >>>>>>    drivers/gpu/drm/xe/xe_pm.c           | 66 +++++++++++++++++++---------
> >>>>>>    drivers/gpu/drm/xe/xe_pm.h           |  3 +-
> >>>>>>    6 files changed, 134 insertions(+), 45 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> >>>>>> index c7985af85a53..1dc552da434f 100644
> >>>>>> --- a/drivers/gpu/drm/xe/xe_device.c
> >>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
> >>>>>> @@ -411,27 +411,61 @@ u32 xe_device_ccs_bytes(struct xe_device *xe, u64 size)
> >>>>>>            DIV_ROUND_UP(size, NUM_BYTES_PER_CCS_BYTE) : 0;
> >>>>>>    }
> >>>>>> +bool xe_device_mem_access_ongoing(struct xe_device *xe)
> >>>>>> +{
> >>>>>> +    if (xe_pm_read_callback_task(xe) != NULL)
> >>>>>> +        return true;
> >>>>>> +
> >>>>>> +    return atomic_read(&xe->mem_access.ref);
> >>>>>> +}
> >>>>>> +
> >>>>>> +void xe_device_assert_mem_access(struct xe_device *xe)
> >>>>>> +{
> >>>>>> +    XE_WARN_ON(!xe_device_mem_access_ongoing(xe));
> >>>>>> +}
> >>>>>> +
> >>>>>> +bool xe_device_mem_access_get_if_ongoing(struct xe_device *xe)
> >>>>>> +{
> >>>>>> +    return atomic_inc_not_zero(&xe->mem_access.ref);
> >>>>>> +}
> >>>>>> +
> >>>>>>    void xe_device_mem_access_get(struct xe_device *xe)
> >>>>>>    {
> >>>>>> -    bool resumed = xe_pm_runtime_resume_if_suspended(xe);
> >>>>>> -    int ref = atomic_inc_return(&xe->mem_access.ref);
> >>>>>> +    /*
> >>>>>> +     * This looks racy, but should be fine since the pm_callback_task
> >>>>>> +     * only transitions from NULL -> current (and back to NULL again),
> >>>>>> +     * during the runtime_resume() or runtime_suspend() callbacks, for
> >>>>>> +     * which there can only be a single one running for our device. We
> >>>>>> +     * only need to prevent recursively calling the runtime_get or
> >>>>>> +     * runtime_put from those callbacks, as well as preventing
> >>>>>> +     * triggering any access_ongoing asserts.
> >>>>>> +     */
> >>>>> Two runtime_suspend() calls can run in parallel for two different
> >>>>> pci devices, since those worker threads are pooled by the pm_wq
> >>>>> workqueue. Is it not guaranteed that the task_struct will be
> >>>>> different for two workers spawned by the same pm_wq?
> >>>>
> >>>> IIUC we only use pm_wq for the async put(). If somehow the
> >>>> task_struct can be the same for different workers when using pm_wq
> >>>> (I'm not sure tbh), then I think it's fine, since all put() must be
> >>>> balanced with a get(), and all get() are handled directly in the
> >>>> caller (not pm_wq) since it's non-async, and must first wait for any
> >>>> running callback.
> >>> Agree with that.
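For reference, a minimal sketch of how the pm_callback_task helpers used
throughout the patch could look (the helper names come from the series;
the bodies below are an illustration only, assuming a pm_callback_task
pointer in struct xe_device):

static void xe_pm_write_callback_task(struct xe_device *xe,
                                      struct task_struct *task)
{
        WRITE_ONCE(xe->pm_callback_task, task);

        /* Pairs with the barrier in xe_pm_read_callback_task(). */
        smp_mb();
}

struct task_struct *xe_pm_read_callback_task(struct xe_device *xe)
{
        /* Pairs with the barrier in xe_pm_write_callback_task(). */
        smp_mb();

        return READ_ONCE(xe->pm_callback_task);
}

Since only the callback owner writes the pointer (NULL -> current ->
NULL), a reader comparing it against current can never see a false
positive for another task.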
> >>>>
> >>>>>
> >>>>>> +    if (xe_pm_read_callback_task(xe) == current)
> >>>>>> +        return;
> >>>>>>
> >>>>>> -    if (ref == 1)
> >>>>>> -        xe->mem_access.hold_rpm = xe_pm_runtime_get_if_active(xe);
> >>>>>> +    if (!atomic_inc_not_zero(&xe->mem_access.ref)) {
> >>>>>> +        bool hold_rpm = xe_pm_runtime_resume_and_get(xe);
> >>>>>> +        int ref;
> >>>>>>
> >>>>>> -    /* The usage counter increased if device was immediately resumed */
> >>>>>> -    if (resumed)
> >>>>>> -        xe_pm_runtime_put(xe);
> >>>>>> -
> >>>>>> -    XE_WARN_ON(ref == S32_MAX);
> >>>>>> +        ref = atomic_inc_return(&xe->mem_access.ref);
> >>>>>> +        if (ref == 1)
> >>>>>> +            xe->mem_access.hold_rpm = hold_rpm;
> >>>>>> +        else
> >>>>>> +            xe_pm_runtime_put(xe);
> >>> AFAIU the above is not possible?
> >>
> >> Do you mean the xe_pm_runtime_put() part? On the 0 -> 1 transition we
> >> just always call xe_pm_runtime_resume_and_get(). That serialises
> >> multiple callers such that exactly one runs our rpm resume callback
> >> and the rest all block waiting for it to transition into RPM_ACTIVE.
> >> However, once we return from xe_pm_runtime_resume_and_get() there are
> >> potentially several different callers holding an rpm ref. We only
> >> need the one rpm ref for our needs, so we simply drop the rest here,
> >> keeping the one that first incremented the ref.
> > Thanks for the explanation.
> > I am still not clear on how we are handling the case of an error
> > returned by xe_pm_runtime_resume_and_get(). If one of two simultaneous
> > calls of xe_device_mem_access_get() fails in
> > xe_pm_runtime_resume_and_get(), does it drop the rpm ref of the other
> > call that passed as well?
> 
> 
> I guess we could do something like:
> 
> -               if (ref == 1)
> +               if (ref == 1 && hold_rpm)
>                          xe->mem_access.hold_rpm = hold_rpm;
> -               else
> +               else if (hold_rpm)
>                          xe_pm_runtime_put(xe);
> 
> Or we can just always grab the rpm:
> 
> -       if (!atomic_inc_not_zero(&xe->mem_access.ref)) {
> -               bool hold_rpm = xe_pm_runtime_resume_and_get(xe);
> -               int ref;
> -
> -               ref = atomic_inc_return(&xe->mem_access.ref);
> -               if (ref == 1)
> -                       xe->mem_access.hold_rpm = hold_rpm;
> -               else
> -                       xe_pm_runtime_put(xe);
> -       } else {
> -               XE_WARN_ON(atomic_read(&xe->mem_access.ref) == S32_MAX);
> -       }
> +       xe_pm_runtime_resume_and_get(xe);
> +       ref = atomic_inc_return(&xe->mem_access.ref);
> +       XE_WARN_ON(ref == S32_MAX);
IMO this is simple and better.
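For reference, with that simplification applied the pair would read
roughly as below (a sketch assembled from the diff fragments above; the
lockdep map annotations visible in the surrounding context are omitted):

void xe_device_mem_access_get(struct xe_device *xe)
{
        int ref;

        /* Never recurse into the pm core from its own callbacks. */
        if (xe_pm_read_callback_task(xe) == current)
                return;

        xe_pm_runtime_resume_and_get(xe);
        ref = atomic_inc_return(&xe->mem_access.ref);
        XE_WARN_ON(ref == S32_MAX);
}

void xe_device_mem_access_put(struct xe_device *xe)
{
        int ref;

        if (xe_pm_read_callback_task(xe) == current)
                return;

        ref = atomic_dec_return(&xe->mem_access.ref);
        xe_pm_runtime_put(xe);

        XE_WARN_ON(ref < 0);
}

Every mem_access ref then holds its own rpm ref, so the hold_rpm
bookkeeping (and the put-side ordering it required) disappears.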
> 
>          lock_map_release(&xe_device_mem_access_lockdep_map);
>   }
> @@ -489,8 +480,7 @@ void xe_device_mem_access_put(struct xe_device *xe)
>                  return;
> 
>          ref = atomic_dec_return(&xe->mem_access.ref);
> -       if (ref == 0 && xe->mem_access.hold_rpm)
> -               xe_pm_runtime_put(xe);
> +       xe_pm_runtime_put(xe);
> 
> 
> And then also remove the hold_rpm stuff. But either way there is still
> the big issue of what to do if the rpm get fails. If it fails, as you
> say, the driver is still broken even with the above.
AFAIU we can't bail out from such a state; printing a kernel error will
suffice here. Referring to the runtime PM documentation:

* If the resume callback returns an error code, the PM core regards this as a
  fatal error and will refuse to run the helper functions described in Section
  4 for the device, until its status is directly set to either 'active', or
  'suspended' (by means of special helper functions provided by the PM core
  for this purpose).

https://www.kernel.org/doc/Documentation/power/runtime_pm.txt
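In practice that would mean something like this at the point where the
wake-up fails (a sketch only; the exact message and the use of drm_err()
are illustrative, not from this series):

        if (!xe_pm_runtime_resume_and_get(xe))
                drm_err(&xe->drm,
                        "runtime resume failed, device access will be broken\n");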
> 
> > That eventually provides access to the PCI BAR without RPM protection.
> > We should also check the value returned by
> > xe_pm_runtime_resume_and_get() before calling xe_pm_runtime_put(),
> > since pm_runtime_resume_and_get() also drops the device usage count in
> > the error case.
> 
> So do we already know how we want to handle such failures from the
> caller's pov? Currently there is no return value for
> xe_device_mem_access_get(), so the caller has zero clue, and any device
> access will end badly anyway. But I wasn't really trying to solve that
> in this series.
There is no way to handle a runtime PM error; my comment was just about
the nonessential xe_pm_runtime_put().
BR,
Anshuman Gupta.
> 
> Do we just want to return an error if access_get fails? If so, handling
> the above becomes quite simple, with something like the following in
> xe_device_mem_access_get():
> 
> +       if (!xe_pm_runtime_resume_and_get(xe))
> +               return false;
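If we went that route, the whole entry point might end up looking
roughly like this (a sketch only; the bool return and the early bail-out
are a proposal under discussion, not part of this series, and callers
would have to be converted to check it):

bool xe_device_mem_access_get(struct xe_device *xe)
{
        if (xe_pm_read_callback_task(xe) == current)
                return true;

        /*
         * Bail out if the device refused to wake up; any hardware
         * access after this point would be broken anyway.
         */
        if (!xe_pm_runtime_resume_and_get(xe))
                return false;

        XE_WARN_ON(atomic_inc_return(&xe->mem_access.ref) == S32_MAX);
        return true;
}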
> 
> > Br,
> > Anshuman Gupta.
> >>
> >>>>>> +    } else {
> >>>>>> +        XE_WARN_ON(atomic_read(&xe->mem_access.ref) == S32_MAX);
> >>>>>> +    }
> >>>>>>    }
> >>>>>>    void xe_device_mem_access_put(struct xe_device *xe)
> >>>>>>    {
> >>>>>> -    bool hold = xe->mem_access.hold_rpm;
> >>>>>> -    int ref = atomic_dec_return(&xe->mem_access.ref);
> >>>>>> +    int ref;
> >>>>>>
> >>>>>> -    if (!ref && hold)
> >>>>>> +    if (xe_pm_read_callback_task(xe) == current)
> >>>>>> +        return;
> >>>>>> +
> >>>>>> +    ref = atomic_dec_return(&xe->mem_access.ref);
> >>>>>> +    if (ref == 0 && xe->mem_access.hold_rpm)
> >>>>>>            xe_pm_runtime_put(xe);
> >>>>>>
> >>>>>>        XE_WARN_ON(ref < 0);
> >>>>> /snip
> >>>>>> +
> >>>>>>    int xe_pm_runtime_suspend(struct xe_device *xe)
> >>>>>>    {
> >>>>>>        struct xe_gt *gt;
> >>>>>>        u8 id;
> >>>>>> -    int err;
> >>>>>> +    int err = 0;
> >>>>>> +
> >>>>>> +    if (xe->d3cold_allowed && xe_device_mem_access_ongoing(xe))
> >>>>>> +        return -EBUSY;
> >>>>> Not related to this patch, but we should return -EBUSY for d3hot
> >>>>> as well.
> >>>>
> >>>> Looking at this again, access_ongoing is always going to be false
> >>>> here, right? On the 0 -> 1 transition we always do a full sync pm
> >>>> get before incrementing mem_access.ref, so I'm not sure this check
> >>>> actually does anything.
> >>> I believe this is a paranoia check against any unbalanced put.
> >>> Br,
> >>> Anshuman Gupta.
> >>>>
> >>>>> Br,
> >>>>> Anshuman Gupta
> >>>>>> +
> >>>>>> +    /* Disable access_ongoing asserts and prevent recursive pm calls */
> >>>>>> +    xe_pm_write_callback_task(xe, current);
> >>>>>>        if (xe->d3cold_allowed) {
> >>>>>> -        if (xe_device_mem_access_ongoing(xe))
> >>>>>> -            return -EBUSY;
> >>>>>> -
> >>>>>>            err = xe_bo_evict_all(xe);
> >>>>>>            if (err)
> >>>>>> -            return err;
> >>>>>> +            goto out;
> >>>>>>        }
> >>>>>>        for_each_gt(gt, xe, id) {
> >>>>>>            err = xe_gt_suspend(gt);
> >>>>>>            if (err)
> >>>>>> -            return err;
> >>>>>> +            goto out;
> >>>>>>        }
> >>>>>>        xe_irq_suspend(xe);
> >>>>>> -
> >>>>>> -    return 0;
> >>>>>> +out:
> >>>>>> +    xe_pm_write_callback_task(xe, NULL);
> >>>>>> +    return err;
> >>>>>>    }
> >>>>>>    int xe_pm_runtime_resume(struct xe_device *xe)
> >>>>>>    {
> >>>>>>        struct xe_gt *gt;
> >>>>>>        u8 id;
> >>>>>> -    int err;
> >>>>>> +    int err = 0;
> >>>>>> +
> >>>>>> +    /* Disable access_ongoing asserts and prevent recursive pm calls */
> >>>>>> +    xe_pm_write_callback_task(xe, current);
> >>>>>>        if (xe->d3cold_allowed) {
> >>>>>>            for_each_gt(gt, xe, id) {
> >>>>>>                err = xe_pcode_init(gt);
> >>>>>>                if (err)
> >>>>>> -                return err;
> >>>>>> +                goto out;
> >>>>>>            }
> >>>>>>            /*
> >>>>>> @@ -182,7 +210,7 @@ int xe_pm_runtime_resume(struct xe_device *xe)
> >>>>>>             */
> >>>>>>            err = xe_bo_restore_kernel(xe);
> >>>>>>            if (err)
> >>>>>> -            return err;
> >>>>>> +            goto out;
> >>>>>>        }
> >>>>>>        xe_irq_resume(xe);
> >>>>>> @@ -193,10 +221,11 @@ int xe_pm_runtime_resume(struct xe_device *xe)
> >>>>>>        if (xe->d3cold_allowed) {
> >>>>>>            err = xe_bo_restore_user(xe);
> >>>>>>            if (err)
> >>>>>> -            return err;
> >>>>>> +            goto out;
> >>>>>>        }
> >>>>>> -
> >>>>>> -    return 0;
> >>>>>> +out:
> >>>>>> +    xe_pm_write_callback_task(xe, NULL);
> >>>>>> +    return err;
> >>>>>>    }
> >>>>>>    int xe_pm_runtime_get(struct xe_device *xe)
> >>>>>> @@ -210,14 +239,9 @@ int xe_pm_runtime_put(struct xe_device *xe)
> >>>>>>        return pm_runtime_put_autosuspend(xe->drm.dev);
> >>>>>>    }
> >>>>>> -/* Return true if resume operation happened and usage count was increased */
> >>>>>> -bool xe_pm_runtime_resume_if_suspended(struct xe_device *xe)
> >>>>>> +bool xe_pm_runtime_resume_and_get(struct xe_device *xe)
> >>>>>>    {
> >>>>>> -    /* In case we are suspended we need to immediately wake up */
> >>>>>> -    if (pm_runtime_suspended(xe->drm.dev))
> >>>>>> -        return !pm_runtime_resume_and_get(xe->drm.dev);
> >>>>>> -
> >>>>>> -    return false;
> >>>>>> +    return !pm_runtime_resume_and_get(xe->drm.dev);
> >>>>>>    }
> >>>>>>    int xe_pm_runtime_get_if_active(struct xe_device *xe)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/xe/xe_pm.h b/drivers/gpu/drm/xe/xe_pm.h
> >>>>>> index 6a885585f653..e92c508d44b9 100644
> >>>>>> --- a/drivers/gpu/drm/xe/xe_pm.h
> >>>>>> +++ b/drivers/gpu/drm/xe/xe_pm.h
> >>>>>> @@ -19,7 +19,8 @@ int xe_pm_runtime_suspend(struct xe_device *xe);
> >>>>>>    int xe_pm_runtime_resume(struct xe_device *xe);
> >>>>>>    int xe_pm_runtime_get(struct xe_device *xe);
> >>>>>>    int xe_pm_runtime_put(struct xe_device *xe);
> >>>>>> -bool xe_pm_runtime_resume_if_suspended(struct xe_device *xe);
> >>>>>> +bool xe_pm_runtime_resume_and_get(struct xe_device *xe);
> >>>>>>    int xe_pm_runtime_get_if_active(struct xe_device *xe);
> >>>>>> +struct task_struct *xe_pm_read_callback_task(struct xe_device *xe);
> >>>>>>    #endif

