[Intel-xe] [PATCH 2/3] drm/xe: fix xe_device_mem_access_get() race

Matthew Auld matthew.auld at intel.com
Mon May 15 10:15:23 UTC 2023


On 15/05/2023 10:22, Thomas Hellström wrote:
> 
> On 5/15/23 11:04, Matthew Auld wrote:
>> On 15/05/2023 09:44, Thomas Hellström wrote:
>>> Hi, Matthew,
>>>
>>> On 5/12/23 17:56, Matthew Auld wrote:
>>>> On 12/05/2023 15:32, Rodrigo Vivi wrote:
>>>>> On Thu, May 11, 2023 at 12:43:36PM +0100, Matthew Auld wrote:
>>>>>> On 05/05/2023 17:44, Rodrigo Vivi wrote:
>>>>>>> On Fri, May 05, 2023 at 04:38:53PM +0100, Matthew Auld wrote:
>>>>>>>> It looks like there is at least one race here, given that the
>>>>>>>> pm_runtime_suspended() check looks to return false if we are in the
>>>>>>>> process of suspending the device (RPM_SUSPENDING vs 
>>>>>>>> RPM_SUSPENDED). We
>>>>>>>> later also do xe_pm_runtime_get_if_active(), but since the 
>>>>>>>> device is
>>>>>>>> suspending or has now suspended, this doesn't do anything either.
>>>>>>>> Following from this we can potentially return from
>>>>>>>> xe_device_mem_access_get() with the device suspended or about to 
>>>>>>>> be,
>>>>>>>> leading to broken behaviour.
>>>>>>>>
>>>>>>>> Attempt to fix this by always grabbing the runtime ref when our 
>>>>>>>> internal
>>>>>>>> ref transitions from 0 -> 1, and then wrap the whole thing with 
>>>>>>>> a lock
>>>>>>>> to ensure callers are serialized.
>>>>>>>>
>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/258
>>>>>>>> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
>>>>>>>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>>>>>>>> Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
>>>>>>>> Cc: Matthew Brost <matthew.brost at intel.com>
>>>>>>>> ---
>>>>>>>>    drivers/gpu/drm/xe/xe_device.c       | 11 +++--------
>>>>>>>>    drivers/gpu/drm/xe/xe_device_types.h |  5 ++++-
>>>>>>>>    drivers/gpu/drm/xe/xe_pm.c           |  9 ++-------
>>>>>>>>    drivers/gpu/drm/xe/xe_pm.h           |  2 +-
>>>>>>>>    4 files changed, 10 insertions(+), 17 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>>>>>>>> index 01c497bcf9a5..0a18b41a0e1a 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>>>>>>> @@ -406,17 +406,12 @@ u32 xe_device_ccs_bytes(struct xe_device *xe, u64 size)
>>>>>>>>    void xe_device_mem_access_get(struct xe_device *xe)
>>>>>>>>    {
>>>>>>>> -    bool resumed = xe_pm_runtime_resume_if_suspended(xe);
>>>>>>>> -
>>>>>>>>        mutex_lock(&xe->mem_access.lock);
>>>>>>>> -    if (xe->mem_access.ref++ == 0)
>>>>>>>> -        xe->mem_access.hold_rpm = xe_pm_runtime_get_if_active(xe);
>>>>>>>> +    if (xe->mem_access.ref == 0)
>>>>>>>> +        xe->mem_access.hold_rpm = xe_pm_runtime_resume_and_get(xe);
>>>>>>>> +    xe->mem_access.ref++;
>>>>>>>
>>>>>>> my memory is good enough to tell me that I have tried this before...
>>>>>>> but not good enough to remember the issues I ran into with this
>>>>>>> approach :(
>>>>>>
>>>>>> Ok, it seems one big issue is around xe_pm_runtime_resume() et al.
>>>>>> The lock fixes all the races, but AFAICT xe_pm_runtime_resume()
>>>>>> calls xe_device_mem_access_{get,put}() in loads of places, and that
>>>>>> is ofc going to deadlock if we introduce a lock, since we are inside
>>>>>> the callback. But I think even without that lock it will still
>>>>>> deadlock, since the runtime_pm code will see that we are PM_RESUMING
>>>>>> and wait for itself. I'm guessing that explains why we had the
>>>>>> conditional pm_runtime_suspended() and if_active(), since that
>>>>>> prevents triggering the runtime_pm from our callbacks (we will
>>>>>> either be PM_SUSPENDING or PM_RESUMING),
>>>>>
>>>>> Yes, that was the goal.
>>>>>
>>>>>> but then we are
>>>>>> ofc left with all the nasty races.
>>>>>
>>>>> :(
>>>>>
>>>>>> Any ideas? It seems like the
>>>>>> resume/suspend callbacks should fundamentally never be calling
>>>>>> xe_device_mem_access_{get,put}()?
>>>>>
>>>>> We probably need something like the pseudo code in the end of
>>>>> Documentation/power/runtime_pm.rst
>>>>
>>>> IIUC that looks like a template for where you have lots of incoming 
>>>> IO requests, and you want to process them in an async manner. But to 
>>>> process the request you need to also keep the device awake and 
>>>> ensure it has resumed, so here it makes sense to use the RPM_ASYNC 
>>>> runtime_get. The ->resume() callback then has a natural place to
>>>> process the outstanding IO requests at some later point.
>>>>
>>>> But for xe_device_mem_access_get() I don't think it can be async, 
>>>> since we need to ensure the ->resume() callback has already 
>>>> completed before returning to the caller. From the pov of 
>>>> xe_device_mem_access_get() it doesn't know if it's being called from 
>>>> ->resume() or some other normal path and yet it seems like it 
>>>> somehow needs to differentiate them. I feel like either the 
>>>> ->resume() should never have been allowed to call it in the first 
>>>> place, or there needs to be some token that always gets passed around.
>>>>
>>>> But maybe I'm misunderstanding something here, or at least I'm not 
>>>> currently seeing the connection with the pseudo code? Can you share 
>>>> more?
>>>
>>> Some fly-by ideas here:
>>>
>>> With the variant we briefly discussed before, only locking during the
>>> 0 -> 1 and 1 -> 0 transitions using atomic_inc_not_zero() and
>>> atomic_add_unless(), the get() problem goes away, but any recursive
>>> get() and put() during put()->suspend will then cause similar problems?
>>
>> Yeah, I was thinking something like that, but AFAICT the runtime_pm 
>> code will see that it is already in a PM_RESUMING state and wait for 
>> itself to exit from that state, which also deadlocks.
> 
> Ah, ok, so everything except the recursive calls from resume() needs to 
> wait for PM_RESUMING in order to work correctly?

Yes, that is at least my understanding, assuming a synchronous get() 
call. With async pm it doesn't need to wait.
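
For reference, roughly the shape I think that variant would take (just a
sketch, untested, and it assumes mem_access.ref is switched over to
atomic_t):

void xe_device_mem_access_get(struct xe_device *xe)
{
	/* Fast path: device already awake (ref > 0), just bump the ref. */
	if (atomic_inc_not_zero(&xe->mem_access.ref))
		return;

	/* Slow path: serialise the 0 -> 1 transition and wake the device. */
	mutex_lock(&xe->mem_access.lock);
	if (atomic_read(&xe->mem_access.ref) == 0)
		xe->mem_access.hold_rpm = xe_pm_runtime_resume_and_get(xe);
	atomic_inc(&xe->mem_access.ref);
	mutex_unlock(&xe->mem_access.lock);
}

But as per the above, any get() issued from inside our resume() or
suspend() callbacks still lands in the slow path and then waits on the
very callback it was called from (either on the lock or inside
pm_runtime_resume_and_get()), so it doesn't help with the recursive case.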

> 
>>
>>>
>>> While a bit hacky, that could probably be solved by having an
>>>
>>> xe_device::suspending_task set and cleared by the suspending task and 
>>> then use
>>>
>>> if (xe->suspending_task == current)
>>
>> I didn't think of that tbh. It does seem hacky, but I think it should 
>> work, and is perhaps fine for now?
>>
> Yes, I guess it has to be, at least until we come up with something better.

I'll try to send something later today.
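
Roughly what I have in mind (untested sketch; pm_callback_task is just a
placeholder name for the suspending_task idea, since we seem to need the
same skip from the resume() callback too, and xe_device_mem_access_put()
would need the matching check):

void xe_device_mem_access_get(struct xe_device *xe)
{
	/*
	 * If we are being called from inside our own runtime_pm
	 * suspend()/resume() callback, skip the runtime_pm handling
	 * entirely, otherwise we end up waiting on ourselves to leave
	 * RPM_SUSPENDING/RPM_RESUMING. The field is only ever set and
	 * cleared by the callback task itself.
	 */
	if (READ_ONCE(xe->mem_access.pm_callback_task) == current)
		return;

	mutex_lock(&xe->mem_access.lock);
	if (xe->mem_access.ref == 0)
		xe->mem_access.hold_rpm = xe_pm_runtime_resume_and_get(xe);
	xe->mem_access.ref++;
	mutex_unlock(&xe->mem_access.lock);

	XE_WARN_ON(xe->mem_access.ref == S32_MAX);
}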

> 
> /Thomas
> 
> 
>>>
>>> to skip pm calls during suspend. Assuming of course that the 
>>> suspend() tasks for a single device never run concurrently.
>>
>> Yes, AFAICT only one resume or suspend callback can be running for a 
>> device, until we exit from PM_SUSPENDING or PM_RESUMING.
>>
>>>
>>> /Thomas
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>        mutex_unlock(&xe->mem_access.lock);
>>>>>>>> -    /* The usage counter increased if device was immediately resumed */
>>>>>>>> -    if (resumed)
>>>>>>>> -        xe_pm_runtime_put(xe);
>>>>>>>> -
>>>>>>>>        XE_WARN_ON(xe->mem_access.ref == S32_MAX);
>>>>>>>>    }
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>>>>>>>> index 59462933f67a..9e37189d5745 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>>>>>>>> @@ -256,7 +256,10 @@ struct xe_device {
>>>>>>>>         * triggering additional actions when they occur.
>>>>>>>>         */
>>>>>>>>        struct {
>>>>>>>> -        /** @lock: protect the ref count */
>>>>>>>> +        /**
>>>>>>>> +         * @lock: Serialize xe_device_mem_access users,
>>>>>>>> +         * and protect the below internal state, like @ref.
>>>>>>>> +         */
>>>>>>>>            struct mutex lock;
>>>>>>>>            /** @ref: ref count of memory accesses */
>>>>>>>>            s32 ref;
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
>>>>>>>> index b7b57f10ba25..b2ffa001e6f7 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_pm.c
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pm.c
>>>>>>>> @@ -210,14 +210,9 @@ int xe_pm_runtime_put(struct xe_device *xe)
>>>>>>>>        return pm_runtime_put_autosuspend(xe->drm.dev);
>>>>>>>>    }
>>>>>>>> -/* Return true if resume operation happened and usage count was increased */
>>>>>>>> -bool xe_pm_runtime_resume_if_suspended(struct xe_device *xe)
>>>>>>>> +bool xe_pm_runtime_resume_and_get(struct xe_device *xe)
>>>>>>>>    {
>>>>>>>> -    /* In case we are suspended we need to immediately wake up */
>>>>>>>> -    if (pm_runtime_suspended(xe->drm.dev))
>>>>>>>> -        return !pm_runtime_resume_and_get(xe->drm.dev);
>>>>>>>> -
>>>>>>>> -    return false;
>>>>>>>> +    return !pm_runtime_resume_and_get(xe->drm.dev);
>>>>>>>
>>>>>>> now that the name is similar it feels strange that we are not
>>>>>>> aligned with their return value, although I prefer ours...
>>>>>>>
>>>>>>> Anyway, the code is right... if you are testing it and it is working
>>>>>>> well, let's move ahead with this.
>>>>>>>
>>>>>>> Reviewed-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
>>>>>>> (for the series)
>>>>>>>
>>>>>>> but let's get an Ack from Maarten since he was kept as author on 
>>>>>>> patch 3
>>>>>>> and it is modified from his merged one.
>>>>>>>
>>>>>>>>    }
>>>>>>>>    int xe_pm_runtime_get_if_active(struct xe_device *xe)
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pm.h b/drivers/gpu/drm/xe/xe_pm.h
>>>>>>>> index 6a885585f653..1b4c15b5e71a 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_pm.h
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_pm.h
>>>>>>>> @@ -19,7 +19,7 @@ int xe_pm_runtime_suspend(struct xe_device *xe);
>>>>>>>>    int xe_pm_runtime_resume(struct xe_device *xe);
>>>>>>>>    int xe_pm_runtime_get(struct xe_device *xe);
>>>>>>>>    int xe_pm_runtime_put(struct xe_device *xe);
>>>>>>>> -bool xe_pm_runtime_resume_if_suspended(struct xe_device *xe);
>>>>>>>> +bool xe_pm_runtime_resume_and_get(struct xe_device *xe);
>>>>>>>>    int xe_pm_runtime_get_if_active(struct xe_device *xe);
>>>>>>>>    #endif
>>>>>>>> -- 
>>>>>>>> 2.40.0
>>>>>>>>

