[PATCH] drm/amdkfd: restore_process_worker race with GPU reset
Felix Kuehling
felix.kuehling at amd.com
Sat Aug 31 03:43:17 UTC 2024
On 2024-08-29 18:16, Philip Yang wrote:
>
> On 2024-08-29 17:15, Felix Kuehling wrote:
>> On 2024-08-23 15:49, Philip Yang wrote:
>>> If a GPU reset kicks in while the KFD restore_process_worker is running,
>>> it can cause various issues, for example the RCU stall warning below,
>>> because the restore work may move BOs and evict queues under VRAM
>>> pressure.
>>>
>>> Fix this race by taking the adev reset_domain read semaphore in
>>> restore_process_worker to prevent a GPU reset; the read semaphore can
>>> be taken recursively if adev has multiple partitions.
>>>
>>> This leaves a live-lock if the CP hangs while restore_process_worker
>>> runs: the GPU reset waits for the semaphore before it can start, and
>>> restore_process_worker cannot finish to release the semaphore. Solve
>>> the live-lock by signaling the eviction fence when evicting queues
>>> returns -ETIMEDOUT (for the MES path) or -ETIME (for the HWS path)
>>> because the CP hangs.
>>>
>>> amdgpu 0000:af:00.0: amdgpu: GPU reset(21) succeeded!
>>> rcu: INFO: rcu_sched self-detected stall on CPU
>>>
>>> Workqueue: kfd_restore_wq restore_process_worker [amdgpu]
>>> Call Trace:
>>> update_process_times+0x94/0xd0
>>> RIP: 0010:amdgpu_vm_handle_moved+0x9a/0x210 [amdgpu]
>>> amdgpu_amdkfd_gpuvm_restore_process_bos+0x3d6/0x7d0 [amdgpu]
>>> restore_process_helper+0x27/0x80 [amdgpu]
>>>
>>> Signed-off-by: Philip Yang <Philip.Yang at amd.com>
>>
>> See comments inline. I'd also like Christian to take a look at this patch since he's the expert on the reset locking stuff.
>>
>>
>>> ---
>>> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 56 +++++++++++++++++++++++-
>>> 1 file changed, 55 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> index a902950cc060..53a814347522 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> @@ -35,6 +35,7 @@
>>> #include <linux/pm_runtime.h>
>>> #include "amdgpu_amdkfd.h"
>>> #include "amdgpu.h"
>>> +#include "amdgpu_reset.h"
>>> struct mm_struct;
>>> @@ -1972,8 +1973,14 @@ static void evict_process_worker(struct work_struct *work)
>>> kfd_process_restore_queues(p);
>>> pr_debug("Finished evicting pasid 0x%x\n", p->pasid);
>>> - } else
>>> + } else if (ret == -ETIMEDOUT || ret == -ETIME) {
>>> + /* If the CP hangs, signal the eviction fence so restore_process_worker
>>> + * can finish and up_read the GPU reset semaphore, letting the reset start.
>>> + */
>>> + signal_eviction_fence(p);
>>> + } else {
>>> pr_err("Failed to evict queues of pasid 0x%x\n", p->pasid);
>>> + }
>>> }
>>> static int restore_process_helper(struct kfd_process *p)
>>> @@ -1997,6 +2004,45 @@ static int restore_process_helper(struct kfd_process *p)
>>> return ret;
>>> }
>>> +/*
>>> + * kfd_hold_devices_reset_semaphore
>>> + *
>>> + * return:
>>> + * true : hold reset domain semaphore to prevent device reset
>>> + * false: one of the devices is resetting or has already been reset
>>> + *
>>> + */
>>> +static bool kfd_hold_devices_reset_semaphore(struct kfd_process *p)
>>
>> I find the naming of these functions (hold/unhold) a bit weird. I'd suggest kfd_process_trylock_reset_sems/kfd_process_unlock_reset_sems.
> ok
>>
>>
>>> +{
>>> + struct amdgpu_device *adev;
>>> + int i;
>>> +
>>> + for (i = 0; i < p->n_pdds; i++) {
>>> + adev = p->pdds[i]->dev->adev;
>>> + if (!down_read_trylock(&adev->reset_domain->sem))
>>> + goto out_upread;
>>> + }
>>> + return true;
>>> +
>>> +out_upread:
>>> + while (i--) {
>>> + adev = p->pdds[i]->dev->adev;
>>> + up_read(&adev->reset_domain->sem);
>>> + }
>>> + return false;
>>> +}
>>> +
>>> +static void kfd_unhold_devices_reset_semaphore(struct kfd_process *p)
>>> +{
>>> + struct amdgpu_device *adev;
>>> + int i;
>>> +
>>> + for (i = 0; i < p->n_pdds; i++) {
>>> + adev = p->pdds[i]->dev->adev;
>>> + up_read(&adev->reset_domain->sem);
>>> + }
>>> +}
>>> +
>>> static void restore_process_worker(struct work_struct *work)
>>> {
>>> struct delayed_work *dwork;
>>> @@ -2009,6 +2055,12 @@ static void restore_process_worker(struct work_struct *work)
>>> * lifetime of this thread, kfd_process p will be valid
>>> */
>>> p = container_of(dwork, struct kfd_process, restore_work);
>>> +
>>> + if (!kfd_hold_devices_reset_semaphore(p)) {
>>> + pr_debug("GPU resetting, restore bo and queue skipped\n");
>>
>> Should we reschedule the restore worker to make sure it runs again after the reset is done?
>
> After a GPU mode1 reset, the user processes have already been aborted, so it is meaningless to reschedule the restore worker to update GPU mappings.
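(For reference, the rescheduling fallback discussed above would look
roughly like the sketch below. This is illustration only, not part of the
patch; it uses the renamed trylock helper suggested earlier and assumes the
kfd_restore_wq workqueue and PROCESS_BACK_OFF_TIME_MS constant that
kfd_process.c already uses for restore retries.)

    /* Hypothetical alternative (not adopted): skip this restore attempt
     * and retry after a back-off, once the reset domain is available again.
     */
    if (!kfd_process_trylock_reset_sems(p)) {
        pr_debug("GPU resetting, retry restore after %d ms\n",
                 PROCESS_BACK_OFF_TIME_MS);
        queue_delayed_work(kfd_restore_wq, &p->restore_work,
                           msecs_to_jiffies(PROCESS_BACK_OFF_TIME_MS));
        return;
    }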
OK. With the fixed function names, the patch is
Reviewed-by: Felix Kuehling <felix.kuehling at amd.com>
>
> Regards,
>
> Philip
>
>>
>> Thanks,
>> Felix
>>
>>
>>> + return;
>>> + }
>>> +
>>> pr_debug("Started restoring pasid 0x%x\n", p->pasid);
>>> /* Setting last_restore_timestamp before successful restoration.
>>> @@ -2031,6 +2083,8 @@ static void restore_process_worker(struct work_struct *work)
>>> msecs_to_jiffies(PROCESS_RESTORE_TIME_MS)))
>>> kfd_process_restore_queues(p);
>>> }
>>> +
>>> + kfd_unhold_devices_reset_semaphore(p);
>>> }
>>> void kfd_suspend_all_processes(void)