[PATCH v2] drm/amdkfd: not restore userptr buffer if kfd process has been removed

Chen, Xiaogang xiaogang.chen at amd.com
Fri Oct 4 15:00:08 UTC 2024


On 10/4/2024 9:46 AM, Felix Kuehling wrote:
>
> On 2024-10-04 10:39, Russell, Kent wrote:
>> [Public]
>>
>> My mistake.
>>
>>> -----Original Message-----
>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Russell,
>>> Kent
>>> Sent: Friday, October 4, 2024 9:53 AM
>>> To: Chen, Xiaogang <Xiaogang.Chen at amd.com>; Kuehling, Felix
>>> <Felix.Kuehling at amd.com>; amd-gfx at lists.freedesktop.org
>>> Cc: Yang, Philip <Philip.Yang at amd.com>
>>> Subject: RE: [PATCH v2] drm/amdkfd: not restore userptr buffer if kfd process has
>>> been removed
>>>
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>>> -----Original Message-----
>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Chen,
>>>> Xiaogang
>>>> Sent: Thursday, October 3, 2024 6:11 PM
>>>> To: Kuehling, Felix <Felix.Kuehling at amd.com>; amd-gfx at lists.freedesktop.org
>>>> Cc: Yang, Philip <Philip.Yang at amd.com>
>>>> Subject: Re: [PATCH v2] drm/amdkfd: not restore userptr buffer if kfd process has
>>>> been removed
>>>>
>>>>
>>>> On 10/3/2024 4:11 PM, Felix Kuehling wrote:
>>>>> On 2024-10-03 16:55, Xiaogang.Chen wrote:
>>>>>> From: Xiaogang Chen <xiaogang.chen at amd.com>
>>>>>>
>>>>>> When kfd process has been terminated not restore userptr buffer after
>>>>>> mmu notifier invalidates a range.
>>>>>>
>>>>>> Signed-off-by: Xiaogang Chen<Xiaogang.Chen at amd.com>
>>>>>> ---
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 12 ++++++++----
>>>>>>    1 file changed, 8 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>>> index ce5ca304dba9..1df0926b63b3 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>>> @@ -2524,11 +2524,15 @@ int amdgpu_amdkfd_evict_userptr(struct
>>>>>> mmu_interval_notifier *mni,
>>>>>>            /* First eviction, stop the queues */
>>>>>>            r = kgd2kfd_quiesce_mm(mni->mm,
>>>>>>                           KFD_QUEUE_EVICTION_TRIGGER_USERPTR);
>>>>>> -        if (r)
>>>>>> +
>>>>>> +        if (r && r != -ESRCH)
>>>>>>                pr_err("Failed to quiesce KFD\n");
>>>>>> -        queue_delayed_work(system_freezable_wq,
>>>>>> -            &process_info->restore_userptr_work,
>>>>>> -            msecs_to_jiffies(AMDGPU_USERPTR_RESTORE_DELAY_MS));
>>>>>> +
>>>>>> +        if (!r || r != -ESRCH) {
>>>>> This condition is always true.
>>>>>
>>>> so sure why this condition is always true?  kgd2kfd_quiesce_mm can
>>>> return -ESRCH when it cannot find kfd process correspondent to mni->mm,
>>>> then above checking will be false, then will not queue restore work item
>>>> into system_freezable_wq.
>>> If you expand the 2 conditions, it becomes "if (r !=0 || r != -3)", which will always be
>>> true for any value of r.
>>>
>> I got this wrong. So it's either r==0 or r==-3 (I need some caffeine). The function returns things back up from evict_queues, mqd_destroy, and can eventually return EIO or ETIME in the hqd_destroy function, so r can indeed be different values than 0/-3. Sorry for my confusion here.
> It's (r == 0 || r != -3). So yeah, I was wrong, it's not always true. But it's still confusing because (r == 0) is redundant. I guess you could just write
>
> 	if (r != -ESRCH) {
> 		...
> 	}
>
> I don't know if that matches your intention.

Yes, the condition check confused people though not wrong. My thought 
was: when there is no error, or the error is not -3, do userptr restore. 
So only when error is -3, not do userptr restore.

I will send new one to simply it.


Thanks

Xiaogang

>
> Regards,
>    Felix
>
>>   Kent
>>
>>>   Kent
>>>
>>>> Regards
>>>>
>>>> Xiaogang
>>>>
>>>>> Regards,
>>>>>    Felix
>>>>>
>>>>>
>>>>>> + queue_delayed_work(system_freezable_wq,
>>>>>> +                &process_info->restore_userptr_work,
>>>>>> + msecs_to_jiffies(AMDGPU_USERPTR_RESTORE_DELAY_MS));
>>>>>> +        }
>>>>>>        }
>>>>>>        mutex_unlock(&process_info->notifier_lock);


More information about the amd-gfx mailing list