[PATCH v2] drm/amdkfd: not restore userptr buffer if kfd process has been removed
Chen, Xiaogang
xiaogang.chen at amd.com
Fri Oct 4 15:00:08 UTC 2024
On 10/4/2024 9:46 AM, Felix Kuehling wrote:
>
> On 2024-10-04 10:39, Russell, Kent wrote:
>> [Public]
>>
>> My mistake.
>>
>>> -----Original Message-----
>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Russell,
>>> Kent
>>> Sent: Friday, October 4, 2024 9:53 AM
>>> To: Chen, Xiaogang <Xiaogang.Chen at amd.com>; Kuehling, Felix
>>> <Felix.Kuehling at amd.com>; amd-gfx at lists.freedesktop.org
>>> Cc: Yang, Philip <Philip.Yang at amd.com>
>>> Subject: RE: [PATCH v2] drm/amdkfd: not restore userptr buffer if kfd process has
>>> been removed
>>>
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>>> -----Original Message-----
>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Chen,
>>>> Xiaogang
>>>> Sent: Thursday, October 3, 2024 6:11 PM
>>>> To: Kuehling, Felix <Felix.Kuehling at amd.com>; amd-gfx at lists.freedesktop.org
>>>> Cc: Yang, Philip <Philip.Yang at amd.com>
>>>> Subject: Re: [PATCH v2] drm/amdkfd: not restore userptr buffer if kfd process has
>>>> been removed
>>>>
>>>>
>>>> On 10/3/2024 4:11 PM, Felix Kuehling wrote:
>>>>> On 2024-10-03 16:55, Xiaogang.Chen wrote:
>>>>>> From: Xiaogang Chen <xiaogang.chen at amd.com>
>>>>>>
>>>>>> When kfd process has been terminated not restore userptr buffer after
>>>>>> mmu notifier invalidates a range.
>>>>>>
>>>>>> Signed-off-by: Xiaogang Chen<Xiaogang.Chen at amd.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 12 ++++++++----
>>>>>> 1 file changed, 8 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>>> index ce5ca304dba9..1df0926b63b3 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>>> @@ -2524,11 +2524,15 @@ int amdgpu_amdkfd_evict_userptr(struct
>>>>>> mmu_interval_notifier *mni,
>>>>>> /* First eviction, stop the queues */
>>>>>> r = kgd2kfd_quiesce_mm(mni->mm,
>>>>>> KFD_QUEUE_EVICTION_TRIGGER_USERPTR);
>>>>>> - if (r)
>>>>>> +
>>>>>> + if (r && r != -ESRCH)
>>>>>> pr_err("Failed to quiesce KFD\n");
>>>>>> - queue_delayed_work(system_freezable_wq,
>>>>>> - &process_info->restore_userptr_work,
>>>>>> - msecs_to_jiffies(AMDGPU_USERPTR_RESTORE_DELAY_MS));
>>>>>> +
>>>>>> + if (!r || r != -ESRCH) {
>>>>> This condition is always true.
>>>>>
>>>> so sure why this condition is always true? kgd2kfd_quiesce_mm can
>>>> return -ESRCH when it cannot find kfd process correspondent to mni->mm,
>>>> then above checking will be false, then will not queue restore work item
>>>> into system_freezable_wq.
>>> If you expand the 2 conditions, it becomes "if (r !=0 || r != -3)", which will always be
>>> true for any value of r.
>>>
>> I got this wrong. So it's either r==0 or r==-3 (I need some caffeine). The function returns things back up from evict_queues, mqd_destroy, and can eventually return EIO or ETIME in the hqd_destroy function, so r can indeed be different values than 0/-3. Sorry for my confusion here.
> It's (r == 0 || r != -3). So yeah, I was wrong, it's not always true. But it's still confusing because (r == 0) is redundant. I guess you could just write
>
> if (r != -ESRCH) {
> ...
> }
>
> I don't know if that matches your intention.
Yes, the condition check confused people though not wrong. My thought
was: when there is no error, or the error is not -3, do userptr restore.
So only when error is -3, not do userptr restore.
I will send new one to simply it.
Thanks
Xiaogang
>
> Regards,
> Felix
>
>> Kent
>>
>>> Kent
>>>
>>>> Regards
>>>>
>>>> Xiaogang
>>>>
>>>>> Regards,
>>>>> Felix
>>>>>
>>>>>
>>>>>> + queue_delayed_work(system_freezable_wq,
>>>>>> + &process_info->restore_userptr_work,
>>>>>> + msecs_to_jiffies(AMDGPU_USERPTR_RESTORE_DELAY_MS));
>>>>>> + }
>>>>>> }
>>>>>> mutex_unlock(&process_info->notifier_lock);
More information about the amd-gfx
mailing list