[PATCH] drm/amdkfd: improve performance with XNACK enable

Christian König christian.koenig at amd.com
Mon May 12 06:57:28 UTC 2025


On 5/9/25 14:45, James Zhu wrote:
> 
> On 2025-05-09 02:00, Christian König wrote:
>> On 5/8/25 19:25, James Zhu wrote:
>>> On 2025-05-08 11:20, James Zhu wrote:
>>>> On 2025-05-08 10:50, Christian König wrote:
>>>>> On 5/8/25 16:46, James Zhu wrote:
>>>>>> With XNACK on, a hang or low performance is observed in some test cases.
>>>>>> The restore-pages process gets stuck unexpectedly during eviction/restore
>>>>>> when a bo has the KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED flag set:
>>>>>> 1. with XNACK on, a retry page fault invokes the restore-pages process
>>>>>> 2. A. if there is enough VRAM space, pages are simply migrated from RAM to VRAM
>>>>>>     B. if there is not enough VRAM space left, the resource LRU list is searched
>>>>>>        and an eviction work item is scheduled to evict an LRU bo from VRAM to RAM
>>>>>>        first; the restore-pages process then resumes, or waits for the eviction
>>>>>>        timeout and tries to schedule eviction of the next LRU bo
>>>>>> 3. in case 2B, if the bo has KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED set, a queue
>>>>>>     eviction is triggered, so the restore work item gets scheduled
>>>>>> 4. in step 1, the restore-pages process holds an mm->mmap_lock read until
>>>>>>     restoring the pages is completed;
>>>>>>     in step 2B, the eviction worker holds an mm->mmap_lock read until evicting
>>>>>>     the bo is completed;
>>>>>>     in step 3, the restore worker tries to acquire the mm->mmap_lock write after
>>>>>>     the above two mm->mmap_lock reads are released, and in the meantime the
>>>>>>     pending writer blocks all following mm->mmap_lock read requests
>>>>>> 5. in step 2, if the first evicted bo is big enough for the restore-pages
>>>>>>     request from step 1, everything is fine. If not, the mm->mmap_lock read from
>>>>>>     step 1 is not released right away. In step 3, the first evicted bo's restore
>>>>>>     worker competes for the mm->mmap_lock write, and the eviction workers for the
>>>>>>     second and following LRU bos are blocked trying to acquire the mm->mmap_lock
>>>>>>     read until timeout. All restore-pages processing gets stuck here.
>>>>>> Replacing mmap_write_lock with down_write_trylock avoids blocking the eviction
>>>>>> work items for the second and following bos.
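>>>>>>
>>>>>> The interaction can be sketched roughly like this (simplified and illustrative
>>>>>> only, not the literal call chains or function contexts in kfd_svm.c):
>>>>>>
>>>>>>     /* (1) retry-fault handler, restoring pages */
>>>>>>     mmap_read_lock(mm);      /* held until the restore completes */
>>>>>>
>>>>>>     /* (2B) eviction worker, evicting the first LRU bo */
>>>>>>     mmap_read_lock(mm);      /* held until that eviction completes */
>>>>>>
>>>>>>     /* (3) restore worker for the ALWAYS_MAPPED bo */
>>>>>>     mmap_write_lock(mm);     /* queues as a writer waiter; the rwsem then
>>>>>>                               * blocks every later reader, including the
>>>>>>                               * eviction worker for the next LRU bo, so
>>>>>>                               * step (1) can never finish */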
>>>>>>
>>>>>> Signed-off-by: James Zhu <James.Zhu at amd.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 6 +++++-
>>>>>>   1 file changed, 5 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>>>> index 72be6e152e88..5f6ed70559b7 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>>>> @@ -1794,7 +1794,11 @@ svm_range_list_lock_and_flush_work(struct svm_range_list *svms,
>>>>>>   {
>>>>>>   retry_flush_work:
>>>>>>       flush_work(&svms->deferred_list_work);
>>>>>> -    mmap_write_lock(mm);
>>>>>> +    while (true) {
>>>>>> +        if (down_write_trylock(&mm->mmap_lock))
>>>>>> +            break;
>>>>>> +        schedule();
>>>>>> +    }
>>>>> Oh, stuff like that is usually an absolutely clear NAK from upstream.
>>>>>
>>>>> As far as I know that is not something we can do so easily.
>>>>>
>>>>> Would it be possible to wait for progress instead of calling schedule() here?
>>>> [JZ] At the very beginning I was thinking of adding a sync with restore-pages completion,
>>>> but the original restore-work design philosophy is to schedule it blindly after a certain
>>>> delay. A change that adds sync may take more time and carry more risk. I would like Felix
>>>> and Philip to comment on whether there is an efficient and safe way to fix this. Thanks!
>>> [JZ] BTW, in the worst case, mmap_write_lock will fall into rwsem_down_write_slowpath(), schedule_preempt_disabled() and schedule().
>> Yeah, but drivers are not allowed to re-implement or even bypass that logic.
> 
> [JZ] Here this can be taken as a new version of mmap_write_lock that does not block the
> following read requests while competing for the write lock, which means reads have higher
> priority in this case. Logically, sync would be the better replacement. In practice, in the
> current scenario, sync would add other burdens and risk.

Yeah, and exactly that is messing with the rwsem internals and is an absolutely clear no-go.

Stuff like that is not allowed, not even remotely.
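
What I meant earlier with waiting for progress was more along these lines. This is a
completely untested sketch, and the wait queue, the progress counter and the helper names
are made up for illustration, they don't exist in kfd_svm.c today:

#include <linux/atomic.h>
#include <linux/jiffies.h>
#include <linux/mm_types.h>
#include <linux/mmap_lock.h>
#include <linux/wait.h>

/* Hypothetical: signalled by the eviction path whenever an LRU bo has actually
 * been evicted, i.e. whenever the blocked restore path can have made progress.
 */
static DECLARE_WAIT_QUEUE_HEAD(svm_eviction_wq);
static atomic_t svm_eviction_seq = ATOMIC_INIT(0);

/* Eviction side: call after an LRU bo was successfully evicted. */
static void svm_eviction_done_notify(void)
{
        atomic_inc(&svm_eviction_seq);
        wake_up_all(&svm_eviction_wq);
}

/* Restore side: instead of looping on down_write_trylock() + schedule(),
 * sleep until the eviction path reports progress (or a short timeout).
 */
static void svm_lock_mm_for_restore(struct mm_struct *mm)
{
        while (!mmap_write_trylock(mm)) {
                int seq = atomic_read(&svm_eviction_seq);

                wait_event_timeout(svm_eviction_wq,
                                   atomic_read(&svm_eviction_seq) != seq,
                                   msecs_to_jiffies(10));
        }
}

That only illustrates the "wait for progress" part of my earlier question; it still avoids
queueing as a writer, so it doesn't by itself make the approach acceptable. Whether any
variant of this is a good idea is something Felix and Philip should judge.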

Regards,
Christian.

> 
>> Regards,
>> Christian.
>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>         if (list_empty(&svms->deferred_range_list))
>>>>>>           return;


