[PATCH] drm/amdkfd: Don't call mmput from MMU notifier callback

Chen, Xiaogang xiaogang.chen at amd.com
Tue Jun 24 15:37:48 UTC 2025


On 6/24/2025 10:00 AM, Philip Yang wrote:
>
> On 2025-06-23 18:18, Chen, Xiaogang wrote:
>>
>>
>> On 6/23/2025 11:59 AM, Philip Yang wrote:
>>> If the process is exiting, the mmput inside the mmu notifier callback from
>>> compactd or fork or numa balancing could release the last reference of the
>>> mm struct, calling exit_mmap and free_pgtables; this triggers a deadlock
>>> with the backtrace below.
>>>
>>> The deadlock leaks the kfd process, because the mmu notifier release is
>>> never called, and causes a VRAM leak.
>>>
>>> The fix is to take the mm reference with mmget_not_zero when adding the
>>> prange to the deferred list, to pair with the mmput in the deferred list
>>> work.
>>>
>>> The backtrace of hung task:
>>>
>>>   INFO: task python:348105 blocked for more than 64512 seconds.
>>>   Call Trace:
>>>    __schedule+0x1c3/0x550
>>>    schedule+0x46/0xb0
>>>    rwsem_down_write_slowpath+0x24b/0x4c0
>>>    unlink_anon_vmas+0xb1/0x1c0
>>>    free_pgtables+0xa9/0x130
>>>    exit_mmap+0xbc/0x1a0
>>>    mmput+0x5a/0x140
>>>    svm_range_cpu_invalidate_pagetables+0x2b/0x40 [amdgpu]
>>>    mn_itree_invalidate+0x72/0xc0
>>>    __mmu_notifier_invalidate_range_start+0x48/0x60
>>>    try_to_unmap_one+0x10fa/0x1400
>>>    rmap_walk_anon+0x196/0x460
>>>    try_to_unmap+0xbb/0x210
>>>    migrate_page_unmap+0x54d/0x7e0
>>>    migrate_pages_batch+0x1c3/0xae0
>>>    migrate_pages_sync+0x98/0x240
>>>    migrate_pages+0x25c/0x520
>>>    compact_zone+0x29d/0x590
>>>    compact_zone_order+0xb6/0xf0
>>>    try_to_compact_pages+0xbe/0x220
>>>    __alloc_pages_direct_compact+0x96/0x1a0
>>>    __alloc_pages_slowpath+0x410/0x930
>>>    __alloc_pages_nodemask+0x3a9/0x3e0
>>>    do_huge_pmd_anonymous_page+0xd7/0x3e0
>>>    __handle_mm_fault+0x5e3/0x5f0
>>>    handle_mm_fault+0xf7/0x2e0
>>>    hmm_vma_fault.isra.0+0x4d/0xa0
>>>    walk_pmd_range.isra.0+0xa8/0x310
>>>    walk_pud_range+0x167/0x240
>>>    walk_pgd_range+0x55/0x100
>>>    __walk_page_range+0x87/0x90
>>>    walk_page_range+0xf6/0x160
>>>    hmm_range_fault+0x4f/0x90
>>>    amdgpu_hmm_range_get_pages+0x123/0x230 [amdgpu]
>>>    amdgpu_ttm_tt_get_user_pages+0xb1/0x150 [amdgpu]
>>>    init_user_pages+0xb1/0x2a0 [amdgpu]
>>>    amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x543/0x7d0 [amdgpu]
>>>    kfd_ioctl_alloc_memory_of_gpu+0x24c/0x4e0 [amdgpu]
>>>    kfd_ioctl+0x29d/0x500 [amdgpu]
>>>
>>> Fixes: fa582c6f3684 ("drm/amdkfd: Use mmget_not_zero in MMU notifier")
>>> Signed-off-by: Philip Yang <Philip.Yang at amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 23 +++++++++++------------
>>>   1 file changed, 11 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> index 10d1276f8e1c..5fe92f9a1ce1 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> @@ -2392,15 +2392,17 @@ svm_range_add_list_work(struct svm_range_list *svms, struct svm_range *prange,
>>>               prange->work_item.op != SVM_OP_UNMAP_RANGE)
>>>               prange->work_item.op = op;
>>>       } else {
>>> -        prange->work_item.op = op;
>>> -
>>> -        /* Pairs with mmput in deferred_list_work */
>>> -        mmget(mm);
>>> -        prange->work_item.mm = mm;
>>> -        list_add_tail(&prange->deferred_list,
>>> -                      &prange->svms->deferred_range_list);
>>> -        pr_debug("add prange 0x%p [0x%lx 0x%lx] to work list op %d\n",
>>> -             prange, prange->start, prange->last, op);
>>> +        /* Pairs with mmput in deferred_list_work.
>>> +         * If process is exiting and mm is gone, don't update mmu notifier.
>>> +         */
>>> +        if (mmget_not_zero(mm)) {
>> If the process is exiting we don't need to call schedule_deferred_list_work
>> either. I think this part of the code needs to be reorganized around the
>> mmget_not_zero(mm) check.
> Yes, that is right. We could change svm_range_add_list_work to return true
> only if a new work_item is added to the deferred list, and schedule the
> deferred work in that case (see the sketch after this hunk). This
> optimization could be a separate patch; it is not related to the deadlock
> issue.
>>> +            prange->work_item.mm = mm;
>>> +            prange->work_item.op = op;
>>> +            list_add_tail(&prange->deferred_list,
>>> +                          &prange->svms->deferred_range_list);
>>> +            pr_debug("add prange 0x%p [0x%lx 0x%lx] to work list op %d\n",
>>> +                 prange, prange->start, prange->last, op);
>>> +        }
>>>       }
>>>       spin_unlock(&svms->deferred_list_lock);
>>>   }
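
A rough sketch of that reorganization, with svm_range_add_list_work reporting
whether a new work item was queued so the caller only schedules the deferred
work when needed. This is only an illustration: the already-queued branch and
helper names such as schedule_deferred_list_work and enum svm_work_list_ops
follow the quoted code and are assumptions here, not the actual patch.

static bool
svm_range_add_list_work(struct svm_range_list *svms, struct svm_range *prange,
			struct mm_struct *mm, enum svm_work_list_ops op)
{
	bool queued = false;

	spin_lock(&svms->deferred_list_lock);
	if (!list_empty(&prange->deferred_list)) {
		/* already on the deferred list, only upgrade the op */
		if (prange->work_item.op != SVM_OP_UNMAP_RANGE)
			prange->work_item.op = op;
	} else if (mmget_not_zero(mm)) {
		/* Pairs with mmput in deferred_list_work.
		 * If the process is exiting and mm is gone, skip the update.
		 */
		prange->work_item.mm = mm;
		prange->work_item.op = op;
		list_add_tail(&prange->deferred_list,
			      &prange->svms->deferred_range_list);
		queued = true;
	}
	spin_unlock(&svms->deferred_list_lock);

	return queued;
}

The caller would then only kick the worker when something was actually added:

	if (svm_range_add_list_work(svms, prange, mm, SVM_OP_UNMAP_RANGE))
		schedule_deferred_list_work(svms);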
>>> @@ -2568,8 +2570,6 @@ svm_range_cpu_invalidate_pagetables(struct mmu_interval_notifier *mni,
>>>         if (range->event == MMU_NOTIFY_RELEASE)
>>>           return true;
>>> -    if (!mmget_not_zero(mni->mm))
>>> -        return true;
>>
>> Why remove mmget_not_zero(mni->mm)/mmput(mni->mm) here? I think they serve a
>> different purpose from the mmget_not_zero(mm) at svm_range_add_list_work.
>>
> Yes, the mmget_not_zero/mmput inside the MMU notifier was added to fix
> another issue, the deferred work accessing an invalid mm, by ensuring the
> deferred work_item.mm holds an mm reference. But in the backtrace case, the
> mmput inside the MMU notifier callback releases the last reference and
> causes the deadlock.

You have used mmget_not_zero at svm_range_add_list_work to prevent this
deadlock, so the mmput inside the MMU notifier callback should not cause a
deadlock now. The mmget_not_zero/mmput pair inside the MMU notifier is used at
a higher level to prevent an invalid mm from being used in all cases, so it
should still be there.
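
For reference, a much-simplified sketch of the mmput side of the pairing in
the deferred work handler; the real svm_range_deferred_list_work also drains
retry faults and handles the per-range op under the mmap lock, so the loop
structure below is only an approximation:

static void svm_range_deferred_list_work(struct work_struct *work)
{
	struct svm_range_list *svms;
	struct svm_range *prange;
	struct mm_struct *mm;

	svms = container_of(work, struct svm_range_list, deferred_list_work);

	spin_lock(&svms->deferred_list_lock);
	while (!list_empty(&svms->deferred_range_list)) {
		prange = list_first_entry(&svms->deferred_range_list,
					  struct svm_range, deferred_list);
		list_del_init(&prange->deferred_list);
		spin_unlock(&svms->deferred_list_lock);

		mm = prange->work_item.mm;
		/* ... handle prange->work_item.op under mmap_write_lock(mm) ... */

		/* Pairs with mmget_not_zero in svm_range_add_list_work.
		 * If this drops the last reference, exit_mmap runs here in
		 * worker context rather than under the rmap locks held when
		 * the MMU notifier callback is invoked, avoiding the deadlock
		 * in the backtrace above.
		 */
		mmput(mm);

		spin_lock(&svms->deferred_list_lock);
	}
	spin_unlock(&svms->deferred_list_lock);
}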

Regards

Xiaogang

>
> Regards,
>
> Philip
>
>> Regards
>>
>> Xiaogang
>>
>>>         start = mni->interval_tree.start;
>>>       last = mni->interval_tree.last;
>>> @@ -2596,7 +2596,6 @@ svm_range_cpu_invalidate_pagetables(struct mmu_interval_notifier *mni,
>>>       }
>>>         svm_range_unlock(prange);
>>> -    mmput(mni->mm);
>>>         return true;
>>>   }

