[PATCH] drm/amdkfd: Don't call mmput from MMU notifier callback

Felix Kuehling felix.kuehling at amd.com
Tue Jun 24 15:46:56 UTC 2025


On 2025-06-24 11:37, Chen, Xiaogang wrote:
>
> On 6/24/2025 10:00 AM, Philip Yang wrote:
>>
>> On 2025-06-23 18:18, Chen, Xiaogang wrote:
>>>
>>>
>>> On 6/23/2025 11:59 AM, Philip Yang wrote:
>>>> If the process is exiting, the mmput inside the MMU notifier callback,
>>>> invoked from memory compaction, fork or NUMA balancing, could release
>>>> the last reference of the mm struct and call exit_mmap and
>>>> free_pgtables, which triggers the deadlock shown in the backtrace below.
>>>>
>>>> The deadlock leaks the kfd process, because the MMU notifier release is
>>>> never called, and thereby leaks VRAM.
>>>>
>>>> The fix is to take an mm reference with mmget_not_zero when adding the
>>>> prange to the deferred list, paired with the mmput in the deferred list
>>>> work.
>>>>
>>>> The backtrace of the hung task:
>>>>
>>>>   INFO: task python:348105 blocked for more than 64512 seconds.
>>>>   Call Trace:
>>>>    __schedule+0x1c3/0x550
>>>>    schedule+0x46/0xb0
>>>>    rwsem_down_write_slowpath+0x24b/0x4c0
>>>>    unlink_anon_vmas+0xb1/0x1c0
>>>>    free_pgtables+0xa9/0x130
>>>>    exit_mmap+0xbc/0x1a0
>>>>    mmput+0x5a/0x140
>>>>    svm_range_cpu_invalidate_pagetables+0x2b/0x40 [amdgpu]
>>>>    mn_itree_invalidate+0x72/0xc0
>>>>    __mmu_notifier_invalidate_range_start+0x48/0x60
>>>>    try_to_unmap_one+0x10fa/0x1400
>>>>    rmap_walk_anon+0x196/0x460
>>>>    try_to_unmap+0xbb/0x210
>>>>    migrate_page_unmap+0x54d/0x7e0
>>>>    migrate_pages_batch+0x1c3/0xae0
>>>>    migrate_pages_sync+0x98/0x240
>>>>    migrate_pages+0x25c/0x520
>>>>    compact_zone+0x29d/0x590
>>>>    compact_zone_order+0xb6/0xf0
>>>>    try_to_compact_pages+0xbe/0x220
>>>>    __alloc_pages_direct_compact+0x96/0x1a0
>>>>    __alloc_pages_slowpath+0x410/0x930
>>>>    __alloc_pages_nodemask+0x3a9/0x3e0
>>>>    do_huge_pmd_anonymous_page+0xd7/0x3e0
>>>>    __handle_mm_fault+0x5e3/0x5f0
>>>>    handle_mm_fault+0xf7/0x2e0
>>>>    hmm_vma_fault.isra.0+0x4d/0xa0
>>>>    walk_pmd_range.isra.0+0xa8/0x310
>>>>    walk_pud_range+0x167/0x240
>>>>    walk_pgd_range+0x55/0x100
>>>>    __walk_page_range+0x87/0x90
>>>>    walk_page_range+0xf6/0x160
>>>>    hmm_range_fault+0x4f/0x90
>>>>    amdgpu_hmm_range_get_pages+0x123/0x230 [amdgpu]
>>>>    amdgpu_ttm_tt_get_user_pages+0xb1/0x150 [amdgpu]
>>>>    init_user_pages+0xb1/0x2a0 [amdgpu]
>>>>    amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x543/0x7d0 [amdgpu]
>>>>    kfd_ioctl_alloc_memory_of_gpu+0x24c/0x4e0 [amdgpu]
>>>>    kfd_ioctl+0x29d/0x500 [amdgpu]
>>>>
>>>> Fixes: fa582c6f3684 ("drm/amdkfd: Use mmget_not_zero in MMU notifier")
>>>> Signed-off-by: Philip Yang <Philip.Yang at amd.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 23 +++++++++++------------
>>>>   1 file changed, 11 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>> index 10d1276f8e1c..5fe92f9a1ce1 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>> @@ -2392,15 +2392,17 @@ svm_range_add_list_work(struct svm_range_list *svms, struct svm_range *prange,
>>>>               prange->work_item.op != SVM_OP_UNMAP_RANGE)
>>>>               prange->work_item.op = op;
>>>>       } else {
>>>> -        prange->work_item.op = op;
>>>> -
>>>> -        /* Pairs with mmput in deferred_list_work */
>>>> -        mmget(mm);
>>>> -        prange->work_item.mm = mm;
>>>> -        list_add_tail(&prange->deferred_list,
>>>> -                  &prange->svms->deferred_range_list);
>>>> -        pr_debug("add prange 0x%p [0x%lx 0x%lx] to work list op %d\n",
>>>> -             prange, prange->start, prange->last, op);
>>>> +        /* Pairs with mmput in deferred_list_work.
>>>> +         * If process is exiting and mm is gone, don't update mmu notifier.
>>>> +         */
>>>> +        if (mmget_not_zero(mm)) {
>>> If the process is exiting, we don't need to call schedule_deferred_list_work either. I think this part of the code needs to be reorganized around the mmget_not_zero(mm) check.
>> Yes, that is right. We could change svm_range_add_list_work to return true only if a new work_item was added to the deferred list, and only then schedule the deferred work. That optimization could be a separate patch; it is not related to the deadlock issue.
>>>> +            prange->work_item.mm = mm;
>>>> +            prange->work_item.op = op;
>>>> +            list_add_tail(&prange->deferred_list,
>>>> +                      &prange->svms->deferred_range_list);
>>>> +            pr_debug("add prange 0x%p [0x%lx 0x%lx] to work list op %d\n",
>>>> +                 prange, prange->start, prange->last, op);
>>>> +        }
>>>>       }
>>>>       spin_unlock(&svms->deferred_list_lock);
>>>>   }
>>>> @@ -2568,8 +2570,6 @@ svm_range_cpu_invalidate_pagetables(struct mmu_interval_notifier *mni,
>>>>         if (range->event == MMU_NOTIFY_RELEASE)
>>>>           return true;
>>>> -    if (!mmget_not_zero(mni->mm))
>>>> -        return true;
>>>
>>> Why remove mmget_not_zero(mni->mm)/mmput(mni->mm) here? I think they serve a different purpose from the mmget_not_zero(mm) in svm_range_add_list_work.
>>>
>> Yes, the mmget_not_zero/mmput inside the MMU notifier was added to fix a different issue, where the deferred work accessed an invalid mm, by ensuring that the deferred work_item.mm holds an mm reference. But in the backtrace case, the mmput inside the MMU notifier callback releases the last reference and causes the deadlock.
>
> You have used mmget_not_zero in svm_range_add_list_work to prevent this deadlock, so the mmput inside the MMU notifier callback should not cause a deadlock now. The mmget_not_zero/mmput inside the MMU notifier is used at a higher level to prevent an invalid mm from being used in all cases. They should still be there.

No, it's OK. We can use the mm pointer in the MMU notifier without calling mmget or mmget_not_zero, even when that refcount is 0. There are two refcounts in the mm_struct: the mmget/mmput refcount represents the lifetime of the address space, and the mmgrab/mmdrop refcount protects the data structure itself from being freed. The latter must be held by the caller of the MMU notifier; otherwise the caller couldn't rely on the mm pointer itself.
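
To make that distinction concrete, here is a minimal sketch of the two-refcount model. The mm_users and mm_count fields and the mmget_not_zero/mmput/mmgrab/mmdrop helpers are the real kernel ones; the function itself is only illustrative, not actual kfd code:

#include <linux/sched/mm.h>

/*
 * mm->mm_users: mmget()/mmput() -- lifetime of the address space.
 * When it drops to zero, mmput() calls exit_mmap() and frees the
 * mappings and page tables.
 *
 * mm->mm_count: mmgrab()/mmdrop() -- lifetime of the mm_struct
 * itself. The caller of an MMU notifier must already hold this one,
 * which is why the mm pointer stays valid in the callback even when
 * mm_users has reached zero.
 */
static bool use_address_space(struct mm_struct *mm)
{
        /* Pin the address space only if the process is not exiting. */
        if (!mmget_not_zero(mm))
                return false;   /* mm_users was already zero */

        /* ... safe to fault or walk the address space here ... */

        /* May call exit_mmap() if this was the last user reference. */
        mmput(mm);
        return true;
}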

We need to take another mm reference whenever we store an mm pointer in our own data structures, such as a work item. Philip's change accomplishes that by calling mmget_not_zero only where we pass the mm pointer into a work item, and by dropping that reference in the work item itself.
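
A hedged sketch of that pattern (the identifiers below are illustrative, not the actual kfd_svm.c names): take the reference only when storing the mm pointer in a work item, and drop it from the worker, where releasing the last reference cannot deadlock against the invalidation path:

#include <linux/list.h>
#include <linux/sched/mm.h>

struct deferred_item {
        struct list_head list;
        struct mm_struct *mm;   /* holds one mm_users reference */
};

static bool queue_deferred_item(struct deferred_item *item,
                                struct mm_struct *mm,
                                struct list_head *deferred_list)
{
        /* If the process is exiting, mm_users is zero; don't queue. */
        if (!mmget_not_zero(mm))
                return false;

        item->mm = mm;
        list_add_tail(&item->list, deferred_list);
        return true;
}

static void deferred_work(struct deferred_item *item)
{
        /* ... update MMU notifiers, unmap ranges, etc. ... */

        /*
         * Pairs with mmget_not_zero() above. Dropping the last
         * reference here is safe because we are not inside an MMU
         * notifier callback, so exit_mmap() cannot deadlock on locks
         * held by the invalidation path.
         */
        mmput(item->mm);
}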

Regards,
  Felix


>
> Regards
>
> Xiaogang
>
>>
>> Regards,
>>
>> Philip
>>
>>> Regards
>>>
>>> Xiaogang
>>>
>>>>         start = mni->interval_tree.start;
>>>>       last = mni->interval_tree.last;
>>>> @@ -2596,7 +2596,6 @@ svm_range_cpu_invalidate_pagetables(struct mmu_interval_notifier *mni,
>>>>       }
>>>>         svm_range_unlock(prange);
>>>> -    mmput(mni->mm);
>>>>         return true;
>>>>   }

