Re: 回复: [RFC PATCH 1/2] drm/amdgpu: Fix memory corruption due to swapout and swapin

Felix Kuehling felix.kuehling at amd.com
Wed May 19 15:11:11 UTC 2021


Looks like we're creating the userptr BO as ttm_bo_type_device. I guess
we should be using ttm_bo_type_sg? BTW, amdgpu_gem_userptr_ioctl also
uses ttm_bo_type_device.

Regards,
  Felix


Am 2021-05-19 um 6:01 a.m. schrieb Christian König:
> I'm scratching my head how that is even possible.
>
> See when a BO is created in the system domain it is just an empty
> hull, e.g. without backing store and allocated pages.
>
> So the swapout function will just ignore it.
>
> Christian.
>
> Am 19.05.21 um 07:07 schrieb Pan, Xinhui:
>> [AMD Official Use Only]
>>
>> I have reverted Chris'  patch, still hit this failure.
>> Just see two lines in Chris' patch. Any BO in cpu domian would be
>> swapout first. That is why we hit this issue frequently now. But the
>> bug is there long time ago.
>>
>> -       for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i) {
>> -               list_for_each_entry(bo, &glob->swap_lru[i], swap) {
>> [snip]
>> +       for (i = TTM_PL_SYSTEM; i < TTM_NUM_MEM_TYPES; ++i) {
>> +               for (j = 0; j < TTM_MAX_BO_PRIORITY; ++j) {
>>
>>
>> ________________________________________
>> 发件人: Pan, Xinhui <Xinhui.Pan at amd.com>
>> 发送时间: 2021年5月19日 12:09
>> 收件人: Kuehling, Felix; amd-gfx at lists.freedesktop.org
>> 抄送: Deucher, Alexander; Koenig, Christian;
>> dri-devel at lists.freedesktop.org; daniel at ffwll.ch
>> 主题: 回复: [RFC PATCH 1/2] drm/amdgpu: Fix memory corruption due to
>> swapout and swapin
>>
>> yes, we really dont swapout SG BOs.
>> The problems is that before we validate a userptr BO, we create this
>> BO in CPU domain by default. So this BO has chance to swapout.
>>
>> we set flag TTM_PAGE_FLAG_SG on userptr BO in popluate() which is too
>> late.
>> I have not try to revert Chris' patch as I think it desnt help. Or I
>> can have a try later.
>>
>> ________________________________________
>> 发件人: Kuehling, Felix <Felix.Kuehling at amd.com>
>> 发送时间: 2021年5月19日 11:29
>> 收件人: Pan, Xinhui; amd-gfx at lists.freedesktop.org
>> 抄送: Deucher, Alexander; Koenig, Christian;
>> dri-devel at lists.freedesktop.org; daniel at ffwll.ch
>> 主题: Re: [RFC PATCH 1/2] drm/amdgpu: Fix memory corruption due to
>> swapout and swapin
>>
>> Swapping SG BOs makes no sense, because TTM doesn't own the pages of
>> this type of BO.
>>
>> Last I checked, userptr BOs (and other SG BOs) were protected from
>> swapout by the fact that they would not be added to the swap-LRU. But it
>> looks like Christian just removed the swap-LRU. I guess this broke that
>> protection:
>>
>> commit 2cb51d22d70b18eaf339abf9758bf0b7608da65c
>> Author: Christian König <christian.koenig at amd.com>
>> Date:   Tue Oct 6 16:30:09 2020 +0200
>>
>>       drm/ttm: remove swap LRU v3
>>
>>       Instead evict round robin from each devices SYSTEM and TT domain.
>>
>>       v2: reorder num_pages access reported by Dan's script
>>       v3: fix rebase fallout, num_pages should be 32bit
>>
>>       Signed-off-by: Christian König <christian.koenig at amd.com>
>>       Tested-by: Nirmoy Das <nirmoy.das at amd.com>
>>       Reviewed-by: Huang Rui <ray.huang at amd.com>
>>       Reviewed-by: Matthew Auld <matthew.auld at intel.com>
>>       Link: https://patchwork.freedesktop.org/patch/424009/
>>
>> Regards,
>>     Felix
>>
>>
>> On 2021-05-18 10:28 p.m., xinhui pan wrote:
>>> cpu 1                                           cpu 2
>>> kfd alloc BO A(userptr)                         alloc BO B(GTT)
>>>       ->init -> validate                               -> init ->
>>> validate -> populate
>>>       init_user_pages                            -> swapout BO A
>>> //hit ttm pages limit
>>>        -> get_user_pages (fill up ttm->pages)
>>>         -> validate -> populate
>>>             -> swapin BO A // Now hit the BUG
>>>
>>> We know that get_user_pages may race with swapout on same BO.
>>> Threre are some issues I have met.
>>> 1) memory corruption.
>>> This is because we do a swap before memory is setup. ttm_tt_swapout()
>>> just create a swap_storage with its content being 0x0. So when we setup
>>> memory after the swapout. The following swapin makes the memory
>>> corrupted.
>>>
>>> 2) panic
>>> When swapout happes with get_user_pages, they touch ttm->pages without
>>> anylock. It causes memory corruption too. But I hit page fault mostly.
>>>
>>> Signed-off-by: xinhui pan <xinhui.pan at amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 16
>>> +++++++++++++++-
>>>    1 file changed, 15 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> index 928e8d57cd08..42460e4480f8 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> @@ -835,6 +835,7 @@ static int init_user_pages(struct kgd_mem *mem,
>>> uint64_t user_addr)
>>>        struct amdkfd_process_info *process_info = mem->process_info;
>>>        struct amdgpu_bo *bo = mem->bo;
>>>        struct ttm_operation_ctx ctx = { true, false };
>>> +     struct page **pages;
>>>        int ret = 0;
>>>
>>>        mutex_lock(&process_info->lock);
>>> @@ -852,7 +853,13 @@ static int init_user_pages(struct kgd_mem *mem,
>>> uint64_t user_addr)
>>>                goto out;
>>>        }
>>>
>>> -     ret = amdgpu_ttm_tt_get_user_pages(bo, bo->tbo.ttm->pages);
>>> +     pages = kvmalloc_array(bo->tbo.ttm->num_pages,
>>> +                     sizeof(struct page *),
>>> +                     GFP_KERNEL | __GFP_ZERO);
>>> +     if (!pages)
>>> +             goto unregister_out;
>>> +
>>> +     ret = amdgpu_ttm_tt_get_user_pages(bo, pages);
>>>        if (ret) {
>>>                pr_err("%s: Failed to get user pages: %d\n",
>>> __func__, ret);
>>>                goto unregister_out;
>>> @@ -863,6 +870,12 @@ static int init_user_pages(struct kgd_mem *mem,
>>> uint64_t user_addr)
>>>                pr_err("%s: Failed to reserve BO\n", __func__);
>>>                goto release_out;
>>>        }
>>> +
>>> +     WARN_ON_ONCE(bo->tbo.ttm->page_flags & TTM_PAGE_FLAG_SWAPPED);
>>> +
>>> +     memcpy(bo->tbo.ttm->pages,
>>> +                     pages,
>>> +                     sizeof(struct page*) * bo->tbo.ttm->num_pages);
>>>        amdgpu_bo_placement_from_domain(bo, mem->domain);
>>>        ret = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
>>>        if (ret)
>>> @@ -872,6 +885,7 @@ static int init_user_pages(struct kgd_mem *mem,
>>> uint64_t user_addr)
>>>    release_out:
>>>        amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm);
>>>    unregister_out:
>>> +     kvfree(pages);
>>>        if (ret)
>>>                amdgpu_mn_unregister(bo);
>>>    out:
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>


More information about the dri-devel mailing list