Re: 回复: 回复: [RFC PATCH 1/2] drm/amdgpu: Fix memory corruption due to swapout and swapin

Christian König ckoenig.leichtzumerken at gmail.com
Thu May 20 06:38:20 UTC 2021


The problem is that ttm_bo_type_sg doesn't allocate a page array for the 
TT object.

Christian.

Am 20.05.21 um 04:58 schrieb Pan, Xinhui:
> [AMD Official Use Only]
>
> I am not sure if we can create a ttm_bo_type_sg bo for userptr. But I have another idea now. we can use flag AMDGPU_AMDKFD_CREATE_USERPTR_BO to create the userptr bo.
> ________________________________________
> 发件人: Kuehling, Felix <Felix.Kuehling at amd.com>
> 发送时间: 2021年5月19日 23:11
> 收件人: Christian König; Pan, Xinhui; amd-gfx at lists.freedesktop.org
> 抄送: Deucher, Alexander; daniel at ffwll.ch; Koenig, Christian; dri-devel at lists.freedesktop.org
> 主题: Re: 回复: [RFC PATCH 1/2] drm/amdgpu: Fix memory corruption due to swapout and swapin
>
> Looks like we're creating the userptr BO as ttm_bo_type_device. I guess
> we should be using ttm_bo_type_sg? BTW, amdgpu_gem_userptr_ioctl also
> uses ttm_bo_type_device.
>
> Regards,
>    Felix
>
>
> Am 2021-05-19 um 6:01 a.m. schrieb Christian König:
>> I'm scratching my head how that is even possible.
>>
>> See when a BO is created in the system domain it is just an empty
>> hull, e.g. without backing store and allocated pages.
>>
>> So the swapout function will just ignore it.
>>
>> Christian.
>>
>> Am 19.05.21 um 07:07 schrieb Pan, Xinhui:
>>> [AMD Official Use Only]
>>>
>>> I have reverted Chris'  patch, still hit this failure.
>>> Just see two lines in Chris' patch. Any BO in cpu domian would be
>>> swapout first. That is why we hit this issue frequently now. But the
>>> bug is there long time ago.
>>>
>>> -       for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i) {
>>> -               list_for_each_entry(bo, &glob->swap_lru[i], swap) {
>>> [snip]
>>> +       for (i = TTM_PL_SYSTEM; i < TTM_NUM_MEM_TYPES; ++i) {
>>> +               for (j = 0; j < TTM_MAX_BO_PRIORITY; ++j) {
>>>
>>>
>>> ________________________________________
>>> 发件人: Pan, Xinhui <Xinhui.Pan at amd.com>
>>> 发送时间: 2021年5月19日 12:09
>>> 收件人: Kuehling, Felix; amd-gfx at lists.freedesktop.org
>>> 抄送: Deucher, Alexander; Koenig, Christian;
>>> dri-devel at lists.freedesktop.org; daniel at ffwll.ch
>>> 主题: 回复: [RFC PATCH 1/2] drm/amdgpu: Fix memory corruption due to
>>> swapout and swapin
>>>
>>> yes, we really dont swapout SG BOs.
>>> The problems is that before we validate a userptr BO, we create this
>>> BO in CPU domain by default. So this BO has chance to swapout.
>>>
>>> we set flag TTM_PAGE_FLAG_SG on userptr BO in popluate() which is too
>>> late.
>>> I have not try to revert Chris' patch as I think it desnt help. Or I
>>> can have a try later.
>>>
>>> ________________________________________
>>> 发件人: Kuehling, Felix <Felix.Kuehling at amd.com>
>>> 发送时间: 2021年5月19日 11:29
>>> 收件人: Pan, Xinhui; amd-gfx at lists.freedesktop.org
>>> 抄送: Deucher, Alexander; Koenig, Christian;
>>> dri-devel at lists.freedesktop.org; daniel at ffwll.ch
>>> 主题: Re: [RFC PATCH 1/2] drm/amdgpu: Fix memory corruption due to
>>> swapout and swapin
>>>
>>> Swapping SG BOs makes no sense, because TTM doesn't own the pages of
>>> this type of BO.
>>>
>>> Last I checked, userptr BOs (and other SG BOs) were protected from
>>> swapout by the fact that they would not be added to the swap-LRU. But it
>>> looks like Christian just removed the swap-LRU. I guess this broke that
>>> protection:
>>>
>>> commit 2cb51d22d70b18eaf339abf9758bf0b7608da65c
>>> Author: Christian König <christian.koenig at amd.com>
>>> Date:   Tue Oct 6 16:30:09 2020 +0200
>>>
>>>        drm/ttm: remove swap LRU v3
>>>
>>>        Instead evict round robin from each devices SYSTEM and TT domain.
>>>
>>>        v2: reorder num_pages access reported by Dan's script
>>>        v3: fix rebase fallout, num_pages should be 32bit
>>>
>>>        Signed-off-by: Christian König <christian.koenig at amd.com>
>>>        Tested-by: Nirmoy Das <nirmoy.das at amd.com>
>>>        Reviewed-by: Huang Rui <ray.huang at amd.com>
>>>        Reviewed-by: Matthew Auld <matthew.auld at intel.com>
>>>        Link: https://patchwork.freedesktop.org/patch/424009/
>>>
>>> Regards,
>>>      Felix
>>>
>>>
>>> On 2021-05-18 10:28 p.m., xinhui pan wrote:
>>>> cpu 1                                           cpu 2
>>>> kfd alloc BO A(userptr)                         alloc BO B(GTT)
>>>>        ->init -> validate                               -> init ->
>>>> validate -> populate
>>>>        init_user_pages                            -> swapout BO A
>>>> //hit ttm pages limit
>>>>         -> get_user_pages (fill up ttm->pages)
>>>>          -> validate -> populate
>>>>              -> swapin BO A // Now hit the BUG
>>>>
>>>> We know that get_user_pages may race with swapout on same BO.
>>>> Threre are some issues I have met.
>>>> 1) memory corruption.
>>>> This is because we do a swap before memory is setup. ttm_tt_swapout()
>>>> just create a swap_storage with its content being 0x0. So when we setup
>>>> memory after the swapout. The following swapin makes the memory
>>>> corrupted.
>>>>
>>>> 2) panic
>>>> When swapout happes with get_user_pages, they touch ttm->pages without
>>>> anylock. It causes memory corruption too. But I hit page fault mostly.
>>>>
>>>> Signed-off-by: xinhui pan <xinhui.pan at amd.com>
>>>> ---
>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 16
>>>> +++++++++++++++-
>>>>     1 file changed, 15 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> index 928e8d57cd08..42460e4480f8 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> @@ -835,6 +835,7 @@ static int init_user_pages(struct kgd_mem *mem,
>>>> uint64_t user_addr)
>>>>         struct amdkfd_process_info *process_info = mem->process_info;
>>>>         struct amdgpu_bo *bo = mem->bo;
>>>>         struct ttm_operation_ctx ctx = { true, false };
>>>> +     struct page **pages;
>>>>         int ret = 0;
>>>>
>>>>         mutex_lock(&process_info->lock);
>>>> @@ -852,7 +853,13 @@ static int init_user_pages(struct kgd_mem *mem,
>>>> uint64_t user_addr)
>>>>                 goto out;
>>>>         }
>>>>
>>>> -     ret = amdgpu_ttm_tt_get_user_pages(bo, bo->tbo.ttm->pages);
>>>> +     pages = kvmalloc_array(bo->tbo.ttm->num_pages,
>>>> +                     sizeof(struct page *),
>>>> +                     GFP_KERNEL | __GFP_ZERO);
>>>> +     if (!pages)
>>>> +             goto unregister_out;
>>>> +
>>>> +     ret = amdgpu_ttm_tt_get_user_pages(bo, pages);
>>>>         if (ret) {
>>>>                 pr_err("%s: Failed to get user pages: %d\n",
>>>> __func__, ret);
>>>>                 goto unregister_out;
>>>> @@ -863,6 +870,12 @@ static int init_user_pages(struct kgd_mem *mem,
>>>> uint64_t user_addr)
>>>>                 pr_err("%s: Failed to reserve BO\n", __func__);
>>>>                 goto release_out;
>>>>         }
>>>> +
>>>> +     WARN_ON_ONCE(bo->tbo.ttm->page_flags & TTM_PAGE_FLAG_SWAPPED);
>>>> +
>>>> +     memcpy(bo->tbo.ttm->pages,
>>>> +                     pages,
>>>> +                     sizeof(struct page*) * bo->tbo.ttm->num_pages);
>>>>         amdgpu_bo_placement_from_domain(bo, mem->domain);
>>>>         ret = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
>>>>         if (ret)
>>>> @@ -872,6 +885,7 @@ static int init_user_pages(struct kgd_mem *mem,
>>>> uint64_t user_addr)
>>>>     release_out:
>>>>         amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm);
>>>>     unregister_out:
>>>> +     kvfree(pages);
>>>>         if (ret)
>>>>                 amdgpu_mn_unregister(bo);
>>>>     out:
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx



More information about the amd-gfx mailing list