回复: [RFC PATCH 1/2] drm/amdgpu: Fix memory corruption due to swapout and swapin

Pan, Xinhui Xinhui.Pan at amd.com
Wed May 19 04:09:28 UTC 2021


[AMD Official Use Only]

yes, we really dont swapout SG BOs.
The problems is that before we validate a userptr BO, we create this BO in CPU domain by default. So this BO has chance to swapout.

we set flag TTM_PAGE_FLAG_SG on userptr BO in popluate() which is too late.
I have not try to revert Chris' patch as I think it desnt help. Or I can have a try later.

________________________________________
发件人: Kuehling, Felix <Felix.Kuehling at amd.com>
发送时间: 2021年5月19日 11:29
收件人: Pan, Xinhui; amd-gfx at lists.freedesktop.org
抄送: Deucher, Alexander; Koenig, Christian; dri-devel at lists.freedesktop.org; daniel at ffwll.ch
主题: Re: [RFC PATCH 1/2] drm/amdgpu: Fix memory corruption due to swapout and swapin

Swapping SG BOs makes no sense, because TTM doesn't own the pages of
this type of BO.

Last I checked, userptr BOs (and other SG BOs) were protected from
swapout by the fact that they would not be added to the swap-LRU. But it
looks like Christian just removed the swap-LRU. I guess this broke that
protection:

commit 2cb51d22d70b18eaf339abf9758bf0b7608da65c
Author: Christian König <christian.koenig at amd.com>
Date:   Tue Oct 6 16:30:09 2020 +0200

     drm/ttm: remove swap LRU v3

     Instead evict round robin from each devices SYSTEM and TT domain.

     v2: reorder num_pages access reported by Dan's script
     v3: fix rebase fallout, num_pages should be 32bit

     Signed-off-by: Christian König <christian.koenig at amd.com>
     Tested-by: Nirmoy Das <nirmoy.das at amd.com>
     Reviewed-by: Huang Rui <ray.huang at amd.com>
     Reviewed-by: Matthew Auld <matthew.auld at intel.com>
     Link: https://patchwork.freedesktop.org/patch/424009/

Regards,
   Felix


On 2021-05-18 10:28 p.m., xinhui pan wrote:
> cpu 1                                           cpu 2
> kfd alloc BO A(userptr)                         alloc BO B(GTT)
>      ->init -> validate                               -> init -> validate -> populate
>      init_user_pages                            -> swapout BO A //hit ttm pages limit
>       -> get_user_pages (fill up ttm->pages)
>        -> validate -> populate
>            -> swapin BO A // Now hit the BUG
>
> We know that get_user_pages may race with swapout on same BO.
> Threre are some issues I have met.
> 1) memory corruption.
> This is because we do a swap before memory is setup. ttm_tt_swapout()
> just create a swap_storage with its content being 0x0. So when we setup
> memory after the swapout. The following swapin makes the memory
> corrupted.
>
> 2) panic
> When swapout happes with get_user_pages, they touch ttm->pages without
> anylock. It causes memory corruption too. But I hit page fault mostly.
>
> Signed-off-by: xinhui pan <xinhui.pan at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 16 +++++++++++++++-
>   1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index 928e8d57cd08..42460e4480f8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -835,6 +835,7 @@ static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr)
>       struct amdkfd_process_info *process_info = mem->process_info;
>       struct amdgpu_bo *bo = mem->bo;
>       struct ttm_operation_ctx ctx = { true, false };
> +     struct page **pages;
>       int ret = 0;
>
>       mutex_lock(&process_info->lock);
> @@ -852,7 +853,13 @@ static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr)
>               goto out;
>       }
>
> -     ret = amdgpu_ttm_tt_get_user_pages(bo, bo->tbo.ttm->pages);
> +     pages = kvmalloc_array(bo->tbo.ttm->num_pages,
> +                     sizeof(struct page *),
> +                     GFP_KERNEL | __GFP_ZERO);
> +     if (!pages)
> +             goto unregister_out;
> +
> +     ret = amdgpu_ttm_tt_get_user_pages(bo, pages);
>       if (ret) {
>               pr_err("%s: Failed to get user pages: %d\n", __func__, ret);
>               goto unregister_out;
> @@ -863,6 +870,12 @@ static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr)
>               pr_err("%s: Failed to reserve BO\n", __func__);
>               goto release_out;
>       }
> +
> +     WARN_ON_ONCE(bo->tbo.ttm->page_flags & TTM_PAGE_FLAG_SWAPPED);
> +
> +     memcpy(bo->tbo.ttm->pages,
> +                     pages,
> +                     sizeof(struct page*) * bo->tbo.ttm->num_pages);
>       amdgpu_bo_placement_from_domain(bo, mem->domain);
>       ret = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
>       if (ret)
> @@ -872,6 +885,7 @@ static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr)
>   release_out:
>       amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm);
>   unregister_out:
> +     kvfree(pages);
>       if (ret)
>               amdgpu_mn_unregister(bo);
>   out:


More information about the dri-devel mailing list