Swapping SG BOs makes no sense, because TTM doesn't own the pages of this type of BO.
Last I checked, userptr BOs (and other SG BOs) were protected from swapout by the fact that they would not be added to the swap-LRU. But it looks like Christian just removed the swap-LRU. I guess this broke that protection:
commit 2cb51d22d70b18eaf339abf9758bf0b7608da65c Author: Christian König christian.koenig@amd.com Date: Tue Oct 6 16:30:09 2020 +0200
drm/ttm: remove swap LRU v3
Instead evict round robin from each devices SYSTEM and TT domain.
v2: reorder num_pages access reported by Dan's script v3: fix rebase fallout, num_pages should be 32bit
Signed-off-by: Christian König christian.koenig@amd.com Tested-by: Nirmoy Das nirmoy.das@amd.com Reviewed-by: Huang Rui ray.huang@amd.com Reviewed-by: Matthew Auld matthew.auld@intel.com Link: https://patchwork.freedesktop.org/patch/424009/
Regards, Felix
On 2021-05-18 10:28 p.m., xinhui pan wrote:
cpu 1 cpu 2 kfd alloc BO A(userptr) alloc BO B(GTT) ->init -> validate -> init -> validate -> populate init_user_pages -> swapout BO A //hit ttm pages limit -> get_user_pages (fill up ttm->pages) -> validate -> populate -> swapin BO A // Now hit the BUG
We know that get_user_pages may race with swapout on same BO. Threre are some issues I have met.
- memory corruption.
This is because we do a swap before memory is setup. ttm_tt_swapout() just create a swap_storage with its content being 0x0. So when we setup memory after the swapout. The following swapin makes the memory corrupted.
- panic
When swapout happes with get_user_pages, they touch ttm->pages without anylock. It causes memory corruption too. But I hit page fault mostly.
Signed-off-by: xinhui pan xinhui.pan@amd.com
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index 928e8d57cd08..42460e4480f8 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -835,6 +835,7 @@ static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr) struct amdkfd_process_info *process_info = mem->process_info; struct amdgpu_bo *bo = mem->bo; struct ttm_operation_ctx ctx = { true, false };
struct page **pages; int ret = 0;
mutex_lock(&process_info->lock);
@@ -852,7 +853,13 @@ static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr) goto out; }
- ret = amdgpu_ttm_tt_get_user_pages(bo, bo->tbo.ttm->pages);
- pages = kvmalloc_array(bo->tbo.ttm->num_pages,
sizeof(struct page *),
GFP_KERNEL | __GFP_ZERO);
- if (!pages)
goto unregister_out;
- ret = amdgpu_ttm_tt_get_user_pages(bo, pages); if (ret) { pr_err("%s: Failed to get user pages: %d\n", __func__, ret); goto unregister_out;
@@ -863,6 +870,12 @@ static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr) pr_err("%s: Failed to reserve BO\n", __func__); goto release_out; }
- WARN_ON_ONCE(bo->tbo.ttm->page_flags & TTM_PAGE_FLAG_SWAPPED);
- memcpy(bo->tbo.ttm->pages,
pages,
amdgpu_bo_placement_from_domain(bo, mem->domain); ret = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx); if (ret)sizeof(struct page*) * bo->tbo.ttm->num_pages);
@@ -872,6 +885,7 @@ static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr) release_out: amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm); unregister_out:
- kvfree(pages); if (ret) amdgpu_mn_unregister(bo); out: