[PATCH 1/1] drm/amdkfd: Fix partial migration bugs

Felix Kuehling felix.kuehling at amd.com
Fri Jun 3 14:37:31 UTC 2022


On 2022-06-03 at 09:59, Philip Yang wrote:
> When migrating a range from system memory to VRAM, if a system page
> cannot be locked or unmapped, we do a partial migration and leave some
> pages in system memory. Several bugs were found in copying pages and
> updating the GPU mapping for this situation:
>
> 1. The copy to VRAM should use migrate->npages, the total number of
> pages in the range, as npages, not migrate->cpages, the number of
> pages that can be migrated.
>
> 2. After a partial copy, advance the VRAM res cursor by j + 1 pages:
> j is the number of system pages copied, plus 1 page to skip.
>
> 3. The copy to RAM should collect all contiguous VRAM pages and copy
> them together.
>
> 4. When calling amdgpu_vm_update_range, pass the offset in bytes, not
> as a number of pages.

Thanks for catching all these. It's unfortunate that partial migrations 
are currently so hard to trigger deliberately, which makes this code 
path hard to test. Currently it is only exercised when page locking 
fails due to race conditions.

Xiaogang is working on a change to use partial migration when migrating 
large ranges in GPU page faults. Heads up, Xiaogang, you need this 
patch. And once your changes are ready it should become possible to 
write a KFDTest that checks data integrity with partial migrations.

One comment inline.


>
> Signed-off-by: Philip Yang <Philip.Yang at amd.com>
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 6 +++---
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c     | 2 +-
>   2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> index 43cd47723946..d03135a1f7e6 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> @@ -296,7 +296,7 @@ svm_migrate_copy_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
>   			 struct migrate_vma *migrate, struct dma_fence **mfence,
>   			 dma_addr_t *scratch)
>   {
> -	uint64_t npages = migrate->cpages;
> +	uint64_t npages = migrate->npages;
>   	struct device *dev = adev->dev;
>   	struct amdgpu_res_cursor cursor;
>   	dma_addr_t *src;
> @@ -344,7 +344,7 @@ svm_migrate_copy_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
>   						mfence);
>   				if (r)
>   					goto out_free_vram_pages;
> -				amdgpu_res_next(&cursor, j << PAGE_SHIFT);
> +				amdgpu_res_next(&cursor, (j + 1) << PAGE_SHIFT);
>   				j = 0;
>   			} else {
>   				amdgpu_res_next(&cursor, PAGE_SIZE);
> @@ -590,7 +590,7 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
>   			continue;
>   		}
>   		src[i] = svm_migrate_addr(adev, spage);
> -		if (i > 0 && src[i] != src[i - 1] + PAGE_SIZE) {
> +		if (i > 0 && src[i] != src[i - 1] + PAGE_SIZE && j) {

I think the i > 0 condition is redundant now. You could just replace 
i > 0 with j > 0.

With that fixed, the patch is

Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>


>   			r = svm_migrate_copy_memory_gart(adev, dst + i - j,
>   							 src + i - j, j,
>   							 FROM_VRAM_TO_RAM,
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index 2ba3de0fb8aa..d6fc00d51c8c 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -1295,7 +1295,7 @@ svm_range_map_to_gpu(struct kfd_process_device *pdd, struct svm_range *prange,
>   		r = amdgpu_vm_update_range(adev, vm, false, false, flush_tlb, NULL,
>   					   last_start, prange->start + i,
>   					   pte_flags,
> -					   last_start - prange->start,
> +					   (last_start - prange->start) << PAGE_SHIFT,
>   					   bo_adev ? bo_adev->vm_manager.vram_base_offset : 0,
>   					   NULL, dma_addr, &vm->last_update);
>   

