[PATCH] drm/amdkfd: Fix partial migrate issue
Deng, Emily
Emily.Deng at amd.com
Wed Jan 8 00:31:23 UTC 2025
From: Yang, Philip <Philip.Yang at amd.com>
Sent: Tuesday, January 7, 2025 10:04 PM
To: Deng, Emily <Emily.Deng at amd.com>; Yang, Philip <Philip.Yang at amd.com>; amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH] drm/amdkfd: Fix partial migrate issue
On 2025-01-06 21:31, Deng, Emily wrote:
From: Yang, Philip <Philip.Yang at amd.com>
Sent: Tuesday, January 7, 2025 7:51 AM
To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH] drm/amdkfd: Fix partial migrate issue
On 2025-01-02 19:06, Emily Deng wrote:
For a partial migration from RAM to VRAM, migrate->cpages is not equal to
migrate->npages, so migrate->npages must be used to walk all of the pages
that may need migration and check whether each one can be copied.
Also, only the pages that can actually be migrated should be set in
migrate->dst[i]; otherwise migrate_vma_pages() will migrate the wrong pages
based on migrate->dst[i].
Signed-off-by: Emily Deng <Emily.Deng at amd.com>
---
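A note on the migrate_vma contract this relies on, as I read include/linux/migrate.h: migrate->npages covers the whole requested range, while migrate->cpages only counts the entries that migrate_vma_setup() actually collected for migration. A minimal sketch of the intended loop shape (not the patch itself):

    for (i = 0; i < migrate->npages; i++) {
            /* not collected by migrate_vma_setup(): leave
             * migrate->dst[i] == 0 so migrate_vma_pages() skips it
             */
            if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
                    continue;

            /* only the migrate->cpages collected entries get a dst pfn */
    }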
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 4b275937d05e..5c96c2d425e3 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -278,7 +278,7 @@ svm_migrate_copy_to_vram(struct kfd_node *node, struct svm_range *prange,
struct migrate_vma *migrate, struct dma_fence **mfence,
dma_addr_t *scratch, uint64_t ttm_res_offset)
{
- uint64_t npages = migrate->cpages;
+ uint64_t npages = migrate->npages;
As the partial migration size is based on the prange granularity, 2MB by default, migrate->cpages may always equal migrate->npages; that is why we never triggered this bug. I am wondering how you caught it. This bug also leaks the svm_bo, as the svm_migrate_get_vram_page reference count becomes incorrect.
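For reference, svm_migrate_get_vram_page() is roughly the following (from my reading of kfd_migrate.c); the svm_bo reference it takes is only dropped when the device page is freed again:

    static void
    svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn)
    {
            struct page *page;

            page = pfn_to_page(pfn);
            /* dropped again via svm_migrate_page_free() when the
             * device page is released
             */
            svm_range_bo_ref(prange->svm_bo);
            page->zone_device_data = prange->svm_bo;
            zone_device_page_init(page);
    }

So, as I understand it, calling this for dst slots that are never actually migrated is what leaves the svm_bo reference count unbalanced.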
struct amdgpu_device *adev = node->adev;
struct device *dev = adev->dev;
struct amdgpu_res_cursor cursor;
@@ -299,9 +299,6 @@ svm_migrate_copy_to_vram(struct kfd_node *node, struct svm_range *prange,
struct page *spage;
dst[i] = cursor.start + (j << PAGE_SHIFT);
- migrate->dst[i] = svm_migrate_addr_to_pfn(adev, dst[i]);
- svm_migrate_get_vram_page(prange, migrate->dst[i]);
- migrate->dst[i] = migrate_pfn(migrate->dst[i]);
spage = migrate_pfn_to_page(migrate->src[i]);
if (spage && !is_zone_device_page(spage)) {
@@ -345,6 +342,9 @@ svm_migrate_copy_to_vram(struct kfd_node *node, struct svm_range *prange,
} else {
j++;
}
+ migrate->dst[i] = svm_migrate_addr_to_pfn(adev, dst[i]);
+ svm_migrate_get_vram_page(prange, migrate->dst[i]);
+ migrate->dst[i] = migrate_pfn(migrate->dst[i]);
This should be moved forward, to handle the corner case of migrating 1 page to the last page of the VRAM res cursor.
Sorry, I don't understand what the corner case is. The place you suggest is not much different from mine, since there is no jump or continue in the code between your placement and mine. But it does have "if (r) goto out_free_vram_pages;" to handle the error case, and for the error case I think migrate->dst[i] also needs to stay null.
migrate->dst[i] should be set before this if condition (not after it), to set up the sdma copy from migrate->src to migrate->dst when the condition is true, i.e. when migrating to the last page of the VRAM res cursor, because the VRAM physical address is not contiguous:
if (j >= (cursor.size >> PAGE_SHIFT) - 1 && i < npages - 1) {
Regards,
Philip
Maybe I am wrong, but I couldn't find anywhere that svm_migrate_copy_memory_gart() uses migrate->dst[i]; it uses dst[i] directly, so I think it has no effect there. The reason I put it at the tail is that if an exception occurs, for example "if (r) goto out_free_vram_pages;", the copy has to be treated as failed and those pages cannot be migrated, so migrate->dst[i] must not be set either.
Emily Deng
Best Wishes
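To make the placement question concrete, the relevant part of the loop, as I read it, is roughly:

    for (i = j = 0; i < npages; i++) {
            dst[i] = cursor.start + (j << PAGE_SHIFT);
            /* Philip's placement: publish migrate->dst[i] here, before the
             * chunk-boundary copy below, which already consumes index i
             */

            spage = migrate_pfn_to_page(migrate->src[i]);
            /* dma-map spage into src[i], or copy what we have and continue */

            if (j >= (cursor.size >> PAGE_SHIFT) - 1 && i < npages - 1) {
                    /* end of this VRAM res cursor chunk (VRAM is not
                     * physically contiguous): copy pages [i - j, i] now
                     */
                    r = svm_migrate_copy_memory_gart(adev, src + i - j,
                                                     dst + i - j, j + 1,
                                                     FROM_RAM_TO_VRAM, mfence);
                    if (r)
                            goto out_free_vram_pages;
                    ...
            }
            /* tail placement: publish migrate->dst[i] only once the copy of
             * this page can no longer fail, so the error path never leaves a
             * dst pfn (and svm_bo reference) behind for an unmigrated page
             */
    }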
Please check this change, which adds mpages accounting to break out of the loop earlier.
Good suggestion, I will add mpages to the change.
Emily Deng
Best Wishes
- uint64_t npages = migrate->cpages;
+ uint64_t npages = migrate->npages;
struct amdgpu_device *adev = node->adev;
struct device *dev = adev->dev;
struct amdgpu_res_cursor cursor;
+ uint64_t mpages = 0;
dma_addr_t *src;
uint64_t *dst;
uint64_t i, j;
@@ -295,14 +296,9 @@ svm_migrate_copy_to_vram(struct kfd_node *node, struct svm_range *prange,
amdgpu_res_first(prange->ttm_res, ttm_res_offset,
npages << PAGE_SHIFT, &cursor);
- for (i = j = 0; i < npages; i++) {
+ for (i = j = 0; i < npages && mpages < migrate->cpages; i++) {
struct page *spage;
- dst[i] = cursor.start + (j << PAGE_SHIFT);
- migrate->dst[i] = svm_migrate_addr_to_pfn(adev, dst[i]);
- svm_migrate_get_vram_page(prange, migrate->dst[i]);
- migrate->dst[i] = migrate_pfn(migrate->dst[i]);
-
spage = migrate_pfn_to_page(migrate->src[i]);
if (spage && !is_zone_device_page(spage)) {
src[i] = dma_map_page(dev, spage, 0, PAGE_SIZE,
@@ -322,6 +318,7 @@ svm_migrate_copy_to_vram(struct kfd_node *node, struct svm_range *prange,
mfence);
if (r)
goto out_free_vram_pages;
+ mpages += j;
amdgpu_res_next(&cursor, (j + 1) << PAGE_SHIFT);
j = 0;
} else {
@@ -333,6 +330,11 @@ svm_migrate_copy_to_vram(struct kfd_node *node, struct svm_range *prange,
pr_debug_ratelimited("dma mapping src to 0x%llx, pfn 0x%lx\n",
src[i] >> PAGE_SHIFT, page_to_pfn(spage));
+ dst[i] = cursor.start + (j << PAGE_SHIFT);
+ migrate->dst[i] = svm_migrate_addr_to_pfn(adev, dst[i]);
+ svm_migrate_get_vram_page(prange, migrate->dst[i]);
+ migrate->dst[i] = migrate_pfn(migrate->dst[i]);
+
if (j >= (cursor.size >> PAGE_SHIFT) - 1 && i < npages - 1) {
r = svm_migrate_copy_memory_gart(adev, src + i - j,
dst + i - j, j + 1,
@@ -340,6 +342,7 @@ svm_migrate_copy_to_vram(struct kfd_node *node, struct svm_range *prange,
mfence);
if (r)
goto out_free_vram_pages;
+ mpages += j + 1;
amdgpu_res_next(&cursor, (j + 1) * PAGE_SIZE);
j = 0;
} else {
}
r = svm_migrate_copy_memory_gart(adev, src + i - j, dst + i - j, j,
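One more note on why leaving migrate->dst[i] at zero is safe for the pages we skip: as I read migrate_vma_pages() (mm/migrate_device.c), a zero dst entry simply clears MIGRATE_PFN_MIGRATE for that page, so it stays in system memory instead of being migrated to a wrong destination:

    /* rough excerpt, my reading of mm/migrate_device.c */
    newpage = migrate_pfn_to_page(dst_pfns[i]);
    if (!newpage) {
            src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
            continue;
    }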