[PATCH v4] drm/amdkfd: Fix partial migrate issue
Chen, Xiaogang
xiaogang.chen at amd.com
Mon Jan 13 23:36:25 UTC 2025
On 1/12/2025 8:02 PM, Deng, Emily wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>> -----Original Message-----
>> From: Chen, Xiaogang <Xiaogang.Chen at amd.com>
>> Sent: Saturday, January 11, 2025 2:13 AM
>> To: Yang, Philip <Philip.Yang at amd.com>; Deng, Emily <Emily.Deng at amd.com>;
>> amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH v4] drm/amdkfd: Fix partial migrate issue
>>
>>
>> On 1/10/2025 11:33 AM, Philip Yang wrote:
>>>
>>> On 2025-01-10 11:23, Chen, Xiaogang wrote:
>>>> On 1/10/2025 8:37 AM, Philip Yang wrote:
>>>>>
>>>>> On 2025-01-10 02:49, Emily Deng wrote:
>>>>>> For partial migrate from ram to vram, the migrate->cpages is not
>>>>>> equal to migrate->npages, should use migrate->npages to check all
>>>>>> needed migrate pages which could be copied or not.
>>>>>>
>>>>>> And only need to set those pages could be migrated to
>>>>>> migrate->dst[i], or
>>>>>> the migrate_vma_pages will migrate the wrong pages based on the
>>>>>> migrate->dst[i].
>>>>>>
>>>>>> v2:
>>>>>> Add mpages to break the loop earlier.
>>>>>>
>>>>>> v3:
>>>>>> Uses MIGRATE_PFN_MIGRATE to identify whether page could be migrated.
>>>>> The error handling need below change, with that fixed, this patch is
>>>>>
>>>>> Reviewed-by: Philip Yang<Philip.Yang at amd.com>
>>>>>
>>>>>> Signed-off-by: Emily Deng<Emily.Deng at amd.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17 ++++++++++-------
>>>>>> 1 file changed, 10 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>>>>> index 4b275937d05e..bfaccabeb3a0 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>>>>> @@ -278,10 +278,11 @@ svm_migrate_copy_to_vram(struct kfd_node
>>>>>> *node, struct svm_range *prange,
>>>>>> struct migrate_vma *migrate, struct dma_fence
>>>>>> **mfence,
>>>>>> dma_addr_t *scratch, uint64_t ttm_res_offset)
>>>>>> {
>>>>>> - uint64_t npages = migrate->cpages;
>>>>>> + uint64_t npages = migrate->npages;
>>>>>> struct amdgpu_device *adev = node->adev;
>>>>>> struct device *dev = adev->dev;
>>>>>> struct amdgpu_res_cursor cursor;
>>>>>> + uint64_t mpages = 0;
>>>>>> dma_addr_t *src;
>>>>>> uint64_t *dst;
>>>>>> uint64_t i, j;
>>>>>> @@ -295,14 +296,16 @@ svm_migrate_copy_to_vram(struct kfd_node
>>>>>> *node, struct svm_range *prange,
>>>>>> amdgpu_res_first(prange->ttm_res, ttm_res_offset,
>>>>>> npages << PAGE_SHIFT, &cursor);
>>>>>> - for (i = j = 0; i < npages; i++) {
>>>>>> + for (i = j = 0; (i < npages) && (mpages < migrate->cpages);
>>>>>> i++) {
>>>>>> struct page *spage;
>>>>>> - dst[i] = cursor.start + (j << PAGE_SHIFT);
>>>>>> - migrate->dst[i] = svm_migrate_addr_to_pfn(adev, dst[i]);
>>>>>> - svm_migrate_get_vram_page(prange, migrate->dst[i]);
>>>>>> - migrate->dst[i] = migrate_pfn(migrate->dst[i]);
>>>>>> -
>>>>>> + if (migrate->src[i] & MIGRATE_PFN_MIGRATE) {
>>>>>> + dst[i] = cursor.start + (j << PAGE_SHIFT);
>>>>>> + migrate->dst[i] = svm_migrate_addr_to_pfn(adev,
>>>>>> +dst[i]);
>>>>>> + svm_migrate_get_vram_page(prange, migrate->dst[i]);
>>>>>> + migrate->dst[i] = migrate_pfn(migrate->dst[i]);
>>>>>> + mpages++;
>>>>>> + }
>>>>>> spage = migrate_pfn_to_page(migrate->src[i]);
>>>>>> if (spage && !is_zone_device_page(spage)) {
>>>>>> src[i] = dma_map_page(dev, spage, 0, PAGE_SIZE,
>>>>> out_free_vram_pages:
>>>>> if (r) {
>>>>> pr_debug("failed %d to copy memory to vram\n", r);
>>>>> - while (i--) {
>>>>> +
>>>>> + for (i = 0; i < npages && mpages; i++) {
>>>>> + if (!dst[i])
>>>>> + continue;
>>>>> svm_migrate_put_vram_page(adev, dst[i]);
>>>>> migrate->dst[i] = 0;
>>>>> + mpages--;
>>>>> }
>>>>> }
>>>> This error handing not need recover all vram pages as error happened
>>>> at middle. Can use se do {....} while(i--);
>>> no, for example migrate npage=cpage=4, and outside for loop,
>>> svm_migrate_copy_memory_gart failed, dst[4] is out of range access.
>> You are right, but it is awkward. Inside loop we update dst[i] before do sdam that does
>> not include the dst[i] just updated.
> As the dst[i] is already initialized to NULL, so I think it doesn't matter if use follow:
> while (i--) {
> if (dst[i])
> svm_migrate_put_vram_page(adev, dst[i]);
> migrate->dst[i] = 0;
> }
The issue at error recovery path is due to we handled sys ram(src) and
dst(vram) discontinuation in different way. When src got discontinuation
we migrate j pages that page ith is not migrated. When dst(vram) got
discontinuation we migrate j+1 pages that page ith is migrated. That
cause error path has to iterate all pages to find which page got
migrated before error happened. Also makes code more difficult to read.
Besides your change I think we can also change inside loop that handle
src and dst discontinuation in consistent way.
> Emily Deng
> Best Wishes
>
>
>> Regard
>>
>> Xiaogang
>>
>>>> Regards
>>>>
>>>> Xiaogang
>>>>
More information about the amd-gfx
mailing list