[PATCH v4] drm/amdkfd: Fix partial migrate issue

Chen, Xiaogang xiaogang.chen at amd.com
Mon Jan 13 23:36:25 UTC 2025


On 1/12/2025 8:02 PM, Deng, Emily wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>> -----Original Message-----
>> From: Chen, Xiaogang <Xiaogang.Chen at amd.com>
>> Sent: Saturday, January 11, 2025 2:13 AM
>> To: Yang, Philip <Philip.Yang at amd.com>; Deng, Emily <Emily.Deng at amd.com>;
>> amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH v4] drm/amdkfd: Fix partial migrate issue
>>
>>
>> On 1/10/2025 11:33 AM, Philip Yang wrote:
>>>
>>> On 2025-01-10 11:23, Chen, Xiaogang wrote:
>>>> On 1/10/2025 8:37 AM, Philip Yang wrote:
>>>>>
>>>>> On 2025-01-10 02:49, Emily Deng wrote:
>>>>>> For partial migrate from ram to vram, the migrate->cpages is not
>>>>>> equal to migrate->npages, should use migrate->npages to check all
>>>>>> needed migrate pages which could be copied or not.
>>>>>>
>>>>>> And only need to set those pages could be migrated to
>>>>>> migrate->dst[i], or
>>>>>> the migrate_vma_pages will migrate the wrong pages based on the
>>>>>> migrate->dst[i].
>>>>>>
>>>>>> v2:
>>>>>> Add mpages to break the loop earlier.
>>>>>>
>>>>>> v3:
>>>>>> Uses MIGRATE_PFN_MIGRATE to identify whether page could be migrated.
>>>>> The error handling need below change, with that fixed, this patch is
>>>>>
>>>>> Reviewed-by: Philip Yang<Philip.Yang at amd.com>
>>>>>
>>>>>> Signed-off-by: Emily Deng<Emily.Deng at amd.com>
>>>>>> ---
>>>>>>    drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17 ++++++++++-------
>>>>>>    1 file changed, 10 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>>>>> index 4b275937d05e..bfaccabeb3a0 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>>>>> @@ -278,10 +278,11 @@ svm_migrate_copy_to_vram(struct kfd_node
>>>>>> *node, struct svm_range *prange,
>>>>>>                 struct migrate_vma *migrate, struct dma_fence
>>>>>> **mfence,
>>>>>>                 dma_addr_t *scratch, uint64_t ttm_res_offset)
>>>>>>    {
>>>>>> -    uint64_t npages = migrate->cpages;
>>>>>> +    uint64_t npages = migrate->npages;
>>>>>>        struct amdgpu_device *adev = node->adev;
>>>>>>        struct device *dev = adev->dev;
>>>>>>        struct amdgpu_res_cursor cursor;
>>>>>> +    uint64_t mpages = 0;
>>>>>>        dma_addr_t *src;
>>>>>>        uint64_t *dst;
>>>>>>        uint64_t i, j;
>>>>>> @@ -295,14 +296,16 @@ svm_migrate_copy_to_vram(struct kfd_node
>>>>>> *node, struct svm_range *prange,
>>>>>>          amdgpu_res_first(prange->ttm_res, ttm_res_offset,
>>>>>>                 npages << PAGE_SHIFT, &cursor);
>>>>>> -    for (i = j = 0; i < npages; i++) {
>>>>>> +    for (i = j = 0; (i < npages) && (mpages < migrate->cpages);
>>>>>> i++) {
>>>>>>            struct page *spage;
>>>>>>    -        dst[i] = cursor.start + (j << PAGE_SHIFT);
>>>>>> -        migrate->dst[i] = svm_migrate_addr_to_pfn(adev, dst[i]);
>>>>>> -        svm_migrate_get_vram_page(prange, migrate->dst[i]);
>>>>>> -        migrate->dst[i] = migrate_pfn(migrate->dst[i]);
>>>>>> -
>>>>>> +        if (migrate->src[i] & MIGRATE_PFN_MIGRATE) {
>>>>>> +            dst[i] = cursor.start + (j << PAGE_SHIFT);
>>>>>> +            migrate->dst[i] = svm_migrate_addr_to_pfn(adev,
>>>>>> +dst[i]);
>>>>>> +            svm_migrate_get_vram_page(prange, migrate->dst[i]);
>>>>>> +            migrate->dst[i] = migrate_pfn(migrate->dst[i]);
>>>>>> +            mpages++;
>>>>>> +        }
>>>>>>            spage = migrate_pfn_to_page(migrate->src[i]);
>>>>>>            if (spage && !is_zone_device_page(spage)) {
>>>>>>                src[i] = dma_map_page(dev, spage, 0, PAGE_SIZE,
>>>>>   out_free_vram_pages:
>>>>>          if (r) {
>>>>>                  pr_debug("failed %d to copy memory to vram\n", r);
>>>>> -               while (i--) {
>>>>> +
>>>>> +               for (i = 0; i < npages && mpages; i++) {
>>>>> +                       if (!dst[i])
>>>>> +                               continue;
>>>>>                          svm_migrate_put_vram_page(adev, dst[i]);
>>>>>                          migrate->dst[i] = 0;
>>>>> +                       mpages--;
>>>>>                  }
>>>>>          }
>>>> This error handing not need recover all vram pages as error happened
>>>> at middle. Can use se do {....} while(i--);
>>> no, for example migrate npage=cpage=4, and outside for loop,
>>> svm_migrate_copy_memory_gart failed, dst[4] is out of range access.
>> You are right, but it is awkward. Inside loop we update dst[i] before do sdam that does
>> not include the dst[i] just updated.
> As the dst[i] is already initialized to NULL, so I think it doesn't matter if use follow:
>                  while (i--) {
>                          if (dst[i])
>                                  svm_migrate_put_vram_page(adev, dst[i]);
>                          migrate->dst[i] = 0;
>                  }

The issue at error recovery path is due to we handled sys ram(src) and 
dst(vram) discontinuation in different way. When src got discontinuation 
we migrate j pages that page ith is not migrated. When dst(vram) got 
discontinuation we migrate j+1 pages that page ith is migrated. That 
cause error path has to iterate all pages to find which page got 
migrated before error happened. Also makes code more difficult to read.

Besides your change I think we can also change inside loop that handle 
src and dst discontinuation in consistent way.

> Emily Deng
> Best Wishes
>
>
>> Regard
>>
>> Xiaogang
>>
>>>> Regards
>>>>
>>>> Xiaogang
>>>>


More information about the amd-gfx mailing list