[PATCH v3] drm/amdkfd: Use partial migrations in GPU page faults

Felix Kuehling felix.kuehling at amd.com
Fri Sep 29 18:41:07 UTC 2023


On 2023-09-29 14:25, Chen, Xiaogang wrote:
>
> On 9/28/2023 4:26 PM, Felix Kuehling wrote:
>> On 2023-09-20 13:32, Xiaogang.Chen wrote:
>>> From: Xiaogang Chen <xiaogang.chen at amd.com>
>>>
>>> This patch implements partial migration in GPU page faults according
>>> to the migration granularity (default 2MB), and no longer splits the
>>> svm range in CPU page fault handling. An svm range may now include
>>> pages from both system RAM and the VRAM of one GPU.
>>> These changes are expected to improve migration performance and
>>> reduce MMU callback and TLB flush workloads.
>>>
>>> Signed-off-by: xiaogang chen<xiaogang.chen at amd.com>
>>
>> Some more nit-picks inline.
[snip]
>> @@ -202,12 +200,24 @@ svm_range_dma_map_dev(struct amdgpu_device 
>> *adev, struct svm_range *prange,
>>   static int
>>   svm_range_dma_map(struct svm_range *prange, unsigned long *bitmap,
>>             unsigned long offset, unsigned long npages,
>> -          unsigned long *hmm_pfns)
>> +          unsigned long *hmm_pfns, uint64_t *vram_pages)
>>   {
>>       struct kfd_process *p;
>>       uint32_t gpuidx;
>> +    struct page **page;
>>       int r;
>>   +    page = kvcalloc(npages, sizeof(*page), GFP_KERNEL);
>>
>> Is there a reason this needs to be 0-initialized? The loop below 
>> initializes all elements.
> Will use kvmalloc_array instead. This array does not need to be
> initialized during allocation as it will be set afterward.
>>
>> I'm also not happy about having to allocate this page array here. It 
>> may be justified if the repeated calls to hmm_pfn_to_page were 
>> expensive compared to the memory allocation and initialization. I'm 
>> not convinced that's the case, though. With CONFIG_SPARSEMEM_VMEMMAP, 
>> hmm_pfn_to_page basically boils down to __pfn_to_page, which is a 
>> macro that does just this:
>>
>> #define __pfn_to_page(pfn)    (vmemmap + (pfn))
>>
>> See 
>> https://elixir.bootlin.com/linux/v6.5.5/source/include/asm-generic/memory_model.h#L37.
>>
> We get the page* array here, so we do not need to look up the pages in
> svm_range_dma_map_dev for each device, since the pages are the same for
> all GPU devices. That saves the calls to hmm_pfn_to_page, even though
> the cost of calling hmm_pfn_to_page is low. Here we also want to get
> the number of VRAM pages during svm_range_dma_map.
>
> I think you do not want to allocate a page array each time we do
> svm_range_dma_map? Then can we put this struct page **page in
> svm_range, and allocate it each time we create a new prange (and
> clone/split a prange)?

Allocating and freeing memory in every dmamap call adds a point of 
failure and non-deterministic overhead for the memory allocation. It's 
non-deterministic because it depends on the current memory pressure 
situation, and if the array is large, it also needs to update the kernel 
page table and potentially flush CPU TLBs.

Allocating the page pointer array permanently in the prange structure 
would be even worse, because that increases our system memory overhead. 
For each GB of virtual address space we manage, we waste 2MB of system 
memory for the page arrays. To make matters worse, those page arrays are 
allocated in kernel mode where they are not swappable. It's like taking 
physical DIMM sticks out of your system. We worked hard to eliminate 
redundant address and page array allocations from the driver a few years 
ago.
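
To spell out the math behind that 2MB figure (assuming 4KB pages and
8-byte struct page pointers):

    1 GB / 4 KB per page   = 262144 pages
    262144 pages * 8 bytes = 2 MB of page pointers per GB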

I'd prefer a small amount of redundant address calculations over the 
memory or non-deterministic allocation overhead.
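
To illustrate what I mean -- this is just a rough, untested sketch, and
the helper name is made up, not part of the actual patch -- the VRAM
page count can be gathered on the fly while walking the hmm pfns,
without staging them in a temporary page array first:

#include <linux/hmm.h>
#include <linux/mm.h>

/* Hypothetical helper, untested: count device-private (VRAM) pages by
 * converting each hmm pfn on the fly. Assumes the pfns were populated
 * by hmm_range_fault(), as in the existing code.
 */
static uint64_t svm_count_vram_pages(const unsigned long *hmm_pfns,
                                     unsigned long npages)
{
        uint64_t vram_pages = 0;
        unsigned long i;

        for (i = 0; i < npages; i++) {
                struct page *page = hmm_pfn_to_page(hmm_pfns[i]);

                if (is_zone_device_page(page))
                        vram_pages++;
        }
        return vram_pages;
}

svm_range_dma_map_dev could do the same per-pfn conversion in its
per-device loop, so svm_range_dma_map would not need to pass a page
array down at all.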

Regards,
   Felix


>
>>
>>> +    if (!page)
>>> +        return -ENOMEM;
>>> +
>>> +    *vram_pages = 0;
>>> +    for (int i = 0; i < npages; i++) {
>>> +        page[i] = hmm_pfn_to_page(hmm_pfns[i]);
>>> +        if (is_zone_device_page(page[i]))
>>> +            (*vram_pages)++;
>>> +    }
>>> +
>>>       p = container_of(prange->svms, struct kfd_process, svms);
>>>         for_each_set_bit(gpuidx, bitmap, MAX_GPU_INSTANCE) {
>>> @@ -221,11 +231,12 @@ svm_range_dma_map(struct svm_range *prange, 
>>> unsigned long *bitmap,
>>>           }
>>>             r = svm_range_dma_map_dev(pdd->dev->adev, prange, 
>>> offset, npages,
>>> -                      hmm_pfns, gpuidx);
>>> +                      hmm_pfns, gpuidx, page);
>>>           if (r)
>>>               break;
>>>       }
>>>   +    kvfree(page);
>>>       return r;
>>>   }
>>>   @@ -347,6 +358,7 @@ svm_range *svm_range_new(struct svm_range_list 
>>> *svms, uint64_t start,
>>>       INIT_LIST_HEAD(&prange->child_list);
>>>       atomic_set(&prange->invalid, 0);
>>>       prange->validate_timestamp = 0;
>>> +    prange->vram_pages = 0;
>>>       mutex_init(&prange->migrate_mutex);
>>>       mutex_init(&prange->lock);
>>>   @@ -393,6 +405,8 @@ static void svm_range_bo_release(struct kref 
>>> *kref)
>>>                prange->start, prange->last);
>>>           mutex_lock(&prange->lock);
>>>           prange->svm_bo = NULL;
>>> +        /* prange does not hold vram page now */
>>> +        prange->actual_loc = 0;
>>
>> If you need this here, something probably went wrong elsewhere. 
>> Before we release the BO, we should have migrated everything to 
>> system memory, and actual_loc should already be 0. If anything, I'd 
>> add a WARN_ON(prange->actual_loc) here.
>>
> Will replace it with a WARN_ON as mentioned before.
>>
>>> mutex_unlock(&prange->lock);
>>>             spin_lock(&svm_bo->list_lock);
>>> @@ -966,6 +980,11 @@ svm_range_split_nodes(struct svm_range *new, 
>>> struct svm_range *old,
>>>       new->svm_bo = svm_range_bo_ref(old->svm_bo);
>>>       new->ttm_res = old->ttm_res;
>>>   +    /* set new's vram_pages as old range's now, the acurate 
>>> vram_pages
>>> +     * will be updated during mapping
>>> +     */
>>> +    new->vram_pages = min(old->vram_pages, new->npages);
>>> +
>>>       spin_lock(&new->svm_bo->list_lock);
>>>       list_add(&new->svm_bo_list, &new->svm_bo->range_list);
>>>       spin_unlock(&new->svm_bo->list_lock);
>>> @@ -1610,6 +1629,7 @@ static int svm_range_validate_and_map(struct 
>>> mm_struct *mm,
>>>       struct svm_validate_context *ctx;
>>>       unsigned long start, end, addr;
>>>       struct kfd_process *p;
>>> +    uint64_t vram_pages;
>>>       void *owner;
>>>       int32_t idx;
>>>       int r = 0;
>>> @@ -1678,11 +1698,13 @@ static int svm_range_validate_and_map(struct 
>>> mm_struct *mm,
>>>           }
>>>       }
>>>   +    vram_pages = 0;
>>>       start = prange->start << PAGE_SHIFT;
>>>       end = (prange->last + 1) << PAGE_SHIFT;
>>>       for (addr = start; addr < end && !r; ) {
>>>           struct hmm_range *hmm_range;
>>>           struct vm_area_struct *vma;
>>> +        uint64_t vram_pages_vma;
>>>           unsigned long next;
>>>           unsigned long offset;
>>>           unsigned long npages;
>>> @@ -1711,12 +1733,14 @@ static int svm_range_validate_and_map(struct 
>>> mm_struct *mm,
>>>             offset = (addr - start) >> PAGE_SHIFT;
>>>           r = svm_range_dma_map(prange, ctx->bitmap, offset, npages,
>>> -                      hmm_range->hmm_pfns);
>>> +                      hmm_range->hmm_pfns, &vram_pages_vma);
>>>           if (r) {
>>>               pr_debug("failed %d to dma map range\n", r);
>>>               goto unreserve_out;
>>>           }
>>>   +        vram_pages += vram_pages_vma;
>>> +
>>>           svm_range_lock(prange);
>>>           if (amdgpu_hmm_range_get_pages_done(hmm_range)) {
>>>               pr_debug("hmm update the range, need validate again\n");
>>> @@ -1738,8 +1762,20 @@ static int svm_range_validate_and_map(struct 
>>> mm_struct *mm,
>>>           addr = next;
>>>       }
>>>   -    if (addr == end)
>>> +    if (addr == end) {
>>>           prange->mapped_to_gpu = true;
>>> +        prange->vram_pages = vram_pages;
>>> +
>>> +        /* if pragne does not include any vram page and it
>>
>> Typo: prange
>>
> ok.
>>
>>> +         * has not released svm_bo drop its svm_bo reference
>>> +         */
>>> +        if (!vram_pages && prange->ttm_res) {
>>> +            svm_range_vram_node_free(prange);
>>> +            mutex_lock(&prange->lock);
>>
>> Is this lock really needed?
>>
> Same as before, will remove mutex_lock/unlock(&prange->lock) as it
> seems overly protective.
>>
>>> +            prange->actual_loc = 0;
>>> +            mutex_unlock(&prange->lock);
>>> +        }
>>> +    }
>>>     unreserve_out:
>>>       svm_range_unreserve_bos(ctx);
>>> @@ -1996,6 +2032,7 @@ static struct svm_range 
>>> *svm_range_clone(struct svm_range *old)
>>>       new->actual_loc = old->actual_loc;
>>>       new->granularity = old->granularity;
>>>       new->mapped_to_gpu = old->mapped_to_gpu;
>>> +    new->vram_pages = old->vram_pages;
>>>       bitmap_copy(new->bitmap_access, old->bitmap_access, 
>>> MAX_GPU_INSTANCE);
>>>       bitmap_copy(new->bitmap_aip, old->bitmap_aip, MAX_GPU_INSTANCE);
>>>   @@ -2903,6 +2940,7 @@ svm_range_restore_pages(struct amdgpu_device 
>>> *adev, unsigned int pasid,
>>>               uint32_t vmid, uint32_t node_id,
>>>               uint64_t addr, bool write_fault)
>>>   {
>>> +    unsigned long start, last, size;
>>>       struct mm_struct *mm = NULL;
>>>       struct svm_range_list *svms;
>>>       struct svm_range *prange;
>>> @@ -3038,32 +3076,38 @@ svm_range_restore_pages(struct amdgpu_device 
>>> *adev, unsigned int pasid,
>>>       kfd_smi_event_page_fault_start(node, p->lead_thread->pid, addr,
>>>                          write_fault, timestamp);
>>>   -    if (prange->actual_loc != best_loc) {
>>> +    if (prange->actual_loc != 0 || best_loc != 0) {
>>>           migration = true;
>>> +        /* Align migration range start and size to granularity size */
>>> +        size = 1UL << prange->granularity;
>>> +        start = ALIGN_DOWN(addr, size);
>>> +        last = ALIGN(addr + 1, size) - 1;
>>> +
>>> +        start = (start >= prange->start) ? start : prange->start;
>>> +        last = (last <= prange->last) ? last : prange->last;
>>
>> Similar to the simplification I recommended in svm_migrate_to_ram, 
>> this could be:
>>
>>     start = max(ALIGN_DOWN(addr, size), prange->start);
>>     last = min(ALIGN(addr + 1, size) - 1, prange->last);
>>
> Ok, I overlooked that last time.
>
> Regards
>
> Xiaogang
>
>> Regards,
>>   Felix
>>
>>
>>> +
>>>           if (best_loc) {
>>> -            r = svm_migrate_to_vram(prange, best_loc, mm,
>>> -                    KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU);
>>> +            r = svm_migrate_to_vram(prange, best_loc, start, last,
>>> +                    mm, KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU);
>>>               if (r) {
>>>                   pr_debug("svm_migrate_to_vram failed (%d) at %llx, 
>>> falling back to system memory\n",
>>>                        r, addr);
>>>                   /* Fallback to system memory if migration to
>>>                    * VRAM failed
>>>                    */
>>> -                if (prange->actual_loc)
>>> -                    r = svm_migrate_vram_to_ram(prange, mm,
>>> -                       KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU,
>>> -                       NULL);
>>> +                if (prange->actual_loc && prange->actual_loc != 
>>> best_loc)
>>> +                    r = svm_migrate_vram_to_ram(prange, mm, start, 
>>> last,
>>> +                        KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU, NULL);
>>>                   else
>>>                       r = 0;
>>>               }
>>>           } else {
>>> -            r = svm_migrate_vram_to_ram(prange, mm,
>>> -                    KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU,
>>> -                    NULL);
>>> +            r = svm_migrate_vram_to_ram(prange, mm, start, last,
>>> +                    KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU, NULL);
>>>           }
>>>           if (r) {
>>>               pr_debug("failed %d to migrate svms %p [0x%lx 0x%lx]\n",
>>> -                 r, svms, prange->start, prange->last);
>>> +                r, svms, start, last);
>>>               goto out_unlock_range;
>>>           }
>>>       }
>>> @@ -3417,18 +3461,24 @@ svm_range_trigger_migration(struct mm_struct 
>>> *mm, struct svm_range *prange,
>>>       *migrated = false;
>>>       best_loc = svm_range_best_prefetch_location(prange);
>>>   -    if (best_loc == KFD_IOCTL_SVM_LOCATION_UNDEFINED ||
>>> -        best_loc == prange->actual_loc)
>>> +    /* when best_loc is a gpu node and same as prange->actual_loc
>>> +     * we still need do migration as prange->actual_loc !=0 does
>>> +     * not mean all pages in prange are vram. hmm migrate will pick
>>> +     * up right pages during migration.
>>> +     */
>>> +    if ((best_loc == KFD_IOCTL_SVM_LOCATION_UNDEFINED) ||
>>> +        (best_loc == 0 && prange->actual_loc == 0))
>>>           return 0;
>>>         if (!best_loc) {
>>> -        r = svm_migrate_vram_to_ram(prange, mm,
>>> +        r = svm_migrate_vram_to_ram(prange, mm, prange->start, 
>>> prange->last,
>>>                       KFD_MIGRATE_TRIGGER_PREFETCH, NULL);
>>>           *migrated = !r;
>>>           return r;
>>>       }
>>>   -    r = svm_migrate_to_vram(prange, best_loc, mm, 
>>> KFD_MIGRATE_TRIGGER_PREFETCH);
>>> +    r = svm_migrate_to_vram(prange, best_loc, prange->start, 
>>> prange->last,
>>> +                    mm, KFD_MIGRATE_TRIGGER_PREFETCH);
>>>       *migrated = !r;
>>>         return r;
>>> @@ -3483,7 +3533,11 @@ static void 
>>> svm_range_evict_svm_bo_worker(struct work_struct *work)
>>>             mutex_lock(&prange->migrate_mutex);
>>>           do {
>>> +            /* migrate all vram pages in this prange to sys ram
>>> +             * after that prange->actual_loc should be zero
>>> +             */
>>>               r = svm_migrate_vram_to_ram(prange, mm,
>>> +                    prange->start, prange->last,
>>>                       KFD_MIGRATE_TRIGGER_TTM_EVICTION, NULL);
>>>           } while (!r && prange->actual_loc && --retries);
>>>   diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
>>> index 5fd958a97a28..8574cc5eeabd 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
>>> @@ -78,6 +78,7 @@ struct svm_work_list_item {
>>>    * @update_list:link list node used to add to update_list
>>>    * @mapping:    bo_va mapping structure to create and update GPU 
>>> page table
>>>    * @npages:     number of pages
>>> + * @vram_pages: vram pages number in this svm_range
>>>    * @dma_addr:   dma mapping address on each GPU for system memory 
>>> physical page
>>>    * @ttm_res:    vram ttm resource map
>>>    * @offset:     range start offset within mm_nodes
>>> @@ -88,7 +89,9 @@ struct svm_work_list_item {
>>>    * @flags:      flags defined as KFD_IOCTL_SVM_FLAG_*
>>>    * @perferred_loc: perferred location, 0 for CPU, or GPU id
>>>    * @perfetch_loc: last prefetch location, 0 for CPU, or GPU id
>>> - * @actual_loc: the actual location, 0 for CPU, or GPU id
>>> + * @actual_loc: this svm_range location. 0: all pages are from sys 
>>> ram;
>>> + *              GPU id: this svm_range may include vram pages from 
>>> GPU with
>>> + *              id actual_loc.
>>>    * @granularity:migration granularity, log2 num pages
>>>    * @invalid:    not 0 means cpu page table is invalidated
>>>    * @validate_timestamp: system timestamp when range is validated
>>> @@ -112,6 +115,7 @@ struct svm_range {
>>>       struct list_head        list;
>>>       struct list_head        update_list;
>>>       uint64_t            npages;
>>> +    uint64_t            vram_pages;
>>>       dma_addr_t            *dma_addr[MAX_GPU_INSTANCE];
>>>       struct ttm_resource        *ttm_res;
>>>       uint64_t            offset;

