[v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram
Zeng, Oak
oak.zeng at intel.com
Fri Apr 12 21:21:04 UTC 2024
> -----Original Message-----
> From: Brost, Matthew <matthew.brost at intel.com>
> Sent: Wednesday, April 10, 2024 10:49 PM
> To: Zeng, Oak <oak.zeng at intel.com>
> Cc: intel-xe at lists.freedesktop.org; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray at intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu at intel.com>; Thomas.Hellstrom at linux.intel.com; Welty,
> Brian <brian.welty at intel.com>
> Subject: Re: [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram
>
> On Tue, Apr 09, 2024 at 04:17:39PM -0400, Oak Zeng wrote:
> > Introduce a helper function xe_svm_migrate_vma_to_vram.
> >
> > Since the source pages of the svm range can be physically not
> > contiguous, and the destination vram pages can also be not
> > contiguous, there is no easy way to migrate multiple pages per
> > blitter command. We do page by page migration for now.
> >
> > Migration is best effort. Even if we fail to migrate some pages,
> > we will try to migrate the rest of the pages.
> >
> > FIXME: Use one blitter command to copy when both src and dst are
> > physically contiguous
> >
>
> Yep, touched on this throughout the series. Only vram needs to be
> contiguous though, as we dynamically create PT mappings for sram pages
> in the migrate code. Getting this in is a must and should be done
> immediately IMO, as this is a very, very basic performance thing we
> know needs to be done. We will likely have to tune this code quite a
> bit for performance, so getting known things done would be helpful.
>
> > FIXME: when a vma is partially migrated, split vma as we assume
> > no mixture vma placement.
> >
>
> Agree we do not want to support partial migrations. We likely want to
> return -EAGAIN for something and fall back to a smaller xe_vma chunk
> size, which I discussed in [1] and commented on in [2].
>
> Migration should be best effort too; if we fail to migrate we can
> always leave the backing store in sram.
>
> I do have a question though: when do we get partial migrations? A user
> having called mlock on some of the pages? I just want to make sure I
> fully understand that case.
Yeah, mlock could be one case...
I also looked at the hmm code. There are a few other cases where MIGRATE_PFN_MIGRATE is not set (so we skip the migration afterwards), such as when the pte is NULL and the vma is file-backed (not anonymous), or when the entry has been swapped out to disk. See function migrate_vma_collect_pmd for more details.
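Just to illustrate, after migrate_vma_setup() each collected entry basically has to be handled like below (a sketch based on the loop in this patch; giving the already-allocated vram page back for skipped entries is left out here):

	for (i = 0; i < npages; i++) {
		struct page *src_page = migrate_pfn_to_page(migrate.src[i]);

		/* empty pte, file-backed vma, entry swapped out, etc.:
		 * core mm refused to collect this page, so clear the dst
		 * entry and leave the page where it is
		 */
		if (!src_page || !(migrate.src[i] & MIGRATE_PFN_MIGRATE)) {
			migrate.dst[i] = 0;
			continue;
		}

		/* otherwise this is an anonymous sram page we can copy to vram */
	}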
>
> [1] https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
> [2] https://patchwork.freedesktop.org/patch/588528/?series=132229&rev=1
>
> > Signed-off-by: Oak Zeng <oak.zeng at intel.com>
> > Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura at intel.com>
> > Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura at intel.com>
> > Cc: Matthew Brost <matthew.brost at intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom at intel.com>
> > Cc: Brian Welty <brian.welty at intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_svm.h | 2 +
> > drivers/gpu/drm/xe/xe_svm_migrate.c | 115 ++++++++++++++++++++++++++++
>
> Same comment on file structure throughout the series apply here too.
>
> > 2 files changed, 117 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > index c9e4239c44b4..18ce2e3757c5 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.h
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -83,4 +83,6 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
> > void xe_devm_free_blocks(struct list_head *blocks);
> > void xe_devm_page_free(struct page *page);
> > vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
> > +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm, struct xe_vma *vma,
> > + struct xe_tile *tile);
> > #endif
> > diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > index 0db831af098e..ab8dd1f58aa4 100644
> > --- a/drivers/gpu/drm/xe/xe_svm_migrate.c
> > +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > @@ -220,3 +220,118 @@ vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
> > kvfree(buf);
> > return 0;
> > }
> > +
> > +/**
> > + * xe_svm_migrate_vma_to_vram() - migrate backing store of a vma to vram
> > + * Must be called with mmap_read_lock held.
> > + * @vm: the vm that the vma belongs to
> > + * @vma: the vma to migrate.
> > + * @tile: the destination tile which holds the new backing store of the range
> > + *
> > + * Returns: negative errno on failure, 0 on success
> > + */
> > +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm,
> > + struct xe_vma *vma,
> > + struct xe_tile *tile)
> > +{
> > + struct mm_struct *mm = vm->mm;
> > + unsigned long start = xe_vma_start(vma);
> > + unsigned long end = xe_vma_end(vma);
> > + unsigned long npages = (end - start) >> PAGE_SHIFT;
> > + struct xe_mem_region *mr = &tile->mem.vram;
> > + struct vm_area_struct *vas;
> > +
> > + struct migrate_vma migrate = {
> > + .start = start,
> > + .end = end,
> > + .pgmap_owner = tile->xe,
>
> Again helper to assign owner.
>
> > + .flags = MIGRATE_VMA_SELECT_SYSTEM,
> > + };
> > + struct device *dev = tile->xe->drm.dev;
> > + dma_addr_t *src_dma_addr;
> > + struct dma_fence *fence;
> > + struct page *src_page;
> > + LIST_HEAD(blocks);
> > + int ret = 0, i;
> > + u64 dst_dpa;
> > + void *buf;
> > +
> > + mmap_assert_locked(mm);
>
> This mmap_assert_locked is ambiguous; we should make it clear whether
> this is read or write locked. Doesn't it have to be write locked to do
> the migrate pages?
I followed the hmm document (Documentation/mm/hmm.rst); see the section "Migration to and from device memory". It explicitly uses a read_lock in this document.
I believe a read_lock is enough for migrate_vma_setup()/migrate_vma_finalize().
As I understand it, mm.mmap_lock protects the process's address space. When we modify the process's address space, such as in mmap/munmap, we need to hold the lock in write mode; if we only read the process's address space, such as in migrate_vma_setup/finalize or in the cpu page fault handler, a read mode lock is enough.
I also cross-checked the amd driver. They also use a read lock; see function svm_range_restore_pages in kfd_svm.c.
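For reference, the pattern from hmm.rst looks roughly like below (a sketch only; in this series the lock/unlock lives in the caller of xe_svm_migrate_vma_to_vram, since the function documents that mmap_read_lock must already be held):

	mmap_read_lock(mm);
	vas = find_vma_intersection(mm, start, end);
	if (!vas) {
		mmap_read_unlock(mm);
		return -ENOENT;
	}

	migrate.vma = vas;
	ret = migrate_vma_setup(&migrate);
	if (ret || !migrate.cpages) {
		mmap_read_unlock(mm);
		return ret;
	}

	/* copy the collected pages to vram with the blitter here */

	migrate_vma_pages(&migrate);
	migrate_vma_finalize(&migrate);
	mmap_read_unlock(mm);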
>
> A larger question about the locking... The CPU fault handler holds the
> mmap lock in write mode, right?
No. Since the cpu fault handler doesn't modify the process address space and only fills in the cpu page table for some valid address range, a read lock is enough.
>
> I'm asking as basically I think at least initially we want to hold the
> mmap lock in a way that the GPU handler and CPU handler do not race.
> i.e. from the fault userptr creation in the GPU fault handler to
> issuing the bind, we prevent the CPU fault handler from running.
Yes, we hold the mmap_read_lock in both the cpu and gpu fault handlers to avoid that race.
In user mmap/munmap (such as the kernel function vm_munmap), we hold the mmap_write_lock, which prevents it from racing with the cpu and gpu fault handlers.
>
> I'm having issues figuring out how to prevent races between initial
> binds of userptrs and userptr invalidates on faulting VMs. This race is
> seen in xe_exec_fault_mode for example... So preventing races between
> the CPU / GPU fault handlers with the mmap lock is probably a good idea
> initially. We can likely make the locking finer grained once this is
> all working and I figure out how to handle this race better.
I don't quite follow here.
Initial bind of userptr... if you meant the bind in the gpu page fault handler, then the race with invalidation is roughly handled like below:
Invalidation is called with the mmap_write_lock held.
In userptr_pin_page, we hold the mmap_read_lock, so we know that during page pinning, invalidation is excluded.
After pinning, before programming the gpu page table, we check whether an invalidation happened *after the last pin but before programming the page table*; if it did, we retry.
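In pseudo code the retry loop is roughly like below (a sketch; names such as userptr->notifier and vm->notifier_lock are placeholders, not the exact xe fields):

again:
	range.notifier_seq = mmu_interval_read_begin(&userptr->notifier);
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);	/* pin/collect the pages */
	mmap_read_unlock(mm);
	if (ret == -EBUSY)
		goto again;
	if (ret)
		return ret;

	/* take the same lock the invalidation callback takes */
	mutex_lock(&vm->notifier_lock);
	if (mmu_interval_read_retry(&userptr->notifier, range.notifier_seq)) {
		/* an invalidation ran after the pin: drop and retry */
		mutex_unlock(&vm->notifier_lock);
		goto again;
	}
	/* program the gpu page table here */
	mutex_unlock(&vm->notifier_lock);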
Oak
>
> > +
> > + vas = find_vma_intersection(mm, start, start + 4);
>
> find_vma should work fine here.
>
> > + if (!vas)
> > + return -ENOENT;
> > +
> > + migrate.vma = vas;
> > + buf = kvcalloc(npages, 2* sizeof(*migrate.src) + sizeof(*src_dma_addr),
> > + GFP_KERNEL);
> > + if(!buf)
> > + return -ENOMEM;
> > + migrate.src = buf;
> > + migrate.dst = migrate.src + npages;
> > + src_dma_addr = (dma_addr_t *) (migrate.dst + npages);
> > + ret = xe_devm_alloc_pages(tile, npages, &blocks, migrate.dst);
>
> Again, as I discussed in [3], I think this should be broken out into a
> separate step with the blocks allocated before this, and here just
> populate migrate.dst from the existing blocks.
>
> [3] https://patchwork.freedesktop.org/patch/588523/?series=132229&rev=1
>
> > + if (ret)
> > + goto kfree_buf;
> > +
> > + ret = migrate_vma_setup(&migrate);
> > + if (ret) {
> > + drm_err(&tile->xe->drm, "vma setup returned %d for range [%lx - %lx]\n",
> > + ret, start, end);
> > + goto free_dst_pages;
> > + }
> > +
> > + /**FIXME: partial migration of a range: print a warning for now.
> > + * If this message is printed, we need to split xe_vma as we
> > + * don't support a mixture placement of one vma
> > + */
> > + if (migrate.cpages != npages)
> > + drm_warn(&tile->xe->drm, "Partial migration for range [%lx - %lx], range is %ld pages, migrate only %ld pages\n",
> > + start, end, npages, migrate.cpages);
>
> As discussed above, we shouldn't support this. We should fall back to
> smaller xe_vma chunk size until we find one that works or simply leave
> the pages in sram and map those pages to GPU.
>
> > +
> > + /**Migrate page by page for now.
> > + * Both source pages and destination pages can be physically not contiguous,
> > + * there is no good way to migrate multiple pages per blitter command.
> > + */
>
> Touched on this a bunch throughout the series, lets do better than a
> page a time migration.
>
> Algorithm should be very similar to what I discussed here [4] but with a
> few key differences.
>
> - I think the sram pages can be unpopulated (page == NULL) if the user
> has not yet touched the page
> - Also I think the MIGRATE_PFN_MIGRATE bit being clear is valid
>
> In these cases this indicate we have to issue a copy for the pages we
> are accumulating with contigous vram addresses.
>
> [4] https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
>
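Agreed. The accumulation could look roughly like below (a sketch only; issue_copy_sram_to_vram() is a made-up placeholder for a multi-page blit over PT-mapped sram pages, and migrate.dst[] is assumed to hold raw pfns as in this patch):

	for (i = 0; i < npages;) {
		unsigned long run = 1;

		/* nothing was collected for this page, skip it */
		if (!migrate_pfn_to_page(migrate.src[i]) ||
		    !(migrate.src[i] & MIGRATE_PFN_MIGRATE)) {
			i++;
			continue;
		}

		/* grow the run while the vram destination stays contiguous
		 * and the sram pages remain migratable
		 */
		while (i + run < npages &&
		       migrate_pfn_to_page(migrate.src[i + run]) &&
		       (migrate.src[i + run] & MIGRATE_PFN_MIGRATE) &&
		       migrate.dst[i + run] == migrate.dst[i] + run)
			run++;

		/* one blit for the whole run */
		issue_copy_sram_to_vram(tile, &migrate.src[i],
					xe_mem_region_pfn_to_dpa(mr, migrate.dst[i]),
					run);
		i += run;
	}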
> > + for (i = 0; i < npages; i++) {
> > + src_page = migrate_pfn_to_page(migrate.src[i]);
> > + if (unlikely(!src_page || !(migrate.src[i] & MIGRATE_PFN_MIGRATE)))
>
> Discussed this in the CPU fault patch: once we call migrate_vma_setup,
> on subsequent errors we need to call migrate_vma_finalize to revert the
> pages to their original state. At least I think so, if I am reading the
> documentation correctly.
>
> Here on error we just free the pages...
>
> Matt
>
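Good point. A sketch of what the error handling could look like instead of the goto free_dst_page below: on a per-page failure give the vram page back and clear the dst entry, and always reach migrate_vma_pages()/migrate_vma_finalize() once migrate_vma_setup() succeeded, so the core mm restores whatever we did not copy (copy_failed is just a placeholder for the per-page error cases):

	ret = migrate_vma_setup(&migrate);
	if (ret)
		goto free_dst_pages;

	for (i = 0; i < npages; i++) {
		/* dma map + blit the page as in the patch */
		if (copy_failed) {
			/* give the vram page back and clear the dst entry so
			 * finalize keeps the original sram page for this index
			 */
			xe_devm_page_free(pfn_to_page(migrate.dst[i]));
			migrate.dst[i] = 0;
			continue;
		}
	}

	/* always run both once setup succeeded */
	migrate_vma_pages(&migrate);
	migrate_vma_finalize(&migrate);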
> > + goto free_dst_page;
> > +
> > + xe_assert(tile->xe, !is_zone_device_page(src_page));
> > + src_dma_addr[i] = dma_map_page(dev, src_page, 0, PAGE_SIZE, DMA_TO_DEVICE);
> > + if (unlikely(dma_mapping_error(dev, src_dma_addr[i]))) {
> > + drm_warn(&tile->xe->drm, "dma map error for host pfn %lx\n", migrate.src[i]);
> > + goto free_dst_page;
> > + }
> > + dst_dpa = xe_mem_region_pfn_to_dpa(mr, migrate.dst[i]);
> > + fence = xe_migrate_pa(tile->migrate, src_dma_addr[i], false,
> > + dst_dpa, true, PAGE_SIZE);
> > + if (IS_ERR(fence)) {
> > + drm_warn(&tile->xe->drm, "migrate host page (pfn: %lx) to vram failed\n",
> > + migrate.src[i]);
> > + /**Migration is best effort. Even if we failed here, we continue*/
> > + goto free_dst_page;
> > + }
> > + /**FIXME: Use the first migration's out fence as the second migration's input fence,
> > + * and so on. Only wait the out fence of last migration?
> > + */
> > + dma_fence_wait(fence, false);
> > + dma_fence_put(fence);
> > +free_dst_page:
> > + xe_devm_page_free(pfn_to_page(migrate.dst[i]));
> > + }
> > +
> > + for (i = 0; i < npages; i++)
> > + if (!(dma_mapping_error(dev, src_dma_addr[i])))
> > + dma_unmap_page(dev, src_dma_addr[i], PAGE_SIZE, DMA_TO_DEVICE);
> > +
> > + migrate_vma_pages(&migrate);
> > + migrate_vma_finalize(&migrate);
> > +free_dst_pages:
> > + if (ret)
> > + xe_devm_free_blocks(&blocks);
> > +kfree_buf:
> > + kfree(buf);
> > + return ret;
> > +}
> > --
> > 2.26.3
> >