[v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram

Zeng, Oak oak.zeng at intel.com
Fri Apr 12 21:21:04 UTC 2024



> -----Original Message-----
> From: Brost, Matthew <matthew.brost at intel.com>
> Sent: Wednesday, April 10, 2024 10:49 PM
> To: Zeng, Oak <oak.zeng at intel.com>
> Cc: intel-xe at lists.freedesktop.org; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray at intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu at intel.com>; Thomas.Hellstrom at linux.intel.com; Welty,
> Brian <brian.welty at intel.com>
> Subject: Re: [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram
> 
> On Tue, Apr 09, 2024 at 04:17:39PM -0400, Oak Zeng wrote:
> > Introduce a helper function xe_svm_migrate_vma_to_vram.
> >
> > Since the source pages of the svm range can be physically
> > non-contiguous, and the destination vram pages can also be
> > non-contiguous, there is no easy way to migrate multiple pages per
> > blitter command. We do page-by-page migration for now.
> >
> > Migration is best effort. Even if we fail to migrate some pages,
> > we will try to migrate the remaining pages.
> >
> > FIXME: Use one blitter command to copy when both src and dst are
> > physically contiguous
> >
> 
> Yep, touched on this throughout the series. Only vram needs to be
> contiguous though, as we dynamically create PT mappings for sram pages in
> the migrate code. Getting this in is a must and should be done immediately
> IMO, as this is a very, very basic performance item we know needs to be done.
> We will likely have to tune this code quite a bit for performance, so
> getting known things done would be helpful.
> 
> > FIXME: when a vma is partially migrated, split the vma as we assume
> > no mixed placement within one vma.
> >
> 
> Agree we do not want to support partial migrations. We likely want to
> return -EAGAIN for something and fall back to a smaller xe_vma chunk
> size, which I discussed in [1] and added a comment on in [2].
> 
> Migration should be best effort too; if we fail to migrate we can always
> leave the backing store in sram.
> 
> I do have a question though: when do we get partial migrations? A user
> having called mlock on some of the pages? I just want to make sure I
> fully understand that case.

Yah, mlock could be one case...

I also looked at the hmm code. There are a few other cases where MIGRATE_PFN_MIGRATE is not set (so we skip migration afterwards), such as when the pte is NULL and the vma is file-backed (not anonymous), or when the entry has been swapped out to disk, etc. See function migrate_vma_collect_pmd for the details.
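For reference, the usual caller-side pattern after migrate_vma_setup() is a loop like the following (a minimal sketch, not the patch code): entries without MIGRATE_PFN_MIGRATE are simply left alone, and migrate_vma_finalize() restores them to their original state.

	/* Minimal sketch of the usual caller pattern: skip entries that
	 * migrate_vma_setup() refused to collect (mlock'ed, file-backed,
	 * swapped out, ...) and let migrate_vma_finalize() restore them. */
	for (i = 0; i < npages; i++) {
		struct page *spage = migrate_pfn_to_page(migrate.src[i]);

		if (!spage || !(migrate.src[i] & MIGRATE_PFN_MIGRATE)) {
			migrate.dst[i] = 0;	/* leave this page where it is */
			continue;
		}
		/* ... allocate a device page, copy, fill migrate.dst[i] ... */
	}
	migrate_vma_pages(&migrate);
	migrate_vma_finalize(&migrate);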


> 
> [1] https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
> [2] https://patchwork.freedesktop.org/patch/588528/?series=132229&rev=1
> 
> > Signed-off-by: Oak Zeng <oak.zeng at intel.com>
> > Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura at intel.com>
> > Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura at intel.com>
> > Cc: Matthew Brost <matthew.brost at intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom at intel.com>
> > Cc: Brian Welty <brian.welty at intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_svm.h         |   2 +
> >  drivers/gpu/drm/xe/xe_svm_migrate.c | 115 ++++++++++++++++++++++++++++
> 
> Same comment on file structure throughout the series apply here too.
> 
> >  2 files changed, 117 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > index c9e4239c44b4..18ce2e3757c5 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.h
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -83,4 +83,6 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
> >  void xe_devm_free_blocks(struct list_head *blocks);
> >  void xe_devm_page_free(struct page *page);
> >  vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
> > +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm, struct xe_vma *vma,
> > +							struct xe_tile *tile);
> >  #endif
> > diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > index 0db831af098e..ab8dd1f58aa4 100644
> > --- a/drivers/gpu/drm/xe/xe_svm_migrate.c
> > +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > @@ -220,3 +220,118 @@ vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
> >  	kvfree(buf);
> >  	return 0;
> >  }
> > +
> > +/**
> > + * xe_svm_migrate_vma_to_vram() - migrate backing store of a vma to vram
> > + * Must be called with mmap_read_lock held.
> > + * @vm: the vm that the vma belongs to
> > + * @vma: the vma to migrate.
> > + * @tile: the destination tile which holds the new backing store of the range
> > + *
> > + * Returns: negative errno on failure, 0 on success
> > + */
> > +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm,
> > +							struct xe_vma *vma,
> > +							struct xe_tile *tile)
> > +{
> > +	struct mm_struct *mm = vm->mm;
> > +	unsigned long start = xe_vma_start(vma);
> > +	unsigned long end = xe_vma_end(vma);
> > +	unsigned long npages = (end - start) >> PAGE_SHIFT;
> > +	struct xe_mem_region *mr = &tile->mem.vram;
> > +	struct vm_area_struct *vas;
> > +
> > +	struct migrate_vma migrate = {
> > +		.start		= start,
> > +		.end		= end,
> > +		.pgmap_owner	= tile->xe,
> 
> Again helper to assign owner.
> 
> > +		.flags          = MIGRATE_VMA_SELECT_SYSTEM,
> > +	};
> > +	struct device *dev = tile->xe->drm.dev;
> > +	dma_addr_t *src_dma_addr;
> > +	struct dma_fence *fence;
> > +	struct page *src_page;
> > +	LIST_HEAD(blocks);
> > +	int ret = 0, i;
> > +	u64 dst_dpa;
> > +	void *buf;
> > +
> > +	mmap_assert_locked(mm);
> 
> This mmap_assert_locked is ambiguous, we should make it clear whether this
> is read or write locked. Doesn't it have to be write-locked to migrate the
> pages?

I followed the hmm document (Documentation/mm/hmm.rst), see the section "Migration to and from device memory". That document explicitly shows a read_lock.

I believe a read_lock is enough for the migrate_vma_setup/migrate_vma_finalize().

As I understand it, mm.mmap_lock protects the process's address space. When we modify the process's address space, such as in mmap/munmap, we need to hold the lock in write mode; if we only read the process's address space, such as in migrate_vma_setup/finalize or in the cpu page fault handler, we only need the lock in read mode.

I also cross-checked the amd driver; it also uses a read lock. See function svm_range_restore_pages in kfd_svm.c.
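For reference, the flow described in that section of hmm.rst is roughly the following, all under mmap_read_lock (a minimal sketch, error handling omitted):

	mmap_read_lock(mm);

	ret = migrate_vma_setup(&migrate);	/* collect and isolate src pages */
	if (!ret) {
		/* for every entry with MIGRATE_PFN_MIGRATE set: allocate a
		 * device page, copy the data, and fill migrate.dst[] */
		migrate_vma_pages(&migrate);	/* install the new pages */
		migrate_vma_finalize(&migrate);	/* restore anything not migrated */
	}

	mmap_read_unlock(mm);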


> 
> A larger question about the locking... The CPU fault handler holds the
> mmap lock in write mode, right?

No. Since the cpu fault handler doesn't modify the process address space, and instead only fills in the cpu page table for some valid address range, a read lock is enough.

> 
> I'm asking as basically I think at least initially we want to hold the
> mmap lock in a way that the GPU handler and CPU handler do not race.
> i.e., from the fault userptr create in the GPU fault handler to issuing
> the bind, we prevent the CPU fault handler from running.

Yes, we hold mmap_read_lock in both the cpu and gpu fault handlers to avoid that race.

In user mmap/munmap (such as the kernel function vm_munmap), we hold mmap_write_lock, which prevents it from racing with the cpu and gpu fault handlers.


> 
> I'm having issues figuring out how to prevent races between initial
> binds of userptrs and userptr invalidates on faulting VMs. This race is
> seen in xe_exec_fault_mode, for example... So preventing races between the
> CPU / GPU fault handlers with the mmap lock is probably a good idea initially.
> We can likely make the locking finer grained once this is all working and I
> figure out how to handle this race better.


I don't quite follow here. 

Initial bind of userptr... If you mean the bind in the gpu page fault handler, then the race with invalidation roughly works like below (see the sketch after the list):
- Invalidation is called with mmap_write_lock held.
- In userptr_pin_page, we hold mmap_read_lock, so we know that invalidation is excluded while we pin the pages.
- After the pin, before programming the gpu page table, we check whether an invalidation happened *after the last pin but before programming the page table*; if it did, we retry.
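A rough sketch of that pin/check/retry sequence, written against the generic mmu_interval_notifier helpers; the field and function names here (userptr->notifier, vm->userptr.notifier_lock, xe_vm_bind_userptr()) are illustrative, not the actual xe code:

	again:
		seq = mmu_interval_read_begin(&userptr->notifier);
		range.notifier_seq = seq;

		mmap_read_lock(mm);
		err = hmm_range_fault(&range);		/* pin/collect the pages */
		mmap_read_unlock(mm);
		if (err == -EBUSY)
			goto again;
		if (err)
			return err;

		/* serialize against the invalidation callback before binding */
		spin_lock(&vm->userptr.notifier_lock);
		if (mmu_interval_read_retry(&userptr->notifier, seq)) {
			spin_unlock(&vm->userptr.notifier_lock);
			goto again;	/* invalidated after the pin, retry */
		}
		err = xe_vm_bind_userptr(vm, vma);	/* program the gpu page table */
		spin_unlock(&vm->userptr.notifier_lock);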



Oak

> 
> > +
> > +	vas = find_vma_intersection(mm, start, start + 4);
> 
> find_vma should work fine here.
> 
> > +	if (!vas)
> > +		return -ENOENT;
> > +
> > +	migrate.vma = vas;
> > +	buf = kvcalloc(npages, 2* sizeof(*migrate.src) + sizeof(*src_dma_addr),
> > +					GFP_KERNEL);
> > +	if(!buf)
> > +		return -ENOMEM;
> > +	migrate.src = buf;
> > +	migrate.dst = migrate.src + npages;
> > +	src_dma_addr = (dma_addr_t *) (migrate.dst + npages);
> > +	ret = xe_devm_alloc_pages(tile, npages, &blocks, migrate.dst);
> 
> Again as I discussed in [3] I think this should be broken out into a
> different step with the blocks allocated before this, and here just
> populate migrate.dst from the existing blocks.
> 
> [3] https://patchwork.freedesktop.org/patch/588523/?series=132229&rev=1
> 
> > +	if (ret)
> > +		goto kfree_buf;
> > +
> > +	ret = migrate_vma_setup(&migrate);
> > +	if (ret) {
> > +		drm_err(&tile->xe->drm, "vma setup returned %d for range [%lx - %lx]\n",
> > +				ret, start, end);
> > +		goto free_dst_pages;
> > +	}
> > +
> > +	/**FIXME: partial migration of a range prints a warning for now.
> > +	 * If this message is printed, we need to split the xe_vma as we
> > +	 * don't support a mixed placement within one vma
> > +	 */
> > +	if (migrate.cpages != npages)
> > +		drm_warn(&tile->xe->drm, "Partial migration for range [%lx - %lx], range is %ld pages, migrate only %ld pages\n",
> > +				start, end, npages, migrate.cpages);
> 
> As discussed above, we shouldn't support this. We should fall back to a
> smaller xe_vma chunk size until we find one that works, or simply leave
> the pages in sram and map those pages to the GPU.
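A hedged sketch of that fallback idea, assuming a hypothetical set of chunk sizes and hypothetical helpers (xe_vma_create_for_fault(), xe_vma_destroy(), xe_svm_map_sram_pages()), with vm, tile and fault_addr coming from the gpu fault handler; none of this is existing xe code:

	static const unsigned long chunk_sizes[] = { SZ_2M, SZ_64K, SZ_4K };

	for (i = 0; i < ARRAY_SIZE(chunk_sizes); i++) {
		/* hypothetical: carve a vma of this chunk size around the fault */
		vma = xe_vma_create_for_fault(vm, fault_addr, chunk_sizes[i]);
		err = xe_svm_migrate_vma_to_vram(vm, vma, tile);
		if (err != -EAGAIN)
			break;		/* success, or a hard failure */
		xe_vma_destroy(vma);	/* partial coverage, retry with a smaller chunk */
	}
	if (i == ARRAY_SIZE(chunk_sizes))
		/* hypothetical: give up on vram, leave the backing store in
		 * sram and map those pages to the gpu instead */
		err = xe_svm_map_sram_pages(vm, fault_addr, SZ_4K, tile);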
> 
> > +
> > +	/**Migrate page by page for now.
> > +	 * Both source pages and destination pages can be physically non-contiguous,
> > +	 * so there is no good way to migrate multiple pages per blitter command.
> > +	 */
> 
> Touched on this a bunch throughout the series, let's do better than
> page-at-a-time migration.
> 
> Algorithm should be very similar to what I discussed here [4] but with a
> few key differences.
> 
> - I think the sram pages can be unpopulated (page == NULL) if the user
>   has not yet touched the page
> - Also I think the MIGRATE_PFN_MIGRATE bit being clear is valid
> 
> In these cases this indicates we have to issue a copy for the pages we
> have accumulated with contiguous vram addresses.
> 
> [4] https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
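A minimal sketch of the accumulation idea described above, assuming the sram pages have already been dma-mapped into src_dma_addr[], and assuming a hypothetical xe_migrate_copy_chunk() that copies an array of sram dma addresses into one physically contiguous vram region (error handling omitted):

	/* Hedged sketch: batch runs of migratable pages whose vram destinations
	 * are physically contiguous and issue one copy per run. */
	unsigned long run_start = 0, run_len = 0;

	for (i = 0; i < npages; i++) {
		struct page *spage = migrate_pfn_to_page(migrate.src[i]);
		bool can_migrate = spage && (migrate.src[i] & MIGRATE_PFN_MIGRATE);
		bool contig = run_len &&
			xe_mem_region_pfn_to_dpa(mr, migrate.dst[i]) ==
			xe_mem_region_pfn_to_dpa(mr, migrate.dst[i - 1]) + PAGE_SIZE;

		/* flush the current run when a page can't migrate or the vram
		 * destination stops being contiguous */
		if (run_len && (!can_migrate || !contig)) {
			xe_migrate_copy_chunk(tile, &src_dma_addr[run_start],
					      xe_mem_region_pfn_to_dpa(mr, migrate.dst[run_start]),
					      run_len);
			run_len = 0;
		}

		if (!can_migrate)
			continue;
		if (!run_len)
			run_start = i;
		run_len++;
	}
	if (run_len)
		xe_migrate_copy_chunk(tile, &src_dma_addr[run_start],
				      xe_mem_region_pfn_to_dpa(mr, migrate.dst[run_start]),
				      run_len);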
> 
> > +	for (i = 0; i < npages; i++) {
> > +		src_page = migrate_pfn_to_page(migrate.src[i]);
> > +		if (unlikely(!src_page || !(migrate.src[i] & MIGRATE_PFN_MIGRATE)))
> 
> Discussed this in the CPU fault patch: once we call migrate_vma_setup,
> on subsequent errors we need to call migrate_vma_finalize to revert the
> pages to their original state. At least I think so, if I am reading the
> doc correctly.
> 
> Here on error we just free the pages...
> 
> Matt
> 
> > +			goto free_dst_page;
> > +
> > +		xe_assert(tile->xe, !is_zone_device_page(src_page));
> > +		src_dma_addr[i] = dma_map_page(dev, src_page, 0, PAGE_SIZE, DMA_TO_DEVICE);
> > +		if (unlikely(dma_mapping_error(dev, src_dma_addr[i]))) {
> > +			drm_warn(&tile->xe->drm, "dma map error for host pfn %lx\n", migrate.src[i]);
> > +			goto free_dst_page;
> > +		}
> > +		dst_dpa = xe_mem_region_pfn_to_dpa(mr, migrate.dst[i]);
> > +		fence = xe_migrate_pa(tile->migrate, src_dma_addr[i], false,
> > +				dst_dpa, true, PAGE_SIZE);
> > +		if (IS_ERR(fence)) {
> > +			drm_warn(&tile->xe->drm, "migrate host page (pfn: %lx) to vram failed\n",
> > +					migrate.src[i]);
> > +			/**Migration is best effort. Even if we fail here, we continue*/
> > +			goto free_dst_page;
> > +		}
> > +		/**FIXME: Use the first migration's out fence as the second migration's input fence,
> > +		 * and so on. Only wait on the out fence of the last migration?
> > +		 */
> > +		dma_fence_wait(fence, false);
> > +		dma_fence_put(fence);
> > +free_dst_page:
> > +		xe_devm_page_free(pfn_to_page(migrate.dst[i]));
> > +	}
> > +
> > +	for (i = 0; i < npages; i++)
> > +		if (!(dma_mapping_error(dev, src_dma_addr[i])))
> > +			dma_unmap_page(dev, src_dma_addr[i], PAGE_SIZE, DMA_TO_DEVICE);
> > +
> > +	migrate_vma_pages(&migrate);
> > +	migrate_vma_finalize(&migrate);
> > +free_dst_pages:
> > +	if (ret)
> > +		xe_devm_free_blocks(&blocks);
> > +kfree_buf:
> > +	kfree(buf);
> > +	return ret;
> > +}
> > --
> > 2.26.3
> >

