[v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram

Matthew Brost matthew.brost at intel.com
Mon Apr 15 19:40:04 UTC 2024


On Fri, Apr 12, 2024 at 03:21:04PM -0600, Zeng, Oak wrote:
> 
> 
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost at intel.com>
> > Sent: Wednesday, April 10, 2024 10:49 PM
> > To: Zeng, Oak <oak.zeng at intel.com>
> > Cc: intel-xe at lists.freedesktop.org; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray at intel.com>; Bommu, Krishnaiah
> > <krishnaiah.bommu at intel.com>; Thomas.Hellstrom at linux.intel.com; Welty,
> > Brian <brian.welty at intel.com>
> > Subject: Re: [v2 28/31] drm/xe/svm: Introduce helper to migrate vma to vram
> > 
> > On Tue, Apr 09, 2024 at 04:17:39PM -0400, Oak Zeng wrote:
> > > Introduce a helper function xe_svm_migrate_vma_to_vram.
> > >
> > > Since the source pages of the svm range can be physically not
> > > contiguous, and the destination vram pages can also be not
> > > contiguous, there is no easy way to migrate multiple pages per
> > > blitter command. We do page by page migration for now.
> > >
> > > Migration is best effort. Even if we fail to migrate some pages,
> > > we will try to migrate the rest pages.
> > >
> > > FIXME: Use one blitter command to copy when both src and dst are
> > > physically contiguous
> > >
> > 
> > Yep, touched on this throughout the series. Only vram needs to be
> > contiguous though, as we dynamically create PT mappings for sram pages in
> > the migrate code. Getting this in is a must and should be done immediately
> > IMO as this is a very, very basic performance thing we know needs to be done.
> > We will likely have to tune this code quite a bit for performance, so
> > getting known things done would be helpful.
> > 
> > > FIXME: when a vma is partially migrated, split vma as we assume
> > > no mixture vma placement.
> > >
> > 
> > Agree we do not want to support partial migrations. We likely want to
> > return -EAGAIN for something and fall back to a smaller xe_vma chunk
> > size, which I discussed in [1] and commented on in [2].
> > 
> > Migration should be best effort too; if we fail to migrate we can always
> > leave the backing store in sram.
> > 
> > I do have a question though: when do we get partial migrations? A user
> > having called mlock on some of the pages? I just want to make sure I
> > fully understand that case.
> 
> Yah, mlock could be one case...
> 
> I also looked at the hmm code. There are a few other cases where MIGRATE_PFN_MIGRATE is not set (so we skip migration afterwards), such as when the pte is NULL, the vma is file-backed (not anonymous), or the entry is swapped out to disk, etc. See more details in function migrate_vma_collect_pmd.
> 
> 
> > 
> > [1] https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
> > [2] https://patchwork.freedesktop.org/patch/588528/?series=132229&rev=1
> > 
> > > Signed-off-by: Oak Zeng <oak.zeng at intel.com>
> > > Co-developed-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura at intel.com>
> > > Signed-off-by: Niranjana Vishwanathapura
> > <niranjana.vishwanathapura at intel.com>
> > > Cc: Matthew Brost <matthew.brost at intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom at intel.com>
> > > Cc: Brian Welty <brian.welty at intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_svm.h         |   2 +
> > >  drivers/gpu/drm/xe/xe_svm_migrate.c | 115
> > ++++++++++++++++++++++++++++
> > 
> > Same comment on file structure throughout the series apply here too.
> > 
> > >  2 files changed, 117 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > > index c9e4239c44b4..18ce2e3757c5 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > @@ -83,4 +83,6 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
> > >  void xe_devm_free_blocks(struct list_head *blocks);
> > >  void xe_devm_page_free(struct page *page);
> > >  vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
> > > +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm, struct xe_vma *vma,
> > > +							struct xe_tile *tile);
> > >  #endif
> > > diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c
> > b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > index 0db831af098e..ab8dd1f58aa4 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > @@ -220,3 +220,118 @@ vm_fault_t xe_svm_migrate_to_sram(struct
> > vm_fault *vmf)
> > >  	kvfree(buf);
> > >  	return 0;
> > >  }
> > > +
> > > +/**
> > > + * xe_svm_migrate_vma_to_vram() - migrate backing store of a vma to
> > vram
> > > + * Must be called with mmap_read_lock held.
> > > + * @vm: the vm that the vma belongs to
> > > + * @vma: the vma to migrate.
> > > + * @tile: the destination tile which holds the new backing store of the range
> > > + *
> > > + * Returns: negative errno on faiure, 0 on success
> > > + */
> > > +int xe_svm_migrate_vma_to_vram(struct xe_vm *vm,
> > > +							struct xe_vma *vma,
> > > +							struct xe_tile *tile)
> > > +{
> > > +	struct mm_struct *mm = vm->mm;
> > > +	unsigned long start = xe_vma_start(vma);
> > > +	unsigned long end = xe_vma_end(vma);
> > > +	unsigned long npages = (end - start) >> PAGE_SHIFT;
> > > +	struct xe_mem_region *mr = &tile->mem.vram;
> > > +	struct vm_area_struct *vas;
> > > +
> > > +	struct migrate_vma migrate = {
> > > +		.start		= start,
> > > +		.end		= end,
> > > +		.pgmap_owner	= tile->xe,
> > 
> > Again helper to assign owner.
> > 
> > > +		.flags          = MIGRATE_VMA_SELECT_SYSTEM,
> > > +	};
> > > +	struct device *dev = tile->xe->drm.dev;
> > > +	dma_addr_t *src_dma_addr;
> > > +	struct dma_fence *fence;
> > > +	struct page *src_page;
> > > +	LIST_HEAD(blocks);
> > > +	int ret = 0, i;
> > > +	u64 dst_dpa;
> > > +	void *buf;
> > > +
> > > +	mmap_assert_locked(mm);
> > 
> > This mmap_assert_locked is ambiguous; we should make it clear whether this
> > is read or write locked. Doesn't it have to be write locked to do the
> > migrate pages?
> 
> I followed the hmm document (Documentation/mm/hmm.rst), see section "Migration to and from device memory". It explicitly uses a read_lock in this document.
> 
> I believe a read_lock is enough for the migrate_vma_setup/migrate_vma_finalize().
> 
> As I understand it, mm.mmap_lock protects the process's address space. When we modify the process's address space, such as in mmap/munmap, we need to hold the lock in write mode; if we only read the process's address space, such as in migrate_vma_setup/finalize or in the cpu page fault handler case, we only need it in read mode.
> 
> I also cross-checked the amd driver. They also use a read lock; see function svm_range_restore_pages in kfd_svm.c.
> 

Yea, I see that too. Trying to figure out the locking... IMO the locking
document might actually be wrong, or at the very least the locking design
is very ill-conceived. We can discuss internally a bit before I
publicly share my grievances.

> 
> > 
> > A larger question about the locking... The CPU fault handler holds the
> > mmap lock in write mode, right?
> 
> No. Since the cpu fault handler doesn't modify the process address space, and instead only fills in the cpu page table for some valid address range, a read lock is enough.
> 

Ah, yes after digging around a bit I see this.

> > 
> > I'm asking as basically I think, at least initially, we want to hold the
> > mmap lock in a way that the GPU handler and CPU handler do not race,
> > i.e. from the userptr creation in the GPU fault handler to issuing the bind
> > we prevent the CPU fault handler from running.
> 
> Yes, we hold mmap_read_lock in both the cpu and gpu fault handlers to avoid that race.
>

That's not how rw locks work. Two threads can both hold the read lock in
parallel (shared read access), while only one thread can hold the write lock
(exclusive write access; no one can hold the read lock either). Thus my
concern about the cpu and gpu fault handlers running in parallel and the
larger locking design questions. Again, we can talk through this in
detail internally.
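
To illustrate (a rough sketch only, not the actual xe code; cpu_fault_path()
and gpu_fault_path() are just stand-in names):

#include <linux/mmap_lock.h>

/*
 * mmap_lock is an rw_semaphore, so both of these paths can run at the
 * same time because they only take it in read (shared) mode. Only a
 * writer, e.g. munmap, excludes them.
 */
static void cpu_fault_path(struct mm_struct *mm)
{
	mmap_read_lock(mm);	/* shared: does NOT exclude gpu_fault_path() */
	/* migrate_vma_setup() / migrate_vma_pages() / migrate_vma_finalize() */
	mmap_read_unlock(mm);
}

static void gpu_fault_path(struct mm_struct *mm)
{
	mmap_read_lock(mm);	/* shared: can run concurrently with the above */
	/* hmm_range_fault() / bind */
	mmap_read_unlock(mm);
}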
 
> In user mmap/munmap (such as the kernel function vm_munmap), we hold mmap_write_lock, which prevents it from racing with the cpu and gpu fault handlers.
> 
> 
> > 
> > I'm having issues figuring out how to prevent races between initial
> > binds of userptrs and userptr invalidates on faulting VMs. This race is
> > seen in xe_exec_fault_mode for example... So preventing races between the
> > CPU / GPU fault handlers with the mmap lock probably is a good idea initially.
> > We can likely make the locking finer grained once this is all working and I
> > figure out how to handle this race better.
> 
> 
> I don't quite follow here. 
> 
> Initial bind of userptr... if you meant the bind in the gpu page fault handler, then the race with invalidation is roughly as below:
> Invalidation is called with mmap_write_lock held

Is it? If the notifier does the invalidation via migrate_vma_setup in the
CPU fault handler, we established above that only the mmap_read_lock is
held.

> In userptr_pin_page, we hold mmap_read_lock, so we know that while pinning pages, invalidation is excluded.

Nope, see above: invalidation can happen while userptr_pin_page is
executing because of the read lock. The seqno check (described below) is
what prevents programming of bad page tables.
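
For reference, the canonical seqno/retry pattern from the mmu_interval_notifier
documentation looks roughly like this (a sketch only; pt_lock and the hmm_range
setup are placeholders, not the actual xe structures):

#include <linux/hmm.h>
#include <linux/mmu_notifier.h>
#include <linux/mmap_lock.h>
#include <linux/mutex.h>

static int pin_and_bind(struct mm_struct *mm, struct hmm_range *range,
			struct mutex *pt_lock)
{
	int ret;

again:
	range->notifier_seq = mmu_interval_read_begin(range->notifier);

	mmap_read_lock(mm);
	ret = hmm_range_fault(range);	/* "pin" the CPU pages */
	mmap_read_unlock(mm);
	if (ret == -EBUSY)
		goto again;
	if (ret)
		return ret;

	mutex_lock(pt_lock);
	/* Did an invalidation race with the pin? If so, retry. */
	if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
		mutex_unlock(pt_lock);
		goto again;
	}
	/* ... program the GPU page tables here, under pt_lock ... */
	mutex_unlock(pt_lock);

	return 0;
}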

> After the pin, before programming the gpu page table, we check whether an invalidation happened *after the last pin but before programming the page table*; if that happened, we retry
>

Yes, that is how it works on tip, but I am refactoring it in [1]. I was
trying to avoid the retry loop by turning PDE/PTE writes into clears if an
invalidation happened, but I am not sure if that works without a larger
refactor due to the nature of PDEs being shared. I may need the retry loop,
but that also gets tricky with an array of binds... There are a few options
here and we will land on a solution once [2] is merged.

Regardless, my point here is still valid - it may not be the worst idea
when getting this code initially up and running just to grab
mmap_write_lock in the GPU fault handler as a BKL (big kernel lock) to
prevent all races. Once the code is stable and stress testing is in place,
switch to finer-grained locking as defined in the HMM document (or newly
defined if we find the locking design is insufficient).
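
Roughly something like this in the GPU fault handler (an illustrative sketch
only, taking the write lock instead of the read lock the patch currently
documents; bind_vma_to_gpu() is a made-up stand-in for the actual bind step):

static int gpu_fault_handler_big_lock(struct xe_vm *vm, struct xe_vma *vma,
				      struct xe_tile *tile)
{
	struct mm_struct *mm = vm->mm;
	int ret;

	/* Write mode excludes the CPU fault handler, which takes read mode. */
	mmap_write_lock(mm);
	ret = xe_svm_migrate_vma_to_vram(vm, vma, tile);
	if (!ret)
		ret = bind_vma_to_gpu(vm, vma);	/* hypothetical bind step */
	mmap_write_unlock(mm);

	return ret;
}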

Matt

[1] https://patchwork.freedesktop.org/series/125608/
[2] https://patchwork.freedesktop.org/series/132246/

> 
> 
> Oak
> 
> > 
> > > +
> > > +	vas = find_vma_intersection(mm, start, start + 4);
> > 
> > find_vma should work fine here.
> > 
> > > +	if (!vas)
> > > +		return -ENOENT;
> > > +
> > > +	migrate.vma = vas;
> > > +	buf = kvcalloc(npages, 2* sizeof(*migrate.src) + sizeof(*src_dma_addr),
> > > +					GFP_KERNEL);
> > > +	if(!buf)
> > > +		return -ENOMEM;
> > > +	migrate.src = buf;
> > > +	migrate.dst = migrate.src + npages;
> > > +	src_dma_addr = (dma_addr_t *) (migrate.dst + npages);
> > > +	ret = xe_devm_alloc_pages(tile, npages, &blocks, migrate.dst);
> > 
> > Again as I discussed in [3] I think this should be broken out into a
> > different step with the blocks allocated before this, and here just
> > populate migrate.dst from the existing blocks.
> > 
> > [3] https://patchwork.freedesktop.org/patch/588523/?series=132229&rev=1
> > 
> > > +	if (ret)
> > > +		goto kfree_buf;
> > > +
> > > +	ret = migrate_vma_setup(&migrate);
> > > +	if (ret) {
> > > +		drm_err(&tile->xe->drm, "vma setup returned %d for range
> > [%lx - %lx]\n",
> > > +				ret, start, end);
> > > +		goto free_dst_pages;
> > > +	}
> > > +
> > > +	/**FIXME: partial migration of a range print a warning for now.
> > > +	 * If this message is printed, we need to split xe_vma as we
> > > +	 * don't support a mixture placement of one vma
> > > +	 */
> > > +	if (migrate.cpages != npages)
> > > +		drm_warn(&tile->xe->drm, "Partial migration for range [%lx -
> >  %lx], range is %ld pages, migrate only %ld pages\n",
> > > +				start, end, npages, migrate.cpages);
> > 
> > As discussed above, we shouldn't support this. We should fall back to
> > smaller xe_vma chunk size until we find one that works or simply leave
> > the pages in sram and map those pages to GPU.
> > 
> > > +
> > > +	/**Migrate page by page for now.
> > > +	 * Both source pages and destination pages can physically not
> > contiguous,
> > > +	 * there is no good way to migrate multiple pages per blitter command.
> > > +	 */
> > 
> > Touched on this a bunch throughout the series, let's do better than a
> > page-at-a-time migration.
> > 
> > Algorithm should be very similar to what I discussed here [4] but with a
> > few key differences.
> > 
> > - I think the sram pages can be unpopulated (page == NULL) if the user
> >   has not yet touched the page
> > - Also I think the MIGRATE_PFN_MIGRATE bit being clear is valid
> > 
> > In these cases this indicates we have to issue a copy for the pages we
> > are accumulating with contiguous vram addresses.
> > 
> > [4] https://patchwork.freedesktop.org/patch/588526/?series=132229&rev=1
> > 
> > > +	for (i = 0; i < npages; i++) {
> > > +		src_page = migrate_pfn_to_page(migrate.src[i]);
> > > +		if (unlikely(!src_page || !(migrate.src[i] &
> > MIGRATE_PFN_MIGRATE)))
> > 
> > Discussed this in the CPU fault patch: once we call migrate_vma_setup,
> > on subsequent errors we need to call migrate_vma_finalize to revert the
> > pages to their original state. At least I think that is the case if I am
> > reading the doc correctly.
> > 
> > Here on error we just free the pages...
> > 
> > Matt
> > 
> > > +			goto free_dst_page;
> > > +
> > > +		xe_assert(tile->xe, !is_zone_device_page(src_page));
> > > +		src_dma_addr[i] = dma_map_page(dev, src_page, 0,
> > PAGE_SIZE, DMA_TO_DEVICE);
> > > +		if (unlikely(dma_mapping_error(dev, src_dma_addr[i]))) {
> > > +			drm_warn(&tile->xe->drm, "dma map error for host
> > pfn %lx\n", migrate.src[i]);
> > > +			goto free_dst_page;
> > > +		}
> > > +		dst_dpa = xe_mem_region_pfn_to_dpa(mr, migrate.dst[i]);
> > > +		fence = xe_migrate_pa(tile->migrate, src_dma_addr[i], false,
> > > +				dst_dpa, true, PAGE_SIZE);
> > > +		if (IS_ERR(fence)) {
> > > +			drm_warn(&tile->xe->drm, "migrate host page
> > (pfn: %lx) to vram failed\n",
> > > +					migrate.src[i]);
> > > +			/**Migration is best effort. Even we failed here, we
> > continue*/
> > > +			goto free_dst_page;
> > > +		}
> > > +		/**FIXME: Use the first migration's out fence as the second
> > migration's input fence,
> > > +		 * and so on. Only wait the out fence of last migration?
> > > +		 */
> > > +		dma_fence_wait(fence, false);
> > > +		dma_fence_put(fence);
> > > +free_dst_page:
> > > +		xe_devm_page_free(pfn_to_page(migrate.dst[i]));
> > > +	}
> > > +
> > > +	for (i = 0; i < npages; i++)
> > > +		if (!(dma_mapping_error(dev, src_dma_addr[i])))
> > > +			dma_unmap_page(dev, src_dma_addr[i], PAGE_SIZE,
> > DMA_TO_DEVICE);
> > > +
> > > +	migrate_vma_pages(&migrate);
> > > +	migrate_vma_finalize(&migrate);
> > > +free_dst_pages:
> > > +	if (ret)
> > > +		xe_devm_free_blocks(&blocks);
> > > +kfree_buf:
> > > +	kfree(buf);
> > > +	return ret;
> > > +}
> > > --
> > > 2.26.3
> > >

