[RFC PATCH 23/28] drm/xe: Add SVM VRAM migration

Thu Aug 29 09:53:58 UTC 2024

Hi, Christian,

On Thu, 2024-08-29 at 11:24 +0200, Christian König wrote:
> 

...

> > > - Unified eviction is required (SVM VRAM and TTM BOs need to be
> > > able to
> > >    evict each other).
> > So core mm handles this by just roughly equally shrinking
> > everything.
> > Seems to work, and it has a pile of object shrinkers, and the page
> > lru is
> > also split into page cache and anon memory.
> > 
> > I think you need to put in more justification that unified eviction
> > is
> > required than just stating it, because a look at mm/ gives a very
> > well
> > established counterexample.
> > 
> > > - For exhaustive eviction [1], SVM VRAM allocations will almost
> > > certainly
> > >    require a dma-resv.
> > So from the TTM side we need exhaustive eviction, or at least
> > something a
> > bit more exhaustive than what ttm currently has. Note that i915-gem
> > also
> > never really got to perfect exhaustive eviction, it's just a pile
> > better
> > than ttm right now.
> 
> Please define what exhaustive eviction should mean? I think I know
> what 
> it is and I have been pushing TTM into the direction of solving this
> for 
> years.

We internally refer to exhaustive eviction being a client is always
guaranteed to eventually make progress in obtaining non-pinned vram,
typically by incrementally locking and keeping dma-resvs across a
single validation including validations during buffer object
allocations.

> 
> The last missing puzzle piece is to use drm_exec for TTM evictions,

and IMO keeping the dma-resv locks grabbed during eviction until at
least one unit of progress (one validation) has succeeded.

> but 
> apart from that everything should work now.
> 
> 
> Regards,
> Christian.

But as Sima pointed out in private communication, exhaustive eviction
is not really needed for faulting to make (crawling) progress.
Watermarks and VRAM trylock shrinking should suffice, since we're
strictly only required to service a single gpu page granule at a time.

However, ordinary bo-based jobs would still like to be able to
completely evict SVM vram. Whether that is important enough to strive
for is ofc up for discussion.

/Thomas

> 
> > Now if there's also SVM VRAM managed on a page lru, TTM exhaustive
> > eviction is going to win because the shrinkers can only trylock
> > dma_resv.
> > So this part works. It actually works so well on the system memory
> > side
> > that if we're not careful we can trigger oom, because we're too
> > good at
> > getting at all the memory.
> > 
> > SVM VRAM allocations otoh do not need exhaustive evictions. Or at
> > least I
> > don't see why, because the idea is that thanks to gpu and cpu page
> > faults,
> > you can always get out of a pinch by just trashing everything for a
> > while
> > and migrating the handfull of available pages a lot.
> > 
> > > - Likely allocation size is 2M which makes of size of BO (872)
> > >    acceptable per allocation (872 / 2M == .0004158).
> > > 
> > > With this, using TTM BO for VRAM backing store seems to be an
> > > obvious
> > > choice as it allows leveraging of the TTM eviction code.
> > Except it requires that you hold dma_resv, which brings in all
> > kinds of
> > pain. And for eviction we really don't need a lot of
> > synchronization, so a
> > lot of that locking is not needed, unlike the case where we have a
> > cpu
> > fault, where we absolutely need mmap_lock and all that to make sure
> > we
> > fault in the right page.
> > 
> > But for eviction we only need to throw out some pages, if we're not
> > entirely precise with picking the right ones (or have no idea into
> > which
> > vma they're all currently mapped into) it doesn't matter. That's
> > why
> > migrate_device_pages doesn't care about any of that at all, it
> > doesn't
> > need to by design. But by bo backing memory you drag in all that
> > stuff
> > that's causing headacheds for eviction.
> > 
> > The only thing migration tries to do is remove all pte, and if that
> > succeeds, move the page. Specialized for the gpusvm case, looking
> > at mm/
> > code as cheat sheet, we need roughly:
> > 
> > - reverse mapping structure like anon_vma. Except gpusvm can assume
> > that
> >    there's currently only one gpu side mapping, so we can just
> > stuff the
> >    gpusvm an va_address into the page, and protect it with the page
> > lock.
> > 
> > - we need pagetable locks, so that we can manipulate pagetables
> > (well
> >    specifically make ptes invalid) without taking any other locks.
> > 
> > - everyone else inserting or removing ptes for svm mappings also
> > needs to
> >    lock the page, or we have races. This might be the
> > hmm_range_fault races
> >    you're seeing when allowing vram pages, since I don't think
> > there's
> >    anything else stopping the page lookup otherwise from
> > succeeding.
> > 
> > - we might also need to stuff migrate ptes into the gpu side, like
> > the cpu
> >    does, to hold up refaults before the migration has finished. But
> > I think
> >    those are only needed for anon memory in sram because there's no
> > other
> >    way to find the right page than swap pte entries, of which
> > migration
> >    entries are a special case.
> > 
> > - core code also expects us to handle the page refcount correctly
> > for svm
> >    device memory, so we can't free the pages like normal bo pages
> > either
> >    directly to drm_buddy.
> > 
> > Now typing this all up will look an awful lot like what you have,
> > with the
> > dma_resv lock serving as the page lock and the pagetable lock. The
> > only
> > reason is that these locks are much smaller and nest within all the
> > other
> > stuff going on and so avoid the inversion issues.
> > 
> > So one annoying part is that this is a lot of pointlessly looking
> > typing.
> > The other is that it's full of races, because core mm really is
> > yolo all
> > the way down. So lots of ways you lock the wrong page and fun stuff
> > like
> > that, but the few cases that matter work:
> > 
> > - svm fault handling with hmm_range fault retries with mmu
> > notifiers. Note
> >    that we need to have vram pages locked and the notifier retrie
> > needs to
> >    be under the pagetable lock, or there's room to escape. At least
> > that's
> >    what I came up with last time I thought it all through.
> > 
> > - migrate_to_ram: it will hold a page reference which we know was
> > the
> >    valid vram page when the cpu pte was locked, but it might not be
> > it
> >    anymore. So we have to lock the page and check whether it's
> > still gpu
> >    mapped, and if not retry the entire fault since most likey
> > another
> >    migrate_to_ram has succeed meanwhile in parallel.
> > 
> > - for eviction we don't care, we might actually be migrating a page
> > no one
> >    even wants anymore.
> > 
> > Now I think you can get all this done with the dma_resv lock and
> > maybe the
> > bo refcount. But it does involve a tremendous amount of headaches
> > and
> > impendence mismatch, because that's not how page faults and
> > migrations
> > work in core mm.
> > 
> > Cheers, Sima
> > 
> > > Current migration policy is migrate any SVM range greater than or
> > > equal
> > > to 64k once.
> > > 
> > > [1]https://patchwork.freedesktop.org/series/133643/
> > > 
> > > Signed-off-by: Matthew Brostmatthew.brost at intel.com
> > > ---
> > >   drivers/gpu/drm/xe/xe_svm.c | 81
> > > ++++++++++++++++++++++++++++++++++++-
> > >   drivers/gpu/drm/xe/xe_svm.h |  1 +
> > >   2 files changed, 81 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > > b/drivers/gpu/drm/xe/xe_svm.c
> > > index 4372c02a341f..fd8987e0a506 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.c
> > > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > > @@ -217,8 +217,13 @@ static void xe_svm_invalidate(struct
> > > drm_gpusvm *gpusvm,
> > >   static int __xe_svm_garbage_collector(struct xe_vm *vm,
> > >   				      struct xe_svm_range
> > > *range)
> > >   {
> > > +	struct drm_gpusvm_ctx ctx = {};
> > >   	struct dma_fence *fence;
> > >   
> > > +	/* Evict any pages holding references to vram allocation
> > > */
> > > +	if (range->base.flags.partial_unmap && IS_DGFX(vm->xe))
> > > +		drm_gpusvm_migrate_to_sram(&vm->svm.gpusvm,
> > > &range->base, &ctx);
> > > +
> > >   	xe_vm_lock(vm, false);
> > >   	fence = xe_vm_range_unbind(vm, range);
> > >   	xe_vm_unlock(vm);
> > > @@ -504,21 +509,77 @@ static bool xe_svm_range_is_valid(struct
> > > xe_svm_range *range,
> > >   	return (range->tile_present & ~range->tile_invalidated)
> > > & BIT(tile->id);
> > >   }
> > >   
> > > +static struct xe_mem_region *tile_to_mr(struct xe_tile *tile)
> > > +{
> > > +	return &tile->mem.vram;
> > > +}
> > > +
> > > +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct
> > > xe_tile *tile,
> > > +				       struct xe_svm_range
> > > *range,
> > > +				       const struct
> > > drm_gpusvm_ctx *ctx)
> > > +{
> > > +	struct xe_mem_region *mr = tile_to_mr(tile);
> > > +	struct drm_buddy_block *block;
> > > +	struct list_head *blocks;
> > > +	struct xe_bo *bo;
> > > +	ktime_t end = 0;
> > > +	int err;
> > > +
> > > +retry:
> > > +	xe_vm_lock(vm, false);
> > > +	bo = xe_bo_create(tile_to_xe(tile), tile, vm, range-
> > > >base.va.end -
> > > +			  range->base.va.start,
> > > ttm_bo_type_device,
> > > +			  XE_BO_FLAG_VRAM_IF_DGFX(tile) |
> > > +			  XE_BO_FLAG_SYSTEM_ALLOC |
> > > XE_BO_FLAG_SKIP_CLEAR);
> > > +	xe_vm_unlock(vm);
> > > +	if (IS_ERR(bo)) {
> > > +		err = PTR_ERR(bo);
> > > +		if (xe_vm_validate_should_retry(NULL, err,
> > > &end))
> > > +			goto retry;
> > > +		return bo;
> > > +	}
> > > +
> > > +	blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)-
> > > >blocks;
> > > +	list_for_each_entry(block, blocks, link)
> > > +		block->private = mr;
> > > +
> > > +	/*
> > > +	 * Take ref because as soon as
> > > drm_gpusvm_migrate_to_vram succeeds the
> > > +	 * creation ref can be dropped upon CPU fault or unmap.
> > > +	 */
> > > +	xe_bo_get(bo);
> > > +
> > > +	err = drm_gpusvm_migrate_to_vram(&vm->svm.gpusvm,
> > > &range->base,
> > > +					 bo, ctx);
> > > +	if (err) {
> > > +		xe_bo_put(bo);	/* Local ref */
> > > +		xe_bo_put(bo);	/* Creation ref */
> > > +		return ERR_PTR(err);
> > > +	}
> > > +
> > > +	return bo;
> > > +}
> > > +
> > >   int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma
> > > *vma,
> > >   			    struct xe_tile *tile, u64
> > > fault_addr,
> > >   			    bool atomic)
> > >   {
> > > -	struct drm_gpusvm_ctx ctx = { .read_only =
> > > xe_vma_read_only(vma), };
> > > +	struct drm_gpusvm_ctx ctx = { .read_only =
> > > xe_vma_read_only(vma),
> > > +		.vram_possible = IS_DGFX(vm->xe), };
> > >   	struct xe_svm_range *range;
> > >   	struct drm_gpusvm_range *r;
> > >   	struct drm_exec exec;
> > >   	struct dma_fence *fence;
> > > +	struct xe_bo *bo = NULL;
> > >   	ktime_t end = 0;
> > >   	int err;
> > >   
> > >   	lockdep_assert_held_write(&vm->lock);
> > >   
> > >   retry:
> > > +	xe_bo_put(bo);
> > > +	bo = NULL;
> > > +
> > >   	/* Always process UNMAPs first so view SVM ranges is
> > > current */
> > >   	err = xe_svm_garbage_collector(vm);
> > >   	if (err)
> > > @@ -534,6 +595,22 @@ int xe_svm_handle_pagefault(struct xe_vm
> > > *vm, struct xe_vma *vma,
> > >   	if (xe_svm_range_is_valid(range, tile))
> > >   		return 0;
> > >   
> > > +	/* XXX: Add migration policy, for now migrate range once
> > > */
> > > +	if (IS_DGFX(vm->xe) && !range->migrated &&
> > > +	    range->base.flags.migrate_vram &&
> > > +	    (range->base.va.end - range->base.va.start) >=
> > > SZ_64K) {
> > > +		range->migrated = true;
> > > +
> > > +		bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
> > > +		if (IS_ERR(bo)) {
> > > +			drm_info(&vm->xe->drm,
> > > +				 "VRAM allocation failed,
> > > falling back to retrying, asid=%u, errno %ld\n",
> > > +				 vm->usm.asid, PTR_ERR(bo));
> > > +			bo = NULL;
> > > +			goto retry;
> > > +		}
> > > +	}
> > > +
> > >   	err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r,
> > > &ctx);
> > >   	if (err == -EFAULT || err == -EPERM)	/* Corner where
> > > CPU mappings have change */
> > >   	       goto retry;
> > > @@ -567,6 +644,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> > > struct xe_vma *vma,
> > >   	dma_fence_put(fence);
> > >   
> > >   err_out:
> > > +	xe_bo_put(bo);
> > > +
> > >   	return err;
> > >   }
> > >   
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > > b/drivers/gpu/drm/xe/xe_svm.h
> > > index 8b72e91cc37d..3f432483a230 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > @@ -18,6 +18,7 @@ struct xe_svm_range {
> > >   	struct list_head garbage_collector_link;
> > >   	u8 tile_present;
> > >   	u8 tile_invalidated;
> > > +	u8 migrated	:1;
> > >   };
> > >   
> > >   int xe_devm_add(struct xe_tile *tile, struct xe_mem_region
> > > *mr);
> > > -- 
> > > 2.34.1
> > >