[RFC PATCH 23/28] drm/xe: Add SVM VRAM migration
Christian König
christian.koenig at amd.com
Thu Aug 29 09:24:26 UTC 2024
On 28.08.24 18:06, Daniel Vetter wrote:
> On Tue, Aug 27, 2024 at 07:48:56PM -0700, Matthew Brost wrote:
>> Migration is implemented with range granularity, with VRAM backing being
>> a VM private TTM BO (i.e., shares dma-resv with VM). The lifetime of the
>> TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM
>> SVM range is migrated to SRAM, the TTM BO is destroyed).
>>
>> The reasoning behind using a TTM BO for the VRAM backing store, as opposed
>> to direct buddy allocation, is as follows:
>>
>> - DRM buddy allocations are not at page granularity, offering no
>> advantage over a BO.
> This one I'm not understanding.
Adding Arun as well. I couldn't understand it fully either, but maybe
it's because the buddy allocator is more optimized for higher-order
allocations?
>
>> - DRM buddy allocations do not solve locking inversion problems between
>> mmap lock and dma-resv locks.
> Which mmap -> dma_resv inversion? I've seen a lot ... I guess it also
> matters hugely which migration path we're in, i.e. opportunistic
> migration, cpu fault where we have to migrate or die, or when we run out
> of vram and need to evict stuff to make space.
Mhm, I think the locking order between the mmap lock and the dma-resv lock
has been well defined since dma_resv_lockdep() was added.
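For illustration, a rough sketch of the idea behind that lockdep priming
(this is not the actual dma_resv_lockdep() implementation, just the general
shape): take a fresh mm's mmap lock once at boot and acquire a dummy
dma-resv inside it, so lockdep learns the mmap_lock -> dma-resv order and
flags any inversion immediately:

/* Rough sketch, kernel context; not the real dma_resv_lockdep(). */
static int __init dma_resv_lock_order_selftest(void)
{
	struct ww_acquire_ctx ctx;
	struct dma_resv obj;
	struct mm_struct *mm = mm_alloc();
	int ret;

	if (!mm)
		return -ENOMEM;

	dma_resv_init(&obj);

	mmap_read_lock(mm);			/* outer: mmap lock */
	ww_acquire_init(&ctx, &reservation_ww_class);
	ret = dma_resv_lock(&obj, &ctx);	/* inner: dma-resv */
	if (!ret)
		dma_resv_unlock(&obj);
	ww_acquire_fini(&ctx);
	mmap_read_unlock(mm);

	dma_resv_fini(&obj);
	mmput(mm);
	return 0;
}
subsys_initcall(dma_resv_lock_order_selftest);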
>> - Unified eviction is required (SVM VRAM and TTM BOs need to be able to
>> evict each other).
> So core mm handles this by just roughly equally shrinking everything.
> Seems to work, and it has a pile of object shrinkers, and the page lru is
> also split into page cache and anon memory.
>
> I think you need to put in more justification that unified eviction is
> required than just stating it, because a look at mm/ gives a very well
> established counterexample.
>
>> - For exhaustive eviction [1], SVM VRAM allocations will almost certainly
>> require a dma-resv.
> So from the TTM side we need exhaustive eviction, or at least something a
> bit more exhaustive than what ttm currently has. Note that i915-gem also
> never really got to perfect exhaustive eviction, it's just a pile better
> than ttm right now.
Please define what exhaustive eviction is supposed to mean here. I think I
know what it is, and I have been pushing TTM in the direction of solving
this for years.
The last missing puzzle piece is to use drm_exec for TTM evictions, but
apart from that everything should work now.
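For illustration, a minimal sketch of what driving eviction through drm_exec
could look like, following the documented lock/retry pattern. The
drm_exec_init() signature is as of recent kernels, and ttm_try_evict() is a
made-up stand-in for whatever eviction entry point TTM would grow:

/* Hedged sketch, kernel context; not a proposed implementation. */
static int evict_bos_exhaustively(struct drm_gem_object **objs, int count)
{
	struct drm_exec exec;
	int i, ret = 0;

	drm_exec_init(&exec, DRM_EXEC_INTERRUPTIBLE_WAIT, count);
	drm_exec_until_all_locked(&exec) {
		for (i = 0; i < count; i++) {
			ret = drm_exec_lock_obj(&exec, objs[i]);
			drm_exec_retry_on_contention(&exec);
			if (ret)
				goto out;
		}
	}

	/* All dma-resv locks are held now; evict without livelocking. */
	for (i = 0; i < count; i++) {
		ret = ttm_try_evict(objs[i]);	/* hypothetical helper */
		if (ret)
			break;
	}
out:
	drm_exec_fini(&exec);
	return ret;
}

The point being that a contended dma-resv lock makes drm_exec back off and
retry the whole acquisition instead of failing, which is presumably the
"exhaustive" part.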
Regards,
Christian.
> Now if there's also SVM VRAM managed on a page lru, TTM exhaustive
> eviction is going to win because the shrinkers can only trylock dma_resv.
> So this part works. It actually works so well on the system memory side
> that if we're not careful we can trigger oom, because we're too good at
> getting at all the memory.
>
> SVM VRAM allocations otoh do not need exhaustive evictions. Or at least I
> don't see why, because the idea is that thanks to gpu and cpu page faults,
> you can always get out of a pinch by just thrashing everything for a while
> and migrating the handful of available pages a lot.
>
>> - The likely allocation size is 2M, which makes the size of the BO (872
>> bytes) acceptable per allocation (872 / 2M == .0004158).
>>
>> With this, using TTM BO for VRAM backing store seems to be an obvious
>> choice as it allows leveraging of the TTM eviction code.
> Except it requires that you hold dma_resv, which brings in all kinds of
> pain. And for eviction we really don't need a lot of synchronization, so a
> lot of that locking is not needed, unlike the case where we have a cpu
> fault, where we absolutely need mmap_lock and all that to make sure we
> fault in the right page.
>
> But for eviction we only need to throw out some pages; if we're not
> entirely precise with picking the right ones (or have no idea which
> vma they're all currently mapped into) it doesn't matter. That's why
> migrate_device_pages doesn't care about any of that at all, it doesn't
> need to by design. But with bo-backed memory you drag in all that stuff
> that's causing headaches for eviction.
>
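For illustration, a minimal sketch of such an eviction path using the
migrate_device_* helpers, with the destination allocation and the actual
copy elided as driver-specific:

/* Hedged sketch, kernel context. */
static int evict_device_range(unsigned long start_pfn, unsigned long npages)
{
	unsigned long *src_pfns, *dst_pfns;
	int ret;

	src_pfns = kvcalloc(npages, sizeof(*src_pfns), GFP_KERNEL);
	dst_pfns = kvcalloc(npages, sizeof(*dst_pfns), GFP_KERNEL);
	if (!src_pfns || !dst_pfns) {
		ret = -ENOMEM;
		goto out;
	}

	/* Collect the device pages and unmap their CPU ptes, no mmap_lock. */
	ret = migrate_device_range(src_pfns, start_pfn, npages);
	if (ret)
		goto out;

	/* Driver: allocate system pages into dst_pfns and copy the data. */

	/* Move the mappings over and finish the migration. */
	migrate_device_pages(src_pfns, dst_pfns, npages);
	migrate_device_finalize(src_pfns, dst_pfns, npages);
out:
	kvfree(src_pfns);
	kvfree(dst_pfns);
	return ret;
}

Note that nothing here needs to know which vmas the pages are mapped into;
the core mm helpers find and clear the ptes themselves.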
> The only thing migration tries to do is remove all ptes, and if that
> succeeds, move the page. Specialized for the gpusvm case, looking at mm/
> code as a cheat sheet, we need roughly:
>
> - reverse mapping structure like anon_vma. Except gpusvm can assume that
> there's currently only one gpu side mapping, so we can just stuff the
> gpusvm and va_address into the page, and protect them with the page lock
> (see the sketch after this list).
>
> - we need pagetable locks, so that we can manipulate pagetables (well
> specifically make ptes invalid) without taking any other locks.
>
> - everyone else inserting or removing ptes for svm mappings also needs to
> lock the page, or we have races. This might be the hmm_range_fault races
> you're seeing when allowing vram pages, since I don't think there's
> anything else otherwise stopping the page lookup from succeeding.
>
> - we might also need to stuff migrate ptes into the gpu side, like the cpu
> does, to hold up refaults before the migration has finished. But I think
> those are only needed for anon memory in sram because there's no other
> way to find the right page than swap pte entries, of which migration
> entries are a special case.
>
> - core code also expects us to handle the page refcount correctly for svm
> device memory, so we can't free the pages like normal bo pages either
> directly to drm_buddy.
>
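For illustration, a minimal sketch of the reverse-mapping idea from the
first point above, assuming the per-page data lives in zone_device_data;
all struct and function names here are made up:

/* Hedged sketch, kernel context. */
struct gpusvm_rmap {
	struct drm_gpusvm *gpusvm;	/* the single gpu-side mapping owner */
	unsigned long va_address;	/* gpu virtual address it is mapped at */
};

/* Publish the gpu mapping for a device page. */
static void gpusvm_page_set_rmap(struct page *page, struct gpusvm_rmap *rmap)
{
	lock_page(page);
	page->zone_device_data = rmap;
	unlock_page(page);
}

/*
 * Look up the gpu mapping, e.g. from eviction or migrate_to_ram.
 * The caller must hold the page lock so the mapping can't change
 * underneath it.
 */
static struct gpusvm_rmap *gpusvm_page_rmap(struct page *page)
{
	VM_WARN_ON_ONCE(!PageLocked(page));
	return page->zone_device_data;
}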
> Now typing this all up will look an awful lot like what you have, with the
> dma_resv lock serving as the page lock and the pagetable lock. The only
> reason to do it anyway is that these locks are much smaller and nest within
> all the other stuff going on, and so avoid the inversion issues.
>
> So one annoying part is that this is a lot of pointless-looking typing.
> The other is that it's full of races, because core mm really is yolo all
> the way down. So there are lots of ways to lock the wrong page and fun stuff
> like that, but the few cases that matter work:
>
> - svm fault handling with hmm_range_fault retries with mmu notifiers. Note
> that we need to have vram pages locked and the notifier retry needs to
> be under the pagetable lock, or there's room to escape (see the sketch
> after this list). At least that's what I came up with last time I thought
> it all through.
>
> - migrate_to_ram: it will hold a page reference which we know was the
> valid vram page when the cpu pte was locked, but it might not be it
> anymore. So we have to lock the page and check whether it's still gpu
> mapped, and if not retry the entire fault, since most likely another
> migrate_to_ram has succeeded meanwhile in parallel.
>
> - for eviction we don't care, we might actually be migrating a page no one
> even wants anymore.
>
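For illustration, a minimal sketch of that fault-handling shape, following
the retry pattern documented for hmm_range_fault() and mmu_interval_notifier;
gpu_pt_lock()/gpu_pt_unlock() and gpu_bind_pages() are hypothetical driver
hooks, and the vram page locking mentioned above is left out:

/* Hedged sketch, kernel context. */
static int gpusvm_fault(struct hmm_range *range, struct mm_struct *mm)
{
	int ret;

again:
	range->notifier_seq = mmu_interval_read_begin(range->notifier);

	mmap_read_lock(mm);
	ret = hmm_range_fault(range);
	mmap_read_unlock(mm);
	if (ret) {
		if (ret == -EBUSY)
			goto again;
		return ret;
	}

	gpu_pt_lock();			/* driver pagetable lock */
	if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
		/* An invalidation raced with us, the pfns are stale. */
		gpu_pt_unlock();
		goto again;
	}
	ret = gpu_bind_pages(range);	/* program the gpu page tables */
	gpu_pt_unlock();

	return ret;
}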
> Now I think you can get all this done with the dma_resv lock and maybe the
> bo refcount. But it does involve a tremendous amount of headaches and
> impedance mismatch, because that's not how page faults and migrations
> work in core mm.
>
> Cheers, Sima
>
>> The current migration policy is to migrate any SVM range greater than or
>> equal to 64k once.
>>
>> [1]https://patchwork.freedesktop.org/series/133643/
>>
>> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
>> ---
>> drivers/gpu/drm/xe/xe_svm.c | 81 ++++++++++++++++++++++++++++++++++++-
>> drivers/gpu/drm/xe/xe_svm.h | 1 +
>> 2 files changed, 81 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
>> index 4372c02a341f..fd8987e0a506 100644
>> --- a/drivers/gpu/drm/xe/xe_svm.c
>> +++ b/drivers/gpu/drm/xe/xe_svm.c
>> @@ -217,8 +217,13 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
>> static int __xe_svm_garbage_collector(struct xe_vm *vm,
>> struct xe_svm_range *range)
>> {
>> + struct drm_gpusvm_ctx ctx = {};
>> struct dma_fence *fence;
>>
>> + /* Evict any pages holding references to vram allocation */
>> + if (range->base.flags.partial_unmap && IS_DGFX(vm->xe))
>> + drm_gpusvm_migrate_to_sram(&vm->svm.gpusvm, &range->base, &ctx);
>> +
>> xe_vm_lock(vm, false);
>> fence = xe_vm_range_unbind(vm, range);
>> xe_vm_unlock(vm);
>> @@ -504,21 +509,77 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *range,
>> return (range->tile_present & ~range->tile_invalidated) & BIT(tile->id);
>> }
>>
>> +static struct xe_mem_region *tile_to_mr(struct xe_tile *tile)
>> +{
>> + return &tile->mem.vram;
>> +}
>> +
>> +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile,
>> + struct xe_svm_range *range,
>> + const struct drm_gpusvm_ctx *ctx)
>> +{
>> + struct xe_mem_region *mr = tile_to_mr(tile);
>> + struct drm_buddy_block *block;
>> + struct list_head *blocks;
>> + struct xe_bo *bo;
>> + ktime_t end = 0;
>> + int err;
>> +
>> +retry:
>> + xe_vm_lock(vm, false);
>> + bo = xe_bo_create(tile_to_xe(tile), tile, vm, range->base.va.end -
>> + range->base.va.start, ttm_bo_type_device,
>> + XE_BO_FLAG_VRAM_IF_DGFX(tile) |
>> + XE_BO_FLAG_SYSTEM_ALLOC | XE_BO_FLAG_SKIP_CLEAR);
>> + xe_vm_unlock(vm);
>> + if (IS_ERR(bo)) {
>> + err = PTR_ERR(bo);
>> + if (xe_vm_validate_should_retry(NULL, err, &end))
>> + goto retry;
>> + return bo;
>> + }
>> +
>> + blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks;
>> + list_for_each_entry(block, blocks, link)
>> + block->private = mr;
>> +
>> + /*
>> + * Take ref because as soon as drm_gpusvm_migrate_to_vram succeeds the
>> + * creation ref can be dropped upon CPU fault or unmap.
>> + */
>> + xe_bo_get(bo);
>> +
>> + err = drm_gpusvm_migrate_to_vram(&vm->svm.gpusvm, &range->base,
>> + bo, ctx);
>> + if (err) {
>> + xe_bo_put(bo); /* Local ref */
>> + xe_bo_put(bo); /* Creation ref */
>> + return ERR_PTR(err);
>> + }
>> +
>> + return bo;
>> +}
>> +
>> int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>> struct xe_tile *tile, u64 fault_addr,
>> bool atomic)
>> {
>> - struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), };
>> + struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma),
>> + .vram_possible = IS_DGFX(vm->xe), };
>> struct xe_svm_range *range;
>> struct drm_gpusvm_range *r;
>> struct drm_exec exec;
>> struct dma_fence *fence;
>> + struct xe_bo *bo = NULL;
>> ktime_t end = 0;
>> int err;
>>
>> lockdep_assert_held_write(&vm->lock);
>>
>> retry:
>> + xe_bo_put(bo);
>> + bo = NULL;
>> +
>> /* Always process UNMAPs first so the view of SVM ranges is current */
>> err = xe_svm_garbage_collector(vm);
>> if (err)
>> @@ -534,6 +595,22 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>> if (xe_svm_range_is_valid(range, tile))
>> return 0;
>>
>> + /* XXX: Add migration policy, for now migrate range once */
>> + if (IS_DGFX(vm->xe) && !range->migrated &&
>> + range->base.flags.migrate_vram &&
>> + (range->base.va.end - range->base.va.start) >= SZ_64K) {
>> + range->migrated = true;
>> +
>> + bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
>> + if (IS_ERR(bo)) {
>> + drm_info(&vm->xe->drm,
>> + "VRAM allocation failed, falling back to retrying, asid=%u, errno %ld\n",
>> + vm->usm.asid, PTR_ERR(bo));
>> + bo = NULL;
>> + goto retry;
>> + }
>> + }
>> +
>> err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
>> if (err == -EFAULT || err == -EPERM) /* Corner case where CPU mappings have changed */
>> goto retry;
>> @@ -567,6 +644,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>> dma_fence_put(fence);
>>
>> err_out:
>> + xe_bo_put(bo);
>> +
>> return err;
>> }
>>
>> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
>> index 8b72e91cc37d..3f432483a230 100644
>> --- a/drivers/gpu/drm/xe/xe_svm.h
>> +++ b/drivers/gpu/drm/xe/xe_svm.h
>> @@ -18,6 +18,7 @@ struct xe_svm_range {
>> struct list_head garbage_collector_link;
>> u8 tile_present;
>> u8 tile_invalidated;
>> + u8 migrated :1;
>> };
>>
>> int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
>> --
>> 2.34.1
>>