[RFC PATCH 23/28] drm/xe: Add SVM VRAM migration
Christian König
christian.koenig at amd.com
Thu Aug 29 14:30:11 UTC 2024
On 29.08.24 at 11:53, Thomas Hellström wrote:
> Hi, Christian,
>
> On Thu, 2024-08-29 at 11:24 +0200, Christian König wrote:
> ...
>
>>>> - Unified eviction is required (SVM VRAM and TTM BOs need to be able to
>>>>   evict each other).
>>> So core mm handles this by just roughly equally shrinking everything.
>>> Seems to work, and it has a pile of object shrinkers, and the page lru
>>> is also split into page cache and anon memory.
>>>
>>> I think you need to put in more justification that unified eviction is
>>> required than just stating it, because a look at mm/ gives a very
>>> well-established counterexample.
>>>
>>>> - For exhaustive eviction [1], SVM VRAM allocations will almost
>>>>   certainly require a dma-resv.
>>> So from the TTM side we need exhaustive eviction, or at least something
>>> a bit more exhaustive than what ttm currently has. Note that i915-gem
>>> also never really got to perfect exhaustive eviction, it's just a pile
>>> better than ttm right now.
>> Please define what exhaustive eviction is supposed to mean? I think I
>> know what it is, and I have been pushing TTM in the direction of solving
>> this for years.
> We internally use "exhaustive eviction" to mean that a client is always
> guaranteed to eventually make progress in obtaining non-pinned vram,
> typically by incrementally locking and keeping dma-resvs across a single
> validation, including validations during buffer object allocations.
>
>> The last missing puzzle piece is to use drm_exec for TTM evictions,
> and IMO keeping the dma-resv locks grabbed during eviction until at
> least one unit of progress (one validation) has succeeded.
Yes, exactly that. My guessed understanding was actually correct.
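To make that concrete, here is a rough sketch (not existing code) of what
such a drm_exec based validation with "keep the locks until one unit of
progress" could look like. xe_evict_candidates() and xe_validate_one() are
made-up placeholders, not existing xe functions:

#include <drm/drm_exec.h>
#include <drm/drm_gem.h>

/* Hypothetical helpers, only for illustration: */
int xe_evict_candidates(struct drm_exec *exec, struct drm_gem_object *obj);
int xe_validate_one(struct drm_gem_object *obj);

static int xe_validate_with_exhaustive_eviction(struct drm_gem_object *obj)
{
        struct drm_exec exec;
        int ret;

        drm_exec_init(&exec, DRM_EXEC_INTERRUPTIBLE_WAIT, 0);
        drm_exec_until_all_locked(&exec) {
                /* Lock the object we want to validate. */
                ret = drm_exec_lock_obj(&exec, obj);
                drm_exec_retry_on_contention(&exec);
                if (ret)
                        goto out;

                /*
                 * Keep locking eviction candidates under the same drm_exec
                 * context, so the VRAM they free cannot be stolen before
                 * our validation has completed.
                 */
                ret = xe_evict_candidates(&exec, obj);
                drm_exec_retry_on_contention(&exec);
                if (ret)
                        goto out;
        }

        /* One unit of progress: all dma-resvs are still held here. */
        ret = xe_validate_one(obj);
out:
        drm_exec_fini(&exec);
        return ret;
}

The point being that nothing is unlocked between evicting and validating,
so the progress can't be stolen by a concurrent client.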
>
>> but apart from that everything should work now.
>>
>>
>> Regards,
>> Christian.
> But as Sima pointed out in private communication, exhaustive eviction
> is not really needed for faulting to make (crawling) progress.
> Watermarks and VRAM trylock shrinking should suffice, since we're
> strictly only required to service a single gpu page granule at a time.
Yeah, fault-based memory management should be able to keep working as long
as the page isn't re-migrated before you make any progress. Since the number
of VRAM or system memory pages is very high, that should basically never
happen.
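For illustration, a rough sketch of what such watermark plus trylock
shrinking could look like on the fault path; none of these structures or
helpers exist in xe, and xe_svm_evict_chunk() is assumed to migrate the
chunk back to sram and update lru->free:

#include <linux/dma-resv.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct xe_svm_vram_lru {                /* hypothetical */
        struct list_head list;          /* LRU of svm vram chunks */
        spinlock_t lock;
        u64 free;                       /* bytes currently free */
        u64 low_watermark;              /* start shrinking below this */
};

struct xe_svm_vram_chunk {              /* hypothetical */
        struct list_head lru_link;
        struct dma_resv *resv;
};

bool xe_svm_evict_chunk(struct xe_svm_vram_chunk *chunk);  /* hypothetical */

static bool xe_svm_vram_shrink_one(struct xe_svm_vram_lru *lru)
{
        struct xe_svm_vram_chunk *chunk, *victim = NULL;

        spin_lock(&lru->lock);
        list_for_each_entry(chunk, &lru->list, lru_link) {
                /* trylock only: contended chunks are simply skipped */
                if (dma_resv_trylock(chunk->resv)) {
                        list_del_init(&chunk->lru_link);
                        victim = chunk;
                        break;
                }
        }
        spin_unlock(&lru->lock);

        if (!victim)
                return false;

        xe_svm_evict_chunk(victim);     /* may sleep, updates lru->free */
        dma_resv_unlock(victim->resv);
        return true;
}

/* Fault path: one page granule is enough, so crawl until above the watermark. */
static int xe_svm_vram_reserve(struct xe_svm_vram_lru *lru, u64 size)
{
        while (READ_ONCE(lru->free) < lru->low_watermark + size)
                if (!xe_svm_vram_shrink_one(lru))
                        return -ENOMEM;
        return 0;
}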
> However, ordinary bo-based jobs would still like to be able to
> completely evict SVM vram. Whether that is important enough to strive
> for is ofc up for discussion.
Yes, exactly that. Felix, Alex, a bunch of other AMD folks and I came to the
same conclusion internally at AMD as well.
Regards,
Christian.
>
> /Thomas
>
>
>
>>> Now if there's also SVM VRAM managed on a page lru, TTM exhaustive
>>> eviction is going to win because the shrinkers can only trylock
>>> dma_resv. So this part works. It actually works so well on the system
>>> memory side that if we're not careful we can trigger oom, because we're
>>> too good at getting at all the memory.
>>>
>>> SVM VRAM allocations otoh do not need exhaustive eviction. Or at least I
>>> don't see why, because the idea is that thanks to gpu and cpu page
>>> faults, you can always get out of a pinch by just thrashing everything
>>> for a while and migrating the handful of available pages a lot.
>>>
>>>> - Likely allocation size is 2M, which makes the size of a BO (872 bytes)
>>>>   acceptable overhead per allocation (872 / 2M == .0004158, i.e. ~0.04%).
>>>>
>>>> With this, using a TTM BO for the VRAM backing store seems to be an
>>>> obvious choice, as it allows leveraging the TTM eviction code.
>>> Except it requires that you hold dma_resv, which brings in all kinds of
>>> pain. And for eviction we really don't need a lot of synchronization, so
>>> a lot of that locking is not needed, unlike the case where we have a cpu
>>> fault, where we absolutely need mmap_lock and all that to make sure we
>>> fault in the right page.
>>>
>>> But for eviction we only need to throw out some pages; if we're not
>>> entirely precise in picking the right ones (or have no idea which vma
>>> they're all currently mapped into), it doesn't matter. That's why
>>> migrate_device_pages doesn't care about any of that at all, it doesn't
>>> need to by design. But with bo backing memory you drag in all that stuff
>>> that's causing headaches for eviction.
>>>
>>> The only thing migration tries to do is remove all ptes, and if that
>>> succeeds, move the page. Specialized for the gpusvm case, using the mm/
>>> code as a cheat sheet, we need roughly:
>>>
>>> - a reverse mapping structure like anon_vma. Except gpusvm can assume
>>>   that there's currently only one gpu-side mapping, so we can just stuff
>>>   the gpusvm and va_address into the page and protect it with the page
>>>   lock (a rough sketch of this follows after the list).
>>>
>>> - we need pagetable locks, so that we can manipulate pagetables (well,
>>>   specifically make ptes invalid) without taking any other locks.
>>>
>>> - everyone else inserting or removing ptes for svm mappings also needs
>>>   to lock the page, or we have races. This might be the hmm_range_fault
>>>   races you're seeing when allowing vram pages, since I don't think
>>>   there's anything else stopping the page lookup from otherwise
>>>   succeeding.
>>>
>>> - we might also need to stuff migrate ptes into the gpu side, like the
>>>   cpu does, to hold up refaults before the migration has finished. But I
>>>   think those are only needed for anon memory in sram, because there's
>>>   no other way to find the right page than swap pte entries, of which
>>>   migration entries are a special case.
>>>
>>> - core code also expects us to handle the page refcount correctly for
>>>   svm device memory, so we can't free the pages directly to drm_buddy
>>>   like normal bo pages.
>>>
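As an illustration of the reverse-map point above, a rough sketch (not
existing gpusvm code) of stuffing the single gpu-side mapping into the
ZONE_DEVICE page under the page lock; struct xe_svm_rmap and the helpers
are made up:

#include <linux/mm.h>
#include <linux/mm_types.h>

struct drm_gpusvm;

struct xe_svm_rmap {                    /* hypothetical */
        struct drm_gpusvm *gpusvm;      /* the single gpu-side mapper */
        unsigned long va;               /* gpu virtual address of the mapping */
};

/* Publish the reverse mapping; caller must hold lock_page(page). */
static void xe_svm_page_set_rmap(struct page *page, struct xe_svm_rmap *rmap)
{
        VM_BUG_ON_PAGE(!PageLocked(page), page);
        page->zone_device_data = rmap;
}

/* Eviction/migration: look up who maps this device page, under the page lock. */
static struct xe_svm_rmap *xe_svm_page_rmap(struct page *page)
{
        VM_BUG_ON_PAGE(!PageLocked(page), page);
        return page->zone_device_data;
}

Eviction would then trylock the page, read the rmap, zap the single gpu pte
under the pagetable lock, and hand the page back.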
>>> Now typing this all up will look an awful lot like what you have, with
>>> the dma_resv lock serving as the page lock and the pagetable lock. The
>>> only reason to do it this way is that these locks are much smaller and
>>> nest within all the other stuff going on, and so avoid the inversion
>>> issues.
>>>
>>> So one annoying part is that this is a lot of pointless-looking typing.
>>> The other is that it's full of races, because core mm really is yolo all
>>> the way down. So there are lots of ways you lock the wrong page and fun
>>> stuff like that, but the few cases that matter work:
>>>
>>> - svm fault handling with hmm_range_fault retries with mmu notifiers.
>>>   Note that we need to have the vram pages locked, and the notifier
>>>   retry check needs to be under the pagetable lock, or there's room to
>>>   escape (a rough sketch of this retry loop follows after the list). At
>>>   least that's what I came up with last time I thought it all through.
>>>
>>> - migrate_to_ram: it will hold a page reference which we know was the
>>>   valid vram page when the cpu pte was locked, but it might not be
>>>   anymore. So we have to lock the page and check whether it's still gpu
>>>   mapped, and if not retry the entire fault, since most likely another
>>>   migrate_to_ram has succeeded in parallel meanwhile.
>>>
>>> - for eviction we don't care, we might actually be migrating a page no
>>>   one even wants anymore.
>>>
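To make the first point a bit more concrete, a rough sketch of that retry
loop using the stock mmu_interval_notifier / hmm_range_fault API (it only
shows the notifier-seq part, not the vram page locking). gpusvm_ptl and
xe_svm_bind_range() are hypothetical stand-ins for the driver's pagetable
lock and pte-writing step, the caller holds a reference on mm and has set
up range->start/end/hmm_pfns:

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

int xe_svm_bind_range(struct hmm_range *range);  /* hypothetical: write gpu ptes */

static int xe_svm_fault_and_bind(struct mm_struct *mm,
                                 struct mmu_interval_notifier *notifier,
                                 struct hmm_range *range,
                                 spinlock_t *gpusvm_ptl)
{
        int err;

        range->notifier = notifier;
again:
        range->notifier_seq = mmu_interval_read_begin(notifier);

        mmap_read_lock(mm);
        err = hmm_range_fault(range);           /* may fault pages in */
        mmap_read_unlock(mm);
        if (err == -EBUSY)
                goto again;
        if (err)
                return err;

        spin_lock(gpusvm_ptl);
        /*
         * The retry check must sit under the pagetable lock: if a notifier
         * invalidation ran after hmm_range_fault(), we must not install
         * stale ptes.
         */
        if (mmu_interval_read_retry(notifier, range->notifier_seq)) {
                spin_unlock(gpusvm_ptl);
                goto again;
        }

        err = xe_svm_bind_range(range);
        spin_unlock(gpusvm_ptl);

        return err;
}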
>>> Now I think you can get all this done with the dma_resv lock and maybe
>>> the bo refcount. But it does involve a tremendous amount of headaches
>>> and impedance mismatch, because that's not how page faults and
>>> migrations work in core mm.
>>>
>>> Cheers, Sima
>>>
>>>> Current migration policy is to migrate any SVM range greater than or
>>>> equal to 64k once.
>>>>
>>>> [1] https://patchwork.freedesktop.org/series/133643/
>>>>
>>>> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
>>>> ---
>>>>   drivers/gpu/drm/xe/xe_svm.c | 81 ++++++++++++++++++++++++++++++++++++-
>>>>   drivers/gpu/drm/xe/xe_svm.h |  1 +
>>>>   2 files changed, 81 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
>>>> index 4372c02a341f..fd8987e0a506 100644
>>>> --- a/drivers/gpu/drm/xe/xe_svm.c
>>>> +++ b/drivers/gpu/drm/xe/xe_svm.c
>>>> @@ -217,8 +217,13 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
>>>>  static int __xe_svm_garbage_collector(struct xe_vm *vm,
>>>>                                        struct xe_svm_range *range)
>>>>  {
>>>> +        struct drm_gpusvm_ctx ctx = {};
>>>>          struct dma_fence *fence;
>>>>
>>>> +        /* Evict any pages holding references to vram allocation */
>>>> +        if (range->base.flags.partial_unmap && IS_DGFX(vm->xe))
>>>> +                drm_gpusvm_migrate_to_sram(&vm->svm.gpusvm, &range->base, &ctx);
>>>> +
>>>>          xe_vm_lock(vm, false);
>>>>          fence = xe_vm_range_unbind(vm, range);
>>>>          xe_vm_unlock(vm);
>>>> @@ -504,21 +509,77 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *range,
>>>>          return (range->tile_present & ~range->tile_invalidated) & BIT(tile->id);
>>>>  }
>>>>
>>>> +static struct xe_mem_region *tile_to_mr(struct xe_tile *tile)
>>>> +{
>>>> +        return &tile->mem.vram;
>>>> +}
>>>> +
>>>> +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile,
>>>> +                                       struct xe_svm_range *range,
>>>> +                                       const struct drm_gpusvm_ctx *ctx)
>>>> +{
>>>> +        struct xe_mem_region *mr = tile_to_mr(tile);
>>>> +        struct drm_buddy_block *block;
>>>> +        struct list_head *blocks;
>>>> +        struct xe_bo *bo;
>>>> +        ktime_t end = 0;
>>>> +        int err;
>>>> +
>>>> +retry:
>>>> +        xe_vm_lock(vm, false);
>>>> +        bo = xe_bo_create(tile_to_xe(tile), tile, vm, range->base.va.end -
>>>> +                          range->base.va.start, ttm_bo_type_device,
>>>> +                          XE_BO_FLAG_VRAM_IF_DGFX(tile) |
>>>> +                          XE_BO_FLAG_SYSTEM_ALLOC | XE_BO_FLAG_SKIP_CLEAR);
>>>> +        xe_vm_unlock(vm);
>>>> +        if (IS_ERR(bo)) {
>>>> +                err = PTR_ERR(bo);
>>>> +                if (xe_vm_validate_should_retry(NULL, err, &end))
>>>> +                        goto retry;
>>>> +                return bo;
>>>> +        }
>>>> +
>>>> +        blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks;
>>>> +        list_for_each_entry(block, blocks, link)
>>>> +                block->private = mr;
>>>> +
>>>> +        /*
>>>> +         * Take ref because as soon as drm_gpusvm_migrate_to_vram succeeds the
>>>> +         * creation ref can be dropped upon CPU fault or unmap.
>>>> +         */
>>>> +        xe_bo_get(bo);
>>>> +
>>>> +        err = drm_gpusvm_migrate_to_vram(&vm->svm.gpusvm, &range->base,
>>>> +                                         bo, ctx);
>>>> +        if (err) {
>>>> +                xe_bo_put(bo);  /* Local ref */
>>>> +                xe_bo_put(bo);  /* Creation ref */
>>>> +                return ERR_PTR(err);
>>>> +        }
>>>> +
>>>> +        return bo;
>>>> +}
>>>> +
>>>>  int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>>>>                              struct xe_tile *tile, u64 fault_addr,
>>>>                              bool atomic)
>>>>  {
>>>> -        struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), };
>>>> +        struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma),
>>>> +                .vram_possible = IS_DGFX(vm->xe), };
>>>>          struct xe_svm_range *range;
>>>>          struct drm_gpusvm_range *r;
>>>>          struct drm_exec exec;
>>>>          struct dma_fence *fence;
>>>> +        struct xe_bo *bo = NULL;
>>>>          ktime_t end = 0;
>>>>          int err;
>>>>
>>>>          lockdep_assert_held_write(&vm->lock);
>>>>
>>>>  retry:
>>>> +        xe_bo_put(bo);
>>>> +        bo = NULL;
>>>> +
>>>>          /* Always process UNMAPs first so view of SVM ranges is current */
>>>>          err = xe_svm_garbage_collector(vm);
>>>>          if (err)
>>>> @@ -534,6 +595,22 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>>>>          if (xe_svm_range_is_valid(range, tile))
>>>>                  return 0;
>>>>
>>>> +        /* XXX: Add migration policy, for now migrate range once */
>>>> +        if (IS_DGFX(vm->xe) && !range->migrated &&
>>>> +            range->base.flags.migrate_vram &&
>>>> +            (range->base.va.end - range->base.va.start) >= SZ_64K) {
>>>> +                range->migrated = true;
>>>> +
>>>> +                bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
>>>> +                if (IS_ERR(bo)) {
>>>> +                        drm_info(&vm->xe->drm,
>>>> +                                 "VRAM allocation failed, falling back to retrying, asid=%u, errno %ld\n",
>>>> +                                 vm->usm.asid, PTR_ERR(bo));
>>>> +                        bo = NULL;
>>>> +                        goto retry;
>>>> +                }
>>>> +        }
>>>> +
>>>>          err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
>>>>          if (err == -EFAULT || err == -EPERM)    /* Corner where CPU mappings have changed */
>>>>                  goto retry;
>>>> @@ -567,6 +644,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>>>>          dma_fence_put(fence);
>>>>
>>>>  err_out:
>>>> +        xe_bo_put(bo);
>>>> +
>>>>          return err;
>>>>  }
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
>>>> index 8b72e91cc37d..3f432483a230 100644
>>>> --- a/drivers/gpu/drm/xe/xe_svm.h
>>>> +++ b/drivers/gpu/drm/xe/xe_svm.h
>>>> @@ -18,6 +18,7 @@ struct xe_svm_range {
>>>>          struct list_head garbage_collector_link;
>>>>          u8 tile_present;
>>>>          u8 tile_invalidated;
>>>> +        u8 migrated :1;
>>>>  };
>>>>
>>>>  int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
>>>> --
>>>> 2.34.1