[RFC PATCH 23/28] drm/xe: Add SVM VRAM migration

Mon Sep 2 12:48:55 UTC 2024

On Thu, Aug 29, 2024 at 10:12:53PM +0000, Matthew Brost wrote:
> On Thu, Aug 29, 2024 at 01:02:54PM +0200, Daniel Vetter wrote:
> > On Thu, Aug 29, 2024 at 11:53:58AM +0200, Thomas Hellström wrote:
> > > But as Sima pointed out in private communication, exhaustive eviction
> > > is not really needed for faulting to make (crawling) progress.
> > > Watermarks and VRAM trylock shrinking should suffice, since we're
> > > strictly only required to service a single gpu page granule at a time.
> > > 
> > > However, ordinary bo-based jobs would still like to be able to
> > > completely evict SVM vram. Whether that is important enough to strive
> > > for is ofc up for discussion.
> > 
> > My take is that you don't win anything for exhaustive eviction by having
> > the dma_resv somewhere in there for svm allocations. Roughly for split lru
> > world, where svm ignores bo/dma_resv:
> > 
> > When evicting vram from the ttm side we'll fairly switch between selecting
> > bo and throwing out svm pages. With drm_exec/ww_acquire_ctx selecting bo
> > will eventually succeed in vacuuming up everything (with a few retries
> > perhaps, if we're not yet at the head of the ww ticket queue).
> > 
> > svm pages we need to try to evict anyway - there's no guarantee, becaue
> > the core mm might be holding temporary page references (which block
> 
> Yea, but think you can could kill the app then - not suggesting we
> should but could. To me this is akin to a CPU fault and not being able
> to migrate the device pages - the migration layer doc says when this
> happens kick this to user space and segfault the app.
> 
> My last patch in the series adds some asserts to see if this ever
> happens, thus far never. If it occurs we could gracefully handle it by
> aborting the migration I guess... I think the user really needs to
> something a bit crazy to trigger this condition - I don't think the core
> randomly grabs refs to device pages but could be wrong.

I think it does :-/

If you read do_swap_page around ->migrate_to_ram:

	get_page(vmf->page);
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
	put_page(vmf->page);

Also the migrate code itself does lock pages. So unless we toss in
additional locking on top of what core mm does (which I think should be
enough to cover migration), migrations will temporarily fail. And this is
just for multiple threads trying to get the same page back to sram, which
I think is a case we should support because the application did nothing
wrong.

> > migration) or have the page locked (which also block the migration). But
> > as long as those two steps succeed, we'll win and get the pages. There
> > might be some thrashing against concurrent svm faults stealing them again,
> > but they have a disadvantage since they can't steal dma_resv_locked bo.
> > And if it's still too much we can stall them in the page allocator.
> > 
> > So it's not entirely reliable, but should be close enough.
> > 
> > Now for bo based svm the picture isn't any different, because holding
> > dma_resv is not actually enough to migrate svm mappings. We still need to
> > hope there's no temporary page references around, and we still need to
> > succeed at locking the page. And the migration code only does trylocks,
> > because that's it's deadlock prevent algorithm if different migrations
> > needing the same set of pages, but acquiring them in a different order. So
> > we win nothing.
> 
> Ok, maybe my statement above is false...
> 
> Wouldn't be the only time this falls is if another migration is in
> flight (e.g. CPU fault) and they race? Then the eviction will naturally
> happen via refcount being dropped from the other migration. I guess I
> likely need to update my eviction code to not free the TTM resource if
> all pages are not migrated.

Yeah. And additionally core mm relies on some amount of Good Luck here,
plus the assumption that at least falling back to a single page/folio will
work out. At least eventually ...

The trouble is if your design assumes you can migrate an entire block,
because then if threads hammer that range in different orders you'll never
make forward progress. Because the core mm code doesn't have a fancy ww
locking scheme to get out of this, but only uses trylock, plus the
assumption that falling back to a single page will work out eventually.

Wrt TTM resource refcounting, I think that all looks ok. But maybe I
checked the wrong things.

> > Worse, if dma_resv does actually hold up svm migration and reclaim, then
> > we potentially deadlock because that lock is for a bigger range than
> > individual pages (or folios). And the core mm assumes that it can get out
> > of a deadlock bind by (at least stochastically) eventually succeeding in
> > acquiring/locking down a single page.
> > 
> > This means we cannot use dma_resv tricks to give the ttm world an
> > advantage in exhaustive eviction against concurrent svm faults. Or at
> > least not more than we can do without by just stalling svm faults that
> > need to allocate gpu memory (but that must happen without holding locks or
> > we're busted).
> > 
> 
> I'm a little lost here on the deadlock case. Do you mean when we try to
> evict SVM BO we trigger reclaim by allocating system pages and can
> deadlock? Doesn't TTM already have this dependency when evicting non-SVM
> BOs?

So you can have multiple cpu threads hammering a given svm range. And
thanks to the lols of mremap and fork each of them can have a different
view of that range (they are all obviously different processes from the
one that has created the gpusvm binding). And if you try to migrate, they
might all grab the pages in different orders, which can deadlock.

That's why there's so much retrying and also why core mm only does trylock
on pages if it grabs an entire pile.

Now if you have a lock that nests within the page lock you need to trylock
it, or it deadlocks. Which kinda defeats the point of having a bigger lock
and moving the entire bo as a unit.

But if that is outside of the page lock (like amdgpu), you still have the
issue of the elevated page reference from do_swap_page. Which also blocks
migration.

Note that neither is a hard deadlock, as in lockdep complaints, because
they're all retrying anyway. They're more like lifelocks, and the bigger
your pile of pages the more likely that you'll always have a failed page
and need to abort and retry. Which results in threads spinning forever.

> > So the only benefit I'm seeing is the unified lru, which I'm not sure is
> > worth it. There's also a bit a lru design tension here, because for the bo
> 
> Well also not rewriting the world...

Yeah it's tough. I'm still at the "understanding all the tradeoffs" stage,
just to make that clear.
-Sima

> Matt
> 
> > world we want objects that are locked to stay on the lru, so that the
> > competing processes can figure out who has the winning ww ticket. The core
> > mm design otoh does isolate pages and remove them from the lru when
> > they're acquired, so that they don't gunk up other processes from trying
> > to make forward progress and are better hidden. Which reduces temporary
> > page references (from lru walk) preventing migration and stuff like that.
> > 
> > Cheers, Sima
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch