[RFC PATCH 23/28] drm/xe: Add SVM VRAM migration

Mon Sep 2 13:02:33 UTC 2024

On Thu, Aug 29, 2024 at 09:48:11PM +0000, Matthew Brost wrote:
> On Thu, Aug 29, 2024 at 11:24:26AM +0200, Christian König wrote:
> >    Am 28.08.24 um 18:06 schrieb Daniel Vetter:
> > 
> 
> A lot to unpack here. Will try to address as much as I can in this
> single reply to both of you (Daniel, Christian).
> 
> > On Tue, Aug 27, 2024 at 07:48:56PM -0700, Matthew Brost wrote:
> > 
> > Migration is implemented with range granularity, with VRAM backing being
> > a VM private TTM BO (i.e., shares dma-resv with VM). The lifetime of the
> > TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM
> > SVM range is migrated to SRAM, the TTM BO is destroyed).
> > 
> > The design choice for using TTM BO for VRAM backing store, as opposed to
> > direct buddy allocation, is as follows:
> > 
> > - DRM buddy allocations are not at page granularity, offering no
> >   advantage over a BO.
> > 
> > This one I'm not understanding.
> >
> >    Adding Arun as well. I couldn't understand it fully either, but maybe
> >    it's because the buddy allocator is more optimized for higher orders of
> >    allocations?
> > 
> 
> As currently written BO VRAM allocation resolves to a DRM buddy
> allocation. Unless there is memory pressure this is likely 1 buddy block
> if the allocation is aligned (SVM should basically also be doing aligned
> allocations which the common case of 2M at a time).
> 
> It was suggested in earlier revs by a colleague of mine allocating
> directly from DRM buddy pool provided a benefit wrt to freeing a page at
> a time. It doesn't given even if you bypass a BO most likely you are
> going to get 1 buddy block which is larger than a page. In either case
> you need to ref count the allocation or do some wild spliting algorithm
> (I don't want to that). Or alternatively write a new buddy allocator
> which easily code with freeing a page at a time (I don't want to that).
> 
> Lastly, the common case for getting dev_pagemap_ops.page_free is going
> to be consecutive calls spanning the entire allocation (e.g. eviction or
> CPU fault which triggers migration).
> 
> > 
> > 
> > - DRM buddy allocations do not solve locking inversion problems between
> >   mmap lock and dma-resv locks.
> > 
> > Which mmap -> dma_resv inversion? I've seen a lot ... I guess it also
> > matters hugely which migration path we're in, i.e. opportunistic
> > migration, cpu fault where we have to migrate or die, or when we run out
> > of vram and need to evict stuff to make space.
> > 
> >    Mhm I think the locking order between mmap lock and dma-resv lock is
> >    well defined since dma_resv_lockdep() was added.
> >
> 
> Yes. Also solved the inversion issue by using migrate_device_*. At one
> point had trylocking of mmap lock (still kinda there) but have agreed
> based on Daniel's feedback to rip the out.
>  
> > - Unified eviction is required (SVM VRAM and TTM BOs need to be able to
> >   evict each other).
> > 
> > So core mm handles this by just roughly equally shrinking everything.
> > Seems to work, and it has a pile of object shrinkers, and the page lru is
> > also split into page cache and anon memory.
> > 
> > I think you need to put in more justification that unified eviction is
> > required than just stating it, because a look at mm/ gives a very well
> > established counterexample.
> > 
> > 
> > - For exhaustive eviction [1], SVM VRAM allocations will almost certainly
> >   require a dma-resv.
> > 
> > So from the TTM side we need exhaustive eviction, or at least something a
> > bit more exhaustive than what ttm currently has. Note that i915-gem also
> > never really got to perfect exhaustive eviction, it's just a pile better
> > than ttm right now.
> > 
> >    Please define what exhaustive eviction should mean? I think I know what
> >    it is and I have been pushing TTM into the direction of solving this
> >    for years.
> >    The last missing puzzle piece is to use drm_exec for TTM evictions, but
> >    apart from that everything should work now.
> >    Regards,
> >    Christian.
> 
> I think Thomas has defined this in his replies. He also touches on our
> SVM design allows mixing user BO mappings and SVM mappings within the
> same VM. These need to be able to fairly evict each other. A dma-resv
> lock provides a level of fairness and ensure forward progress once a
> flavor of his series lands.
> 
> Also worth nothing in addition to user BOs, we have kernel BOs (for page
> table, user exec queues, etc...) in Xe which absolutely need to be able
> to evict something or the application dies.
> 
> > 
> > Now if there's also SVM VRAM managed on a page lru, TTM exhaustive
> 
> Page LRU isn't used for device pages from my understanding.

Yeah, we'd need to manage that ourselves. We could use exactly what the
core mm is doing, I haven't found anything that prohibits that. I think
core mm simply doesn't maintain zone device lru because it's not involved
in any device side access.

> > eviction is going to win because the shrinkers can only trylock dma_resv.
> > So this part works. It actually works so well on the system memory side
> > that if we're not careful we can trigger oom, because we're too good at
> > getting at all the memory.
> > 
> > SVM VRAM allocations otoh do not need exhaustive evictions. Or at least I
> > don't see why, because the idea is that thanks to gpu and cpu page faults,
> > you can always get out of a pinch by just trashing everything for a while
> > and migrating the handfull of available pages a lot.
> > 
> > 
> > - Likely allocation size is 2M which makes of size of BO (872)
> >   acceptable per allocation (872 / 2M == .0004158).
> > 
> > With this, using TTM BO for VRAM backing store seems to be an obvious
> > choice as it allows leveraging of the TTM eviction code.
> > 
> > Except it requires that you hold dma_resv, which brings in all kinds of
> > pain. And for eviction we really don't need a lot of synchronization, so a
> 
> Yes, but I think I have solved all those issues wrt to dma-resv.
> 
> What is really the alternative here? Teaching TTM to evict non-BO SVM
> allocations? Writing an SVM VRAM allocator which ends up looking also
> exactly like TTM and teaching it to evict TTM BOs? The later case we'd
> still need to grab dma-lock...

Yup.

> Do we write a new page based buddy allocator and wire that to TTM if SVM
> could possibly be used?

Well rebase it on top of a page array, like the buddy allocator in core mm
would be my first idea.

> This would be tons of code and not really sure what ROI is here.
> 
> > lot of that locking is not needed, unlike the case where we have a cpu
> > fault, where we absolutely need mmap_lock and all that to make sure we
> > fault in the right page.
> > 
> > But for eviction we only need to throw out some pages, if we're not
> > entirely precise with picking the right ones (or have no idea into which
> > vma they're all currently mapped into) it doesn't matter. That's why
> > migrate_device_pages doesn't care about any of that at all, it doesn't
> > need to by design. But by bo backing memory you drag in all that stuff
> > that's causing headacheds for eviction.
> > 
> > The only thing migration tries to do is remove all pte, and if that
> > succeeds, move the page. Specialized for the gpusvm case, looking at mm/
> > code as cheat sheet, we need roughly:
> > 
> > - reverse mapping structure like anon_vma. Except gpusvm can assume that
> >   there's currently only one gpu side mapping, so we can just stuff the
> >   gpusvm an va_address into the page, and protect it with the page lock.
> > 
> > - we need pagetable locks, so that we can manipulate pagetables (well
> >   specifically make ptes invalid) without taking any other locks.
> > 
> > - everyone else inserting or removing ptes for svm mappings also needs to
> >   lock the page, or we have races. This might be the hmm_range_fault races
> >   you're seeing when allowing vram pages, since I don't think there's
> >   anything else stopping the page lookup otherwise from succeeding.
> 
> AMD looks take the range->migration_mutex to prevent races.
> 
> > 
> > - we might also need to stuff migrate ptes into the gpu side, like the cpu
> >   does, to hold up refaults before the migration has finished. But I think
> >   those are only needed for anon memory in sram because there's no other
> >   way to find the right page than swap pte entries, of which migration
> >   entries are a special case.
> > 
> > - core code also expects us to handle the page refcount correctly for svm
> >   device memory, so we can't free the pages like normal bo pages either
> >   directly to drm_buddy.
> > 
> > Now typing this all up will look an awful lot like what you have, with the
> > dma_resv lock serving as the page lock and the pagetable lock. The only
> 
> dma_resv is indeed one of locks we need for page table updates (binds)
> as we allocate TTM BOs for page tables and we install fences for binds
> in dma-resv slots (certainly for non-SVM, might be able to drop that for
> SVM).

So the way this is solved on the core mm side is with two tricks:

- Page faults race entirely, and races are only resolved at pte insertion
  time when you acquire the pagetable lock. If there's anything else than
  a page-not-present pte, you've raced and bail out.

- Pagetables are allocated upfront, with the same trick: If someone else
  was faster, you bail out. Pagetables are never reclaimed for core mm
  code, so that avoids someone else nuking it meanwhile. At least while
  the vma mapping stays valid.

  I'm not sure we can entirely emulate that design with gpusvm.

And yeah there would be a substantial difference in code between the bo
and the svm world.

> > reason is that these locks are much smaller and nest within all the other
> > stuff going on and so avoid the inversion issues.
> > 
> > So one annoying part is that this is a lot of pointlessly looking typing.
> > The other is that it's full of races, because core mm really is yolo all
> > the way down. So lots of ways you lock the wrong page and fun stuff like
> > that, but the few cases that matter work:
> > 
> > - svm fault handling with hmm_range fault retries with mmu notifiers. Note
> >   that we need to have vram pages locked and the notifier retrie needs to
> >   be under the pagetable lock, or there's room to escape. At least that's
> >   what I came up with last time I thought it all through.

Correction: at fault time core mm does not lock pages. Just elevated
refcount is enough.

> We grab the gpusvm->notifier lock just be committing the bind and check
> for retry. If we need to retry we completely unwind all locks and
> restart the GPU fault.

Yeah core mm does the same.

> > - migrate_to_ram: it will hold a page reference which we know was the
> >   valid vram page when the cpu pte was locked, but it might not be it
> >   anymore. So we have to lock the page and check whether it's still gpu
> >   mapped, and if not retry the entire fault since most likey another
> >   migrate_to_ram has succeed meanwhile in parallel.
> > 
> > - for eviction we don't care, we might actually be migrating a page no one
> >   even wants anymore.
> > 
> > Now I think you can get all this done with the dma_resv lock and maybe the
> > bo refcount. But it does involve a tremendous amount of headaches and
> 
> I don't the headaches are too bad...
> 
> > impendence mismatch, because that's not how page faults and migrations
> > work in core mm.
> 
> Agree there is a bit of impendence mismatch but see above I can't really
> think of a better solution without thousands of lines of new code and
> invavise changes across the subsystem.
> 
> What I have in place appears to work with very little code changes to Xe
> and none to TTM. AMD also landed on a BO likely for similar reasons I
> have laid out.

My understanding of the history is that large chunks of the gpusvm
features where retrofitted, without updating the design. So I'm not
putting that much weight on amdkfd as a good solution, it's just the
obvious incremental solution.

Cheers, Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch