[PATCH 00/35] Add HMM-based SVM memory manager to KFD

Daniel Vetter daniel at ffwll.ch
Thu Jan 14 10:39:15 UTC 2021


On Thu, Jan 14, 2021 at 10:26 AM Daniel Vetter <daniel at ffwll.ch> wrote:
>
> On Thu, Jan 14, 2021 at 4:27 AM Jerome Glisse <jglisse at redhat.com> wrote:
> >
> > On Wed, Jan 13, 2021 at 09:31:11PM +0100, Daniel Vetter wrote:
> > > On Wed, Jan 13, 2021 at 5:56 PM Jerome Glisse <jglisse at redhat.com> wrote:
> > > > On Fri, Jan 08, 2021 at 03:40:07PM +0100, Daniel Vetter wrote:
> > > > > On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
> > > > > > Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
> > > > > > > On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
> > > > > > >> This is the first version of our HMM based shared virtual memory manager
> > > > > > >> for KFD. There are still a number of known issues that we're working through
> > > > > > >> (see below). This will likely lead to some pretty significant changes in
> > > > > > >> MMU notifier handling and locking on the migration code paths. So don't
> > > > > > >> get hung up on those details yet.
> > > > > > >>
> > > > > > >> But I think this is a good time to start getting feedback. We're pretty
> > > > > > >> confident about the ioctl API, which is both simple and extensible for the
> > > > > > >> future. (see patches 4,16) The user mode side of the API can be found here:
> > > > > > >> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
> > > > > > >>
> > > > > > >> I'd also like another pair of eyes on how we're interfacing with the GPU VM
> > > > > > >> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25),
> > > > > > >> and some retry IRQ handling changes (32).
> > > > > > >>
> > > > > > >>
> > > > > > >> Known issues:
> > > > > > >> * won't work with IOMMU enabled, we need to dma_map all pages properly
> > > > > > >> * still working on some race conditions and random bugs
> > > > > > >> * performance is not great yet
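For the IOMMU point above, roughly what the dma mapping step could look like (illustrative only, not the actual KFD patches; it assumes the system pages came out of hmm_range_fault() and that dev and the dma_addr array are managed by the driver):

#include <linux/hmm.h>
#include <linux/dma-mapping.h>

/*
 * Sketch: dma-map the system pages backing an SVM range so the device
 * can reach them behind an IOMMU.  The GPU page tables would then be
 * programmed with dma_addr[i] instead of raw pfns.
 */
static int svm_dma_map_pages(struct device *dev, struct hmm_range *range,
			     dma_addr_t *dma_addr, unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; i++) {
		struct page *page = hmm_pfn_to_page(range->hmm_pfns[i]);

		dma_addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
					   DMA_BIDIRECTIONAL);
		if (dma_mapping_error(dev, dma_addr[i]))
			return -EFAULT;
	}

	return 0;
}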
> > > > > > > Still catching up, but I think there's another one for your list:
> > > > > > >
> > > > > > >  * hmm gpu context preempt vs page fault handling. I've had a short
> > > > > > >    discussion about this one with Christian before the holidays, and also
> > > > > > >    some private chats with Jerome. It's nasty since there's no easy fix,
> > > > > > >    much less a good idea of what the best approach is here.
> > > > > >
> > > > > > Do you have a pointer to that discussion or any more details?
> > > > >
> > > > > Essentially, if you're handling an hmm page fault from the gpu, you can
> > > > > deadlock by calling dma_fence_wait on (a chain of, possibly) other command
> > > > > submissions or compute contexts. That deadlocks if you can't preempt
> > > > > while you have that page fault pending. Two solutions:
> > > > >
> > > > > - your hw can (at least for compute ctx) preempt even when a page fault is
> > > > >   pending
> > > > >
> > > > > - lots of screaming in trying to come up with an alternate solution. They
> > > > >   all suck.
> > > > >
> > > > > Note that the dma_fence_wait is a hard requirement, because we need it for
> > > > > mmu notifiers and shrinkers; disallowing it would disable dynamic memory
> > > > > management, which is exactly the current "ttm is self-limited to 50% of
> > > > > system memory" limitation Christian is trying to lift. So that's really not
> > > > > a restriction we can drop, at least not upstream, where we also need to
> > > > > support old style hardware which doesn't have page fault support and
> > > > > really has no other option to handle memory management than
> > > > > dma_fence_wait.
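To make the cycle concrete, a rough sketch (all structure and function names below are made up, this is not real driver code; the point is the dma_fence_wait() sitting in a reclaim path that a pending gpu page fault depends on):

#include <linux/dma-fence.h>

/* invented buffer object, just enough for the sketch */
struct my_bo {
	struct dma_fence *last_use;	/* last GPU use of this buffer */
};

/* shrinker / mmu notifier path: wait for the GPU before freeing pages */
static int evict_one_bo(struct my_bo *bo)
{
	long r = dma_fence_wait(bo->last_use, false);

	if (r < 0)
		return r;

	my_bo_free_pages(bo);	/* made-up helper */
	return 0;
}

static int handle_gpu_page_fault(struct my_compute_ctx *ctx, u64 addr)
{
	/*
	 * Servicing the fault needs memory, so reclaim may end up in
	 * evict_one_bo() for a buffer still busy on another context.  If
	 * that context shares CUs with (or is fenced against) the faulting
	 * one, its fence only signals after this fault is resolved:
	 * deadlock, unless the hw can preempt despite the pending fault.
	 */
	return my_migrate_range_to_vram(ctx, addr);	/* made-up helper */
}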
> > > > >
> > > > > Thread was here:
> > > > >
> > > > > https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1NBM7K0=edA@mail.gmail.com/
> > > > >
> > > > > There's a few ways to resolve this (without having preempt-capable
> > > > > hardware), but they're all supremely nasty.
> > > > > -Daniel
> > > > >
> > > >
> > > > I had a new idea; I wanted to think more about it but haven't yet,
> > > > anyway here it is: add a new callback to dma-fence which asks the
> > > > question "can waiting on this deadlock?". Any time a GPU driver has a
> > > > pending page fault (i.e. something calling into the mm) it answers
> > > > yes, otherwise no. The GPU shrinker would ask the question before
> > > > waiting on any dma-fence and back off if it gets a yes. The shrinker
> > > > can still try the many dma-buf objects for which it does not get a
> > > > yes on the associated fence.
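Something like this, maybe? Purely a sketch of that idea: no such callback exists on dma_fence today, and the extended ops struct and driver ops table are invented names.

#include <linux/dma-fence.h>

/* imagined extension: an op that says whether waiting could deadlock */
struct dma_fence_ops_ext {
	struct dma_fence_ops base;
	/* true while the context that signals this fence has a fault pending */
	bool (*may_deadlock)(struct dma_fence *fence);
};

/* the driver's own ops table, so foreign fences can be told apart */
extern const struct dma_fence_ops_ext my_driver_fence_ops;

static bool fence_may_deadlock(struct dma_fence *fence)
{
	/* foreign fence: no way to ask, conservatively assume the worst */
	if (fence->ops != &my_driver_fence_ops.base)
		return true;

	return !my_driver_fence_ops.may_deadlock ||
	       my_driver_fence_ops.may_deadlock(fence);
}

The shrinker would then call fence_may_deadlock() before any dma_fence_wait() and back off on a "yes".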
> > >
> > > Having that answer on a given fence isn't enough, you still need to
> > > forward that information through the entire dependency graph, across
> > > drivers. That's the hard part, since that dependency graph is very
> > > implicit in the code, and we'd need to first roll it out across all
> > > drivers.
> >
> > Here I am saying: do not wait on fences you are not sure about.
> > Only wait on fences for which you are 100% certain you cannot
> > deadlock. So if you can never be sure about a dma-fence, then never
> > wait on dma-fences in the shrinker. However, most drivers should have
> > enough information in their shrinker to know whether it is safe to
> > wait on fences internal to their device driver (and also to know
> > whether any of those fences have implicit outside dependencies). So
> > the first implementation would be to say "always deadlocks" and then
> > have each driver build confidence in what it can ascertain.
>
> I just don't think that actually works in practice:
>
> - on a single gpu you can't wait for vk/gl due to shared CUs, so only
> sdma and uvd are left (or whatever else is pure fixed function)
>
> - for multi-gpu you get the guessing game of what leaks across gpus
> and what doesn't. With p2p dma-buf we're now leaking dma_fence across
> gpus even when there's no implicit syncing by userspace (although for
> amdgpu this is tricky since iirc it still lacks the flag to let
> userspace decide this, so this is more for other drivers).
>
> - you don't just need to guarantee that there's no dma_fence
> dependency going back to you, you also need to make sure there's no
> other dependency chain through locks or whatever that closes the loop.
> And since your proposal here goes against the dma_fence lockdep
> annotations we have now, lockdep won't help you (and let's be honest,
> review doesn't catch this stuff either, so it's up to hangs in
> production to catch this stuff)
>
> - you still need the full dependency graph within the driver, and only
> i915 scheduler has that afaik. And I'm not sure implementing that was
> a bright idea
>
> - assuming it's a deadlock by default means all gl/vk memory is
> pinned. That's not nice, and in addition you need hacks like ttm's
> "max 50% of system memory" to paper over the worst fallout, which
> Christian is trying to lift. I really do think we need to be able to
> move towards more dynamic memory management, not less.

Forgot one issue:

- somehow you need to transport the knowledge that you're in the gpu
fault repair path of a specific engine down to shrinkers/mmu notifiers
and all that. And it needs to be fairly specific, otherwise it just
amounts again to "no more dma_fence_wait allowed".
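I don't have a good answer for what that transport should look like. As a purely hypothetical sketch, something loosely modelled on memalloc_nofs_save()/memalloc_nofs_restore(); note that the task_struct field and the helpers below are invented, nothing like this exists:

#include <linux/sched.h>

struct gpu_engine;

/* mark the current task as repairing a fault on a specific engine */
static inline struct gpu_engine *gpu_fault_repair_begin(struct gpu_engine *eng)
{
	struct gpu_engine *old = current->gpu_fault_engine;	/* invented field */

	current->gpu_fault_engine = eng;
	return old;
}

static inline void gpu_fault_repair_end(struct gpu_engine *old)
{
	current->gpu_fault_engine = old;
}

A shrinker or mmu notifier could then refuse dma_fence_wait() only for fences that might depend on current->gpu_fault_engine, instead of having to ban dma_fence_wait() outright.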

-Daniel

> So in the end you're essentially disabling shrinking/eviction of other
> gpu tasks, and I don't think that works. I really think the only two
> realistic options are
> - guarantee forward progress of other dma_fence (hw preemption,
> reserved CUs, or whatever else you have)
> - guarantee there's not a single offending dma_fence active in the
> system that could cause problems
>
> Hand-waving that in theory we could track the dependencies and in
> theory do some sort of deadlock avoidance on top of that just doesn't
> look like a pragmatic & practical solution to me here. It feels about
> as realistic as just creating a completely new memory management
> model that sidesteps the entire set of dma_fence issues we have
> due to mixing up kernel memory management and userspace sync fences in
> one thing.
>
> Cheers, Daniel
>
> > > > This does not solve the mmu notifier case; for that you would just
> > > > invalidate the gem userptr object (with a flag, but not releasing the
> > > > page refcount), but you would not wait for the GPU (i.e. no dma-fence
> > > > wait in that code path anymore). The userptr API never really made
> > > > the contract that it will always be in sync with the mm's view of the
> > > > world, so if a different page gets remapped to the same virtual address
> > > > while the GPU is still working with the old pages, it should not be an
> > > > issue (it would not be in our usage of userptr for compositors and
> > > > whatnot).
> > > >
> > > > Maybe I'm overlooking something there.
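For reference, a sketch of what that could look like with the mmu_interval_notifier API (the my_userptr structure and its flag are made up; the point is that the invalidate callback only marks the range stale and never calls dma_fence_wait()):

#include <linux/mmu_notifier.h>

struct my_userptr {
	struct mmu_interval_notifier notifier;
	bool invalid;	/* re-validated on the next command submission */
};

static bool my_userptr_invalidate(struct mmu_interval_notifier *mni,
				  const struct mmu_notifier_range *range,
				  unsigned long cur_seq)
{
	struct my_userptr *upr = container_of(mni, struct my_userptr, notifier);

	mmu_interval_set_seq(mni, cur_seq);

	/*
	 * Mark stale, keep the page refcounts, and do NOT wait for the GPU.
	 * The GPU may keep using the old pages until the next CS
	 * re-validates the range -- which the userptr contract allows.
	 */
	WRITE_ONCE(upr->invalid, true);

	return true;
}

static const struct mmu_interval_notifier_ops my_userptr_ops = {
	.invalidate = my_userptr_invalidate,
};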
> > >
> > > tbh I'm never really clear on how much exactly we need, and whether
> > > maybe the new pin/unpin api should fix it all.
> >
> > pin/unpin is not a solution; it is there to fix something with GUP (where
> > we need to know if a page is GUPed or not). GUP should die long term,
> > so anything using GUP (pin/unpin falls into that) should die long term
> > too. Pinning memory is bad, period (it just breaks too much of the mm
> > and it is unsolvable for things like mremap, splice, ...).
> >
> > Cheers,
> > Jérôme
> >
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

