[PATCH 1/2] drm/ttm: rework ttm_tt page limit v2

Daniel Vetter daniel at ffwll.ch
Thu Dec 17 18:19:07 UTC 2020


On Thu, Dec 17, 2020 at 7:09 PM Jerome Glisse <jglisse at redhat.com> wrote:
>
> Adding a few folks on cc just to raise awareness, and so that I
> can be corrected if I say anything wrong.
>
> On Thu, Dec 17, 2020 at 04:45:55PM +0100, Daniel Vetter wrote:
> > On Thu, Dec 17, 2020 at 4:36 PM Christian König
> > <christian.koenig at amd.com> wrote:
> > > Am 17.12.20 um 16:26 schrieb Daniel Vetter:
> > > > On Thu, Dec 17, 2020 at 4:10 PM Christian König
> > > > <christian.koenig at amd.com> wrote:
> > > >> Am 17.12.20 um 15:36 schrieb Daniel Vetter:
> > > >>> On Thu, Dec 17, 2020 at 2:46 PM Christian König
> > > >>> <ckoenig.leichtzumerken at gmail.com> wrote:
> > > >>>> Am 16.12.20 um 16:09 schrieb Daniel Vetter:
> > > >>>>> On Wed, Dec 16, 2020 at 03:04:26PM +0100, Christian König wrote:
> > > >>>>> [SNIP]
> > > >>>>>> +
> > > >>>>>> +/* As long as pages are available make sure to release at least one */
> > > >>>>>> +static unsigned long ttm_tt_shrinker_scan(struct shrinker *shrink,
> > > >>>>>> +                                      struct shrink_control *sc)
> > > >>>>>> +{
> > > >>>>>> +    struct ttm_operation_ctx ctx = {
> > > >>>>>> +            .no_wait_gpu = true
> > > >>>>> IIRC there's an eventual shrinker priority level where reclaim gets
> > > >>>>> desperate. I think once we hit that, we should allow GPU waits. But that
> > > >>>>> priority isn't passed to shrinkers for reasons, so maybe we should have a
> > > >>>>> second round that tries to shrink objects more actively if we fell
> > > >>>>> substantially short of what reclaim expected us to do?
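> > > >>>>>
> > > >>>>> Roughly what I mean, as a sketch (the ttm_tt_shrink_* helpers are
> > > >>>>> hypothetical; sc->nr_to_scan is the real shrink_control field):
> > > >>>>>
> > > >>>>>     /* first round: only reclaim what's idle, no GPU waits */
> > > >>>>>     freed = ttm_tt_shrink_idle(sc->nr_to_scan);
> > > >>>>>     if (freed < sc->nr_to_scan / 2) {
> > > >>>>>             /* fell well short: second round, allow GPU waits */
> > > >>>>>             freed += ttm_tt_shrink_sync(sc->nr_to_scan - freed);
> > > >>>>>     }
> > > >>>>>     return freed;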
> > > >>>> I think we should try to avoid waiting for the GPU in the shrinker callback.
> > > >>>>
> > > >>>> When we get HMM we will have cases where the shrinker is called from
> > > >>>> the HMM code paths, and then we can't wait for the GPU without causing
> > > >>>> deadlocks.
> > > >>> Uh, that doesn't work. Also, the current rules are that you are allowed
> > > >>> to call dma_fence_wait from shrinker callbacks, so that ship has sailed
> > > >>> already. This is because shrinkers are a less restrictive context than
> > > >>> mmu notifier invalidation, and we wait in there too.
> > > >>>
> > > >>> So if you can't wait in shrinkers, you also can't wait in mmu
> > > >>> notifiers (and also not in HMM, which is the same thing). Why do you
> > > >>> need this?
> > > >> The core concept of HMM is that pages are faulted in on demand and it is
> > > >> perfectly valid for one of those pages to be on disk.
> > > >>
> > > >> So when a page fault happens we might need to be able to allocate memory
> > > >> and fetch something from disk to handle that.
> > > >>
> > > >> When this memory allocation in turn waits for the GPU that is
> > > >> running the HMM process, we are pretty much busted.
> > > > Yeah, you can't do that. That's the entire infinite fences discussion.
> > >
> > > Yes, exactly.
> > >
> > > > For HMM to work, we need to stop using dma_fence for userspace sync,
> > >
> > > I was considering separating that into a dma_fence and an hmm_fence,
> > > or something like that.
> >
> > The trouble is that dma_fence in all its forms is uapi. And on gpus
> > without page fault support dma_fence_wait is still required in
> > allocation contexts. So creating a new kernel structure doesn't really
> > solve anything I think, it needs entirely new uapi completely decoupled
> > from memory management. The last time we did new uapi was probably
> > modifiers, and that's still not rolled out years later.
>
> With HMM there should not be any fences! You do not need them.
> If you feel you need them then you are doing something horribly
> wrong. See below for what HMM needs and what that means.
>
>
> > > > and you can only use the amdkfd-style preempt fences. And preempting
> > > > while the pagefault is pending is, I thought, something we require.
> > >
> > > Yeah, problem is that most hardware can't do that :)
> > >
> > > Getting page faults to work is hard enough; preempting while waiting for
> > > a fault to return is not something that was anticipated :)
> >
> > Hm, last summer in a thread you said you'd blocked that because it
> > doesn't work. I agreed; page faults without preempt are rather tough to
> > make work.
> >
> > > > Iow, the HMM page fault handler must not be a dma-fence critical
> > > > section, i.e. it's not allowed to hold up any dma_fence, ever.
> > >
> > > What do you mean with that?
> >
> > dma_fence_begin_signalling()/dma_fence_end_signalling() annotations
> > essentially, i.e. cross-release dependencies. Or the other way round:
> > if you want to be able to allocate memory, you have to guarantee that
> > you're never holding up a dma_fence.
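> >
> > As a minimal sketch of what those annotations look like in practice
> > (the job submission helpers here are hypothetical; only the
> > dma_fence_* calls are the real API):
> >
> >     bool cookie = dma_fence_begin_signalling();
> >
> >     /* Nothing in here may allocate memory or otherwise depend on
> >      * reclaim, because reclaim may be waiting on the very fence we
> >      * are about to signal.
> >      */
> >     submit_job_to_hw(job);             /* hypothetical driver hook */
> >     dma_fence_signal(job->done_fence); /* hypothetical fence field */
> >
> >     dma_fence_end_signalling(cookie);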
>
> Correct, nothing regarding dma/ttm/gem should creep into the HMM
> code path.
>
>
> For HMM, what you want when handling a GPU fault is to do it without
> holding any GPU driver locks, so that the regular page fault handler
> code path can re-enter the GPU driver (through the shrinker) without
> worrying about it.
>
> This is how nouveau does it:
>     - get event about page fault (might hold some GPU lock)
>     - walk the event buffer to get all faulting addresses
>       (might hold some GPU lock)
>
>     ! DROP ALL GPU/DRIVER LOCK !
>
>     - try to coalesce faults together (adjacent addresses
>       trigger a single fault for one range)
>     - call the HMM/mmu notifier helpers to handle the fault
>     - take GPU lock (svmm->mutex for nouveau)
>     - from the fault result update GPU page table
>
>     - Tell GPU it can resume (this can be done after releasing
>       the lock below, it is just up to the device driver folks
>       to decide when to do that).
>
>     - unlock GPU (svmm->mutex for nouveau)
>
> (see nouveau_svm.c, look for range_fault; a rough sketch of this
> flow follows below)
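>
> Roughly, in code (hmm_range_fault(), mmu_interval_read_begin() and
> mmu_interval_read_retry() are the real helpers; the gpu_* functions
> are hypothetical stand-ins for driver-specific code):
>
>     /* all GPU/driver locks have been dropped at this point */
>     again:
>     range.notifier_seq = mmu_interval_read_begin(range.notifier);
>     ret = hmm_range_fault(&range);   /* may fault pages in from disk */
>     if (ret)
>             return ret;
>
>     mutex_lock(&svmm->mutex);
>     if (mmu_interval_read_retry(range.notifier, range.notifier_seq)) {
>             /* raced with an invalidation, start over */
>             mutex_unlock(&svmm->mutex);
>             goto again;
>     }
>     gpu_update_page_table(svmm, &range);   /* hypothetical, CPU only */
>     mutex_unlock(&svmm->mutex);
>
>     gpu_resume_faulted_channel(svmm);      /* hypothetical */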
>
> A GPU page fault should never need to rely on the shrinker to succeed
> in reclaiming memory. The rationale here is that regular memory
> reclaim (which includes regular anonymous or file-backed mmap
> memory) will be able to make forward progress even during a GPU
> page fault. This of course means that while servicing a GPU page
> fault you might free up memory that was also in use by the GPU and
> which will fault again, but the LRU list should make that very
> unlikely.
>
>
> Note that for regular memory reclaim to make forward progress,
> the GPU _must_ support asynchronous GPU page table updates, i.e.
> being able to unmap things from the GPU page table and flush the
> GPU TLB without having to wait for anything running on the GPU.
> This is where not all hardware might work. Only things populated
> through HMM need to be unmapped (dma/ttm/gem should not be
> considered, unless they too can be unmapped without deadlock).
>
> For nouveau this is done through register writes (see
> gf100_vmm_invalidate and gp100_vmm_invalidate_pdb), and the
> GPU page table itself is updated through the CPU.
>
> I think the issue for AMD here is that you rely on updating the
> GPU page table through the GPU command processor, and this
> will not work with HMM. Unless you have a CP channel that
> can always make forward progress no matter what, or you update
> the GPU page table with the CPU and flush the GPU TLB through
> some register write (IIRC this should be doable on AMD, but
> I am forgetting things now).
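>
> As a minimal sketch of what such an asynchronous unmap looks like
> (the register name, macro and page table layout are entirely made
> up; the point is only that nothing here waits on GPU execution):
>
>     /* clear the PTEs with the CPU, no command submission involved */
>     for (i = first_pte; i <= last_pte; ++i)
>             writeq(0, &pt_mmio[i]);
>
>     /* kick the TLB flush through a privileged register write */
>     writel(VMM_INVALIDATE_PDB(pdb_addr), mmio + VMM_INVALIDATE_REG);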
>
>
> There is no need for preemption, even if it is a nice-to-have
> feature (lazy GPU architects and hardware designers! ;)).
>
>
> Now if dma/ttm/gem GPU memory allocations use all the system
> memory then we are in trouble. This was the whole reason for
> limiting ourselves, once upon a time, to not using over 50% of
> memory for dma/ttm/gem. I do not have any good way to solve
> that. If the GPU needs to allocate more memory to make forward
> progress but it has already exhausted all memory through
> dma/ttm/gem, then there is just nothing we can do.
>
> I still believe we should limit dma/ttm/gem memory usage so
> that we never end up pinning all of system memory in GPU realms.
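>
> Something like global page accounting checked at populate time is
> the obvious shape for such a limit (a sketch only, not the actual
> patch; ttm_pages_limit and ttm_tt_try_shrink are stand-in names):
>
>     static atomic_long_t ttm_pages_allocated;
>
>     /* called when populating a ttm_tt with num_pages pages */
>     if (atomic_long_add_return(num_pages, &ttm_pages_allocated) >
>         ttm_pages_limit) {
>             /* over the limit: push other TTs out to swap first */
>             ttm_tt_try_shrink(num_pages);
>     }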

Well, this is exactly what we're trying to do here. And to be able to
shrink dma/ttm/gem memory you eventually have to call dma_fence_wait,
which is where the entire trouble comes in.

The other source of trouble is buffer-based userptr, where we also
call dma_fence_wait.

And of course through dependencies (even across GPUs) this
dma_fence_wait could end up blocked on running some (non-HMM)
context on the same GPU engine that is currently running your HMM
context, which is itself stuck on some kind of fault.

Also, with the recent dma-fence annotation patches we've already locked
down the rule that dma_fence_wait from memory reclaim paths is OK
(i915 already has no limit on system memory usage, and Christian
really wants to drop it for ttm drivers too).
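
To spell out what that rule permits: a shrinker like the one in this
patch may legally block on a fence while freeing pages, along these
lines (a sketch only; the victim selection, fence field, and free
helper are hypothetical, dma_fence_wait() is the real API):

    static unsigned long ttm_tt_shrinker_scan(struct shrinker *shrink,
                                              struct shrink_control *sc)
    {
            struct ttm_tt *tt = ttm_tt_pick_victim();  /* hypothetical */

            /* Blocking on the GPU from reclaim is allowed under the
             * dma-fence annotation rules, as long as no fence
             * signalling path ever allocates memory.
             */
            dma_fence_wait(tt->moving, false);         /* hypothetical field */
            return ttm_tt_free_pages(tt);              /* hypothetical */
    }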

> Hope this helps clarify the requirements and expectations.

So ... what now?

I see a few options:
- require preempt. Tough on the hardware.
- rework the entire memory handling. Not sure that's possible, since
it would mean that HMM GPUs and buffer-based memory management GPUs
cannot co-exist on the same system. Flag days like that across an
entire subsystem don't work.
- gang scheduling, i.e. when a GPU is running an HMM context it must
guarantee that there's not a single buffer-based dma-buf/ttm/gem
context pending (and block any execbuf/cs ioctl before we submit a new
one)
- .. something else?

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

