[PATCH 1/2] drm/ttm: rework ttm_tt page limit v2

Jerome Glisse jglisse at redhat.com
Thu Dec 17 18:09:17 UTC 2020


Adding a few folks on cc just to raise awareness and so that I
can be corrected if I said anything wrong.

On Thu, Dec 17, 2020 at 04:45:55PM +0100, Daniel Vetter wrote:
> On Thu, Dec 17, 2020 at 4:36 PM Christian König
> <christian.koenig at amd.com> wrote:
> > Am 17.12.20 um 16:26 schrieb Daniel Vetter:
> > > On Thu, Dec 17, 2020 at 4:10 PM Christian König
> > > <christian.koenig at amd.com> wrote:
> > >> Am 17.12.20 um 15:36 schrieb Daniel Vetter:
> > >>> On Thu, Dec 17, 2020 at 2:46 PM Christian König
> > >>> <ckoenig.leichtzumerken at gmail.com> wrote:
> > >>>> Am 16.12.20 um 16:09 schrieb Daniel Vetter:
> > >>>>> On Wed, Dec 16, 2020 at 03:04:26PM +0100, Christian König wrote:
> > >>>>> [SNIP]
> > >>>>>> +
> > >>>>>> +/* As long as pages are available make sure to release at least one */
> > >>>>>> +static unsigned long ttm_tt_shrinker_scan(struct shrinker *shrink,
> > >>>>>> +                                      struct shrink_control *sc)
> > >>>>>> +{
> > >>>>>> +    struct ttm_operation_ctx ctx = {
> > >>>>>> +            .no_wait_gpu = true
> > >>>>> Iirc there's an eventual shrinker limit where it gets desperate. I think
> > >>>>> once we hit that, we should allow gpu waits. But it's not passed to
> > >>>>> shrinkers for reasons, so maybe we should have a second round that tries
> > >>>>> to more actively shrink objects if we fell substantially short of what
> > >>>>> reclaim expected us to do?
> > >>>> I think we should try to avoid waiting for the GPU in the shrinker callback.
> > >>>>
> > >>>> When we get HMM we will have cases where the shrinker is called from
> > >>>> there and we can't wait for the GPU then without causing deadlocks.
> > >>> Uh that doesn't work. Also, the current rules are that you are allowed
> > >>> to call dma_fence_wait from shrinker callbacks, so that ship has sailed
> > >>> already. This is because shrinkers are a less restrictive context than
> > >>> mmu notifier invalidation, and we wait in there too.
> > >>>
> > >>> So if you can't wait in shrinkers, you also can't wait in mmu
> > >>> notifiers (and also not in HMM, which is the same thing). Why do you
> > >>> need this?
> > >> The core concept of HMM is that pages are faulted in on demand and it is
> > >> perfectly valid for one of those pages to be on disk.
> > >>
> > >> So when a page fault happens we might need to be able to allocate memory
> > >> and fetch something from disk to handle that.
> > >>
> > >> When this memory allocation then in turn waits for the GPU which is
> > >> running the HMM process we are pretty much busted.
> > > Yeah you can't do that. That's the entire infinite fences discussions.
> >
> > Yes, exactly.
> >
> > > For HMM to work, we need to stop using dma_fence for userspace sync,
> >
> > I was considering of separating that into a dma_fence and a hmm_fence.
> > Or something like this.
> 
> The trouble is that dma_fence in all its forms is uapi. And on gpus
> without page fault support dma_fence_wait is still required in
> allocation contexts. So creating a new kernel structure doesn't really
> solve anything I think, it needs entire new uapi completely decoupled
> from memory management. Last time we've done new uapi was probably
> modifiers, and that's still not rolled out years later.

With HMM there should not be any fence! You do not need them.
If you feel you need them then you are doing something horribly
wrong. See below on what HMM needs and what that means.


> > > and you can only use the amdkfd style preempt fences. And preempting
> > > while the pagefault is pending is I thought something we require.
> >
> > Yeah, problem is that most hardware can't do that :)
> >
> > Getting page faults to work is hard enough, preempting while waiting for
> > a fault to return is not something which was anticipated :)
> 
> Hm last summer in a thread you said you've blocked that because it
> doesn't work. I agreed, page fault without preempt is rather tough to
> make work.
> 
> > > Iow, the HMM page fault handler must not be a dma-fence critical
> > > section, i.e. it's not allowed to hold up any dma_fence, ever.
> >
> > What do you mean with that?
> 
> dma_fence_signalling_begin/end() annotations essentially, i.e.
> cross-release dependencies. Or the other way round, if you want to be
> able to allocate memory you have to guarantee that you're never
> holding up a dma_fence.

Correct, nothing regarding dma/ttm/gem should creep into the HMM
code path.


For HMM, what you want when handling a GPU fault is to do it
without holding any GPU driver locks, so that the regular page
fault handler code path can go back into the GPU driver (through
the shrinker) without worrying about it.

This is how nouveau does it:
    - get event about page fault (might hold some GPU lock)
    - walk the event buffer to get all faulting addresses
      (might hold some GPU lock)

    ! DROP ALL GPU/DRIVER LOCK !

    - try to coalesce faults together (adjacent addresses
      trigger a fault for a single range)
    - call in HMM/mmu notifier helpers to handle the fault
    - take GPU lock (svmm->mutex for nouveau)
    - from the fault result update GPU page table

    - Tell GPU it can resume (this can be done after releasing
      the lock below, it is just up to the device driver folks
      to decide when to do that).

    - unlock GPU (svmm->mutex for nouveau)

(see nouveau_svm.c, look for range_fault; a rough sketch of this
flow follows below)
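
Here is a minimal sketch of that flow in kernel-style C. All the
gpu_*() helpers and the gpu_device/gpu_fault_batch/gpu_fault_range
structures are hypothetical stand-ins for driver specific code, not
real nouveau APIs; the point is only where locks are dropped and
where the HMM helpers run:

    /*
     * Rough sketch of the flow above, loosely modeled on the
     * nouveau range_fault path. All names are hypothetical.
     */
    static void gpu_handle_pagefaults(struct gpu_device *gdev)
    {
            struct gpu_fault_batch batch;
            unsigned int i;

            /* May hold GPU/driver locks while draining the fault buffer. */
            gpu_read_fault_buffer(gdev, &batch);

            /* !! drop all GPU/driver locks before touching core MM !! */
            gpu_unlock_all(gdev);

            /* Merge adjacent faulting addresses into ranges. */
            gpu_coalesce_faults(&batch);

            for (i = 0; i < batch.nr_ranges; i++) {
                    struct gpu_fault_range *range = &batch.ranges[i];

                    /*
                     * This is where the HMM/mmu notifier helpers run;
                     * they may allocate memory and fault pages in from
                     * disk, so regular reclaim must be free to make
                     * progress here.
                     */
                    gpu_range_fault(gdev->svmm, range);

                    mutex_lock(&gdev->svmm->mutex);
                    gpu_update_page_table(gdev, range); /* CPU writes PTEs */
                    mutex_unlock(&gdev->svmm->mutex);
            }

            /* Tell the GPU it can resume; exact point is a driver choice. */
            gpu_resume_faulted_channels(gdev, &batch);
    }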

A GPU page fault should never need to rely on the shrinker to
succeed in reclaiming memory. The rationale here is that regular
memory reclaim (which includes regular anonymous or mmap file
backed memory) will be able to make forward progress even during
a GPU page fault. This of course means that while servicing a GPU
page fault you might free up memory that was also in use by the
GPU and which will fault again, but the LRU list should make that
very unlikely.


Note that for regular memory reclaim to make forward progress
the GPU _must_ support asynchronous GPU page table updates, i.e.
being able to unmap things from the GPU page table and flush the
GPU TLB without having to wait for anything running on the GPU.
This is where not all hardware might work. Only things populated
through HMM need to be unmapped (dma/ttm/gem should not be
considered unless they too can be unmapped without deadlock).

For nouveau this is done through register writes (see
gf100_vmm_invalidate and gp100_vmm_invalidate_pdb); the GPU page
table itself is updated through the CPU.
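
To make the "register write" point concrete, here is a deliberately
hypothetical sketch: the register offsets, bits and struct fields
are made up and do not match gf100_vmm_invalidate or any real GPU.
What matters is that nothing in it waits on work already submitted
to the GPU:

    #define GPU_TLB_INVALIDATE_TRIGGER  0x1000  /* hypothetical offsets */
    #define GPU_TLB_INVALIDATE_STATUS   0x1004
    #define GPU_TLB_INVALIDATE_BUSY     BIT(0)

    static void gpu_tlb_invalidate(struct gpu_device *gdev, u64 pdb_addr)
    {
            /* The CPU already updated the page tables, now kick the flush. */
            writeq(pdb_addr, gdev->mmio + GPU_TLB_INVALIDATE_TRIGGER);

            /* Poll a status bit instead of waiting on a GPU fence. */
            while (readl(gdev->mmio + GPU_TLB_INVALIDATE_STATUS) &
                   GPU_TLB_INVALIDATE_BUSY)
                    cpu_relax();
    }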

I think the issue for AMD here is that you rely on updating the
GPU page table through the GPU command processor, and this will
not work with HMM. Unless you have a CP channel that can always
make forward progress no matter what. Or update the GPU page
table with the CPU and flush the GPU TLB through some register
write (iirc this should be doable on AMD but I am forgetting
things now).


There is no need for preemption even if it is a nice-to-have
feature (lazy GPU design architects and hardware designers! ;)).


Now if dma/ttm/gem GPU memory allocations use all the system
memory then we are in trouble. This was the whole reason for
limiting ourselves, once upon a time, to not using over 50% of
memory for dma/ttm/gem. I do not have any good way to solve
that. If, to make forward progress, the GPU needs to allocate
more memory but it has already exhausted all memory through
dma/ttm/gem then there is just nothing we can do.

I still believe we should limit dma/ttm/gem memory usage so
that we never end up pinning all system memory in GPU realms.
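
As a rough illustration of that kind of limit (a sketch only, not
the code from the patch under discussion; ttm_glob_pages and
ttm_pages_limit are made-up names), the idea is a global counter
checked against a fraction of totalram_pages():

    /*
     * Sketch of a global "do not pin more than ~50% of RAM" limit.
     * All names are hypothetical.
     */
    static atomic_long_t ttm_glob_pages;

    static long ttm_pages_limit(void)
    {
            return totalram_pages() / 2;    /* the historical 50% rule */
    }

    static int ttm_glob_reserve_pages(unsigned long num_pages)
    {
            if (atomic_long_add_return(num_pages, &ttm_glob_pages) >
                ttm_pages_limit()) {
                    atomic_long_sub(num_pages, &ttm_glob_pages);
                    return -ENOMEM; /* or try to swap/shrink first */
            }
            return 0;
    }

    static void ttm_glob_release_pages(unsigned long num_pages)
    {
            atomic_long_sub(num_pages, &ttm_glob_pages);
    }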


Hope this helps clarify requirements and expectations.

Cheers,
Jérôme


