[PATCH 1/2] drm/ttm: rework ttm_tt page limit v2

Daniel Vetter daniel at ffwll.ch
Thu Dec 17 15:45:55 UTC 2020


On Thu, Dec 17, 2020 at 4:36 PM Christian König
<christian.koenig at amd.com> wrote:
> Am 17.12.20 um 16:26 schrieb Daniel Vetter:
> > On Thu, Dec 17, 2020 at 4:10 PM Christian König
> > <christian.koenig at amd.com> wrote:
> >> Am 17.12.20 um 15:36 schrieb Daniel Vetter:
> >>> On Thu, Dec 17, 2020 at 2:46 PM Christian König
> >>> <ckoenig.leichtzumerken at gmail.com> wrote:
> >>>> Am 16.12.20 um 16:09 schrieb Daniel Vetter:
> >>>>> On Wed, Dec 16, 2020 at 03:04:26PM +0100, Christian König wrote:
> >>>>> [SNIP]
> >>>>>> +
> >>>>>> +/* As long as pages are available make sure to release at least one */
> >>>>>> +static unsigned long ttm_tt_shrinker_scan(struct shrinker *shrink,
> >>>>>> +                                      struct shrink_control *sc)
> >>>>>> +{
> >>>>>> +    struct ttm_operation_ctx ctx = {
> >>>>>> +            .no_wait_gpu = true
> >>>>> Iirc there's an eventual shrinker limit where it gets desperate. I think
> >>>>> once we hit that, we should allow gpu waits. But it's not passed to
> >>>>> shrinkers for reasons, so maybe we should have a second round that tries
> >>>>> to more actively shrink objects if we fell substantially short of what
> >>>>> reclaim expected us to do?
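
For illustration, a minimal sketch of what such a two-pass scan could look
like; ttm_tt_swapout_one() is a hypothetical stand-in for whatever helper
actually releases one ttm_tt's pages, it is not part of the patch:

/* Hypothetical sketch: pass 1 avoids blocking on the GPU, pass 2 allows
 * GPU waits if pass 1 released substantially less than reclaim asked for.
 */
static unsigned long ttm_tt_shrinker_scan(struct shrinker *shrink,
                                          struct shrink_control *sc)
{
        struct ttm_operation_ctx ctx = { .no_wait_gpu = true };
        unsigned long freed = 0;

        while (freed < sc->nr_to_scan && ttm_tt_swapout_one(&ctx))
                ++freed;

        if (freed < sc->nr_to_scan / 2) {
                /* Second, more aggressive round: accept GPU waits. */
                ctx.no_wait_gpu = false;
                while (freed < sc->nr_to_scan && ttm_tt_swapout_one(&ctx))
                        ++freed;
        }

        return freed ? freed : SHRINK_STOP;
}

How the "desperate" state would reach the second pass is exactly the open
question, since core reclaim doesn't pass that information to shrinkers.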
> >>>> I think we should try to avoid waiting for the GPU in the shrinker callback.
> >>>>
> >>>> When we get HMM we will have cases where the shrinker is called from
> >>>> there and we can't wait for the GPU then without causing deadlocks.
> >>> Uh that doesn't work. Also, the current rules are that you are allowed
> >>> to call dma_fence_wait from shrinker callbacks, so that ship has sailed
> >>> already. This is because shrinkers are a less restrictive context than
> >>> mmu notifier invalidation, and we wait in there too.
> >>>
> >>> So if you can't wait in shrinkers, you also can't wait in mmu
> >>> notifiers (and also not in HMM, which is the same thing). Why do you
> >>> need this?
> >> The core concept of HMM is that pages are faulted in on demand and it is
> >> perfectly valid for one of those pages to be on disk.
> >>
> >> So when a page fault happens we might need to be able to allocate memory
> >> and fetch something from disk to handle that.
> >>
> >> When this memory allocation then in turn waits for the GPU which is
> >> running the HMM process, we are pretty much busted.
> > Yeah you can't do that. That's the entire infinite fences discussions.
>
> Yes, exactly.
>
> > For HMM to work, we need to stop using dma_fence for userspace sync,
>
> I was considering separating that into a dma_fence and an hmm_fence,
> or something like that.

The trouble is that dma_fence in all its forms is uapi. And on gpus
without page fault support dma_fence_wait is still required in
allocation contexts. So creating a new kernel structure doesn't really
solve anything, I think; it needs entirely new uapi completely decoupled
from memory management. The last time we did new uapi was probably
modifiers, and that's still not rolled out years later.

> > and you can only use the amdkfd-style preempt fences. And preempting
> > while the pagefault is pending is, I thought, something we require.
>
> Yeah, problem is that most hardware can't do that :)
>
> Getting page faults to work is hard enough; preempting while waiting for
> a fault to return is not something which was anticipated :)

Hm, last summer in a thread you said you'd blocked that because it
doesn't work. I agreed; page faults without preempt are rather tough to
make work.

> > Iow, the HMM page fault handler must not be a dma-fence critical
> > section, i.e. it's not allowed to hold up any dma_fence, ever.
>
> What do you mean with that?

dma_fence_begin_signalling()/dma_fence_end_signalling() annotations
essentially, i.e. cross-release dependencies. Or the other way round: if
you want to be able to allocate memory, you have to guarantee that
you're never holding up a dma_fence.
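
For reference, the annotation usage looks roughly like this; only a
sketch, the exact call sites depend on the driver:

        bool cookie = dma_fence_begin_signalling();

        /* Everything in here is a dma_fence critical section: it must
         * eventually lead to the fence getting signalled. Blocking
         * allocations in here create a reclaim -> fence dependency that
         * lockdep will flag, because reclaim (via shrinkers) may itself
         * wait on fences.
         */

        dma_fence_end_signalling(cookie);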
-Daniel

> > One consequence of this is that you can use HMM for compute, but until
> > we've revamped all the linux winsys layers, not for gl/vk. Or at least
> > I'm not seeing how.
> >
> > Also like I said, dma_fence_wait is already allowed in mmu notifiers,
> > so we've already locked down these semantics even more. Due to the
> > nesting of gfp allocation contexts, allowing dma_fence_wait in mmu
> > notifiers (i.e. __GFP_ALLOW_RECLAIM or whatever the flag is exactly)
> > implies it's allowed in shrinkers. And only if you forbid it from
> > all allocation contexts (which makes all buffer object managed gpu
> > memory essentially pinned, exactly what you're trying to lift here) do
> > you get what you want.
> >
> > The other option is to make HMM and dma-buf completely disjoint worlds
> > with no overlap, and use gang scheduling on the gpu (to guarantee that
> > there's never any dma_fence in pending state while an HMM task might
> > cause a fault).
> >
> >> [SNIP]
> >>> So where do you want to recurse here?
> >> I wasn't aware that without __GFP_FS shrinkers are not called.
> > Maybe double check, but that's at least my understanding. GFP flags
> > are flags, but in reality it's a strictly nesting hierarchy:
> > GFP_KERNEL > GFP_NOFS > GFP_NOIO > GFP_RECLAIM > GFP_ATOMIC (ok atomic
> > is special, since it's allowed to dip into emergency reserve).
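
A hedged illustration of the __GFP_FS point: whether a shrinker does any
filesystem-touching work without __GFP_FS is usually the shrinker's own
decision; the superblock shrinker, for example, bails out on its own,
roughly like this (modelled on super_cache_scan(), names made up):

static unsigned long example_scan(struct shrinker *shrink,
                                  struct shrink_control *sc)
{
        /* Refuse to do fs work if the allocation that triggered reclaim
         * did not allow it.
         */
        if (!(sc->gfp_mask & __GFP_FS))
                return SHRINK_STOP;

        /* ... actual reclaim work ... */
        return 0;
}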
>
> Going to read myself into that over the holidays.
>
> Christian.



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

