[PATCH 1/2] drm/ttm: rework ttm_tt page limit v2

Daniel Vetter daniel at ffwll.ch
Thu Dec 17 15:26:35 UTC 2020


On Thu, Dec 17, 2020 at 4:10 PM Christian König
<christian.koenig at amd.com> wrote:
>
> Am 17.12.20 um 15:36 schrieb Daniel Vetter:
> > On Thu, Dec 17, 2020 at 2:46 PM Christian König
> > <ckoenig.leichtzumerken at gmail.com> wrote:
> >> Am 16.12.20 um 16:09 schrieb Daniel Vetter:
> >>> On Wed, Dec 16, 2020 at 03:04:26PM +0100, Christian König wrote:
> >>> [SNIP]
> >>>> +
> >>>> +/* As long as pages are available make sure to release at least one */
> >>>> +static unsigned long ttm_tt_shrinker_scan(struct shrinker *shrink,
> >>>> +                                      struct shrink_control *sc)
> >>>> +{
> >>>> +    struct ttm_operation_ctx ctx = {
> >>>> +            .no_wait_gpu = true
> >>> Iirc there's an eventual shrinker limit where it gets desperate. I think
> >>> once we hit that, we should allow gpu waits. But it's not passed to
> >>> shrinkers for reasons, so maybe we should have a second round that tries
> >>> to more actively shrink objects if we fell substantially short of what
> >>> reclaim expected us to do?
> >> I think we should try to avoid waiting for the GPU in the shrinker callback.
> >>
> >> When we get HMM we will have cases where the shrinker is called from
> >> there and we can't wait for the GPU then without causing deadlocks.
> > Uh that doesn't work. Also, the current rules are that you are allowed
> > to call dma_fence_wait from shrinker callbacks, so that ship has sailed
> > already. This is because shrinkers are a less restrictive context than
> > mmu notifier invalidation, and we wait in there too.
> >
> > So if you can't wait in shrinkers, you also can't wait in mmu
> > notifiers (and also not in HMM, which is the same thing). Why do you
> > need this?
>
> The core concept of HMM is that pages are faulted in on demand and it is
> perfectly valid for one of those pages to be on disk.
>
> So when a page fault happens we might need to be able to allocate memory
> and fetch something from disk to handle that.
>
> When this memory allocation then in turn waits for the GPU which is
> running the HMM process we are pretty much busted.

Yeah, you can't do that. That's the entire infinite fences discussion.
For HMM to work, we need to stop using dma_fence for userspace sync;
you can only use the amdkfd style preempt fences. And I thought
preempting while the pagefault is pending is something we require.

Iow, the HMM page fault handler must not be a dma-fence critical
section, i.e. it's not allowed to hold up any dma_fence, ever.

One consequence of this is that you can use HMM for compute, but not
for gl/vk until we've revamped all the linux winsys layers. Or at
least I'm not seeing how.

Also, like I said, dma_fence_wait is already allowed in mmu notifiers,
so we've already locked down these semantics even more. Due to the
nesting of gfp allocation contexts, allowing dma_fence_wait in mmu
notifiers (i.e. __GFP_ALLOW_RECLAIM or whatever the flag is exactly)
implies it's allowed in shrinkers too. Only if you forbid it from
all allocation contexts (which makes all buffer object managed gpu
memory essentially pinned, exactly what you're trying to lift here) do
you get what you want.

The other option is to make HMM and dma-buf completely disjoint worlds
with no overlap, and use gang scheduling on the gpu (to guarantee that
there's never any dma_fence in a pending state while an HMM task might
cause a fault).
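
Purely to illustrate the rule above, here's a rough sketch (the foo_*
names are made up, this isn't from any real driver) of how a
submission path can be annotated as a dma-fence signalling critical
section, so that lockdep immediately complains about any wait or
allocation in there that could recurse into reclaim:

#include <linux/dma-fence.h>

static int foo_submit_job(struct foo_job *job)
{
        bool cookie = dma_fence_begin_signalling();

        /*
         * From here until end_signalling we are holding up a dma_fence:
         * no GFP_KERNEL allocations, and no waiting on anything that
         * might depend on an HMM page fault being served.
         */
        foo_hw_ring_emit(job);

        dma_fence_end_signalling(cookie);
        return 0;
}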

> >>> Also don't we have a trylock_only flag here to make sure drivers don't do
> >>> something stupid?
> >> Mhm, I'm pretty sure drivers should only be minimally involved.
> > In the move callback they might try to acquire other locks. Or is that
> > a driver bug?
>
> That would be a rather serious driver bug at the moment.
>
> > Just kinda feels wrong if we have this and don't set it.
> >
> >>>> +    };
> >>>> +    int ret;
> >>>> +
> >>>> +    if (sc->gfp_mask & GFP_NOFS)
> >>>> +            return 0;
> >>>> +
> >>>> +    ret = ttm_bo_swapout(&ctx, GFP_NOFS);
> >>>> +    return ret < 0 ? SHRINK_EMPTY : ret;
> >>>> +}
> >>>> +
> >>>> +/* Return the number of pages available or SHRINK_EMPTY if we have none */
> >>>> +static unsigned long ttm_tt_shrinker_count(struct shrinker *shrink,
> >>>> +                                       struct shrink_control *sc)
> >>>> +{
> >>>> +    struct ttm_buffer_object *bo;
> >>>> +    unsigned long num_pages = 0;
> >>>> +    unsigned int i;
> >>>> +
> >>>> +    if (sc->gfp_mask & GFP_NOFS)
> >>>> +            return 0;
> >>> The count function should always count, and I'm not seeing a reason why
> >>> you couldn't do that here ... Also my understanding is that GFP_NOFS never
> >>> goes into shrinkers (the NOFS comes from shrinkers originally only being
> >>> used for filesystem objects), so this is doubly redundant.
> >>>
> >>> My understanding is that gfp_mask is just to convey the right zones and
> >>> stuff, so that your shrinker can try to shrink objects in the right zones.
> >>> Hence I think the check in the _scan() function should also be removed.
> >>>
> >>> Also the non __ prefixed flags are the combinations callers are supposed
> >>> to look at. Memory reclaim code needs to look at the __GFP flags, see e.g.
> >>> gfpflags_allow_blocking() or fs_reclaim_acquire().
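
To illustrate the count side while we're at it (the page counter here
is a made-up stand-in for however ttm ends up tracking swappable
pages): just report what could be freed, and don't look at gfp_mask at
all.

#include <linux/atomic.h>
#include <linux/shrinker.h>

/* stand-in for the real bookkeeping, purely illustrative */
static atomic_long_t ttm_example_allocated_pages;

static unsigned long ttm_tt_shrinker_count(struct shrinker *shrink,
                                           struct shrink_control *sc)
{
        unsigned long num_pages;

        num_pages = atomic_long_read(&ttm_example_allocated_pages);

        return num_pages ? num_pages : SHRINK_EMPTY;
}
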
> >> Ok got it. But don't we need to somehow avoid recursion here?
> > How can you recurse?
> >
> > GFP_KERNEL includes the __GFP_FS flag, which allows calling into
> > shrinkers. GFP_NOFS (which we're using when we're in the shrinker)
> > does not have the __GFP_FS flag, and so calling into shrinkers isn't
> > allowed. It's still allowed to clean out page cache and do io in
> > general (which is the __GFP_IO flag), and it's also allowed to do
> > memory reclaim of userspace ptes, which can involve calling into mmu
> > notifiers (the __GFP_RECLAIM flag is for that). So rules are:
> > - GFP_KERNEL for everyone who's not special
> > - GFP_NOFS from shrinkers.
> > - GFP_NOIO from block io handlers (not relevant for gpus)
> > - GFP_ATOMIC only from mmu notifier callbacks. This is the
> > reason why nothing in the dma-fence signalling critical path is allowed to
> > allocate memory with anything other than GFP_ATOMIC. I guess I need to
> > resurrect my patches, but I thought when we discussed this it's clear
> > that at least in theory all these allocations in scheduler code are
> > bugs and need to be replaced by mempool allocations.
> >
> > So where do you want to recurse here?
>
> I wasn't aware that without __GFP_FS shrinkers are not called.

Maybe double check, but that's at least my understanding. GFP flags
are flags, but in reality it's a strictly nesting hierarchy:
GFP_KERNEL > GFP_NOFS > GFP_NOIO > GFP_RECLAIM > GFP_ATOMIC (ok, atomic
is special, since it's allowed to dip into the emergency reserves).
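
And if you want to be extra sure that nothing called from the swapout
path can recurse back into shrinkers, you can scope it explicitly.
Rough sketch, the function name is made up:

#include <linux/sched/mm.h>
#include <linux/slab.h>

static void *ttm_example_alloc_in_reclaim(size_t size)
{
        unsigned int nofs_flags = memalloc_nofs_save();
        void *buf;

        /*
         * Within this scope GFP_KERNEL behaves like GFP_NOFS, so (per
         * the above) nested reclaim shouldn't call back into this
         * shrinker.
         */
        buf = kmalloc(size, GFP_KERNEL);

        memalloc_nofs_restore(nofs_flags);
        return buf;
}
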
-Daniel

> Thanks for clearing that up,
> Christian.
>
> >
> > Cheers, Daniel
> >
> >>> [SNIP]
> >>>> +int ttm_tt_mgr_init(void);
> >>>> +void ttm_tt_mgr_fini(void);
> >>>> +
> >>>>    #if IS_ENABLED(CONFIG_AGP)
> >>>>    #include <linux/agp_backend.h>
> >>> For testing I strongly recommend a debugfs file to trigger this shrinker
> >>> completely, from the right lockdep context (i.e. using
> >>> fs_reclaim_acquire/release()). Much easier to test that way. See
> >>> i915_drop_caches_set() in i915_debugfs.c.
> >>>
> >>> That way you can fully test it all without hitting anything remotely
> >>> resembling actual OOM, which tends to kill all kinds of things.
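
Rough sketch of what I mean (names made up, modelled on
i915_drop_caches_set): a debugfs attribute that runs the scan callback
under the reclaim lockdep context, so the shrinker can be hammered
without any real memory pressure:

#include <linux/debugfs.h>
#include <linux/sched/mm.h>
#include <linux/shrinker.h>

static int ttm_example_shrink_set(void *data, u64 val)
{
        struct shrink_control sc = {
                .gfp_mask = GFP_KERNEL,
                .nr_to_scan = val,
        };

        /*
         * Pretend we're in reclaim so lockdep enforces the same rules
         * as for real shrinker invocations.
         */
        fs_reclaim_acquire(GFP_KERNEL);
        ttm_tt_shrinker_scan(NULL, &sc);
        fs_reclaim_release(GFP_KERNEL);

        return 0;
}
DEFINE_DEBUGFS_ATTRIBUTE(ttm_example_shrink_fops, NULL,
                         ttm_example_shrink_set, "%llu\n");
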
> >> That's exactly the reason I was switching from sysfs to debugfs in the
> >> other patch set.
> >>
> >> Ok, in this case I'm going to reorder all that stuff and send out the
> >> debugfs patches first.
> >>
> >>> Aside from the detail work I think this is going in the right direction.
> >>> -Daniel
> >> Thanks,
> >> Christian.
> >
> >
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

