[PATCH 3/3] drm/ttm: make up to 90% of system memory available

Daniel Vetter daniel at ffwll.ch
Thu Nov 19 15:42:52 UTC 2020


On Thu, Nov 19, 2020 at 04:27:00PM +0100, Christian König wrote:
> Am 18.11.20 um 22:15 schrieb Daniel Vetter:
> > On Wed, Nov 18, 2020 at 02:35:24PM +0100, Christian König wrote:
> > > Am 17.11.20 um 18:19 schrieb Daniel Vetter:
> > > > On Tue, Nov 17, 2020 at 03:06:15PM +0100, Christian König wrote:
> > > > > Increase the amount of system memory drivers can use to about 90% of
> > > > > the total available.
> > > > > 
> > > > > Signed-off-by: Christian König <christian.koenig at amd.com>
> > > > > ---
> > > > >    drivers/gpu/drm/ttm/ttm_bo.c | 2 +-
> > > > >    1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> > > > > index a958135cb3fe..0a93df93dba4 100644
> > > > > --- a/drivers/gpu/drm/ttm/ttm_bo.c
> > > > > +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> > > > > @@ -1267,7 +1267,7 @@ static int ttm_bo_global_init(void)
> > > > >    	 * the available system memory.
> > > > >    	 */
> > > > >    	num_pages = (u64)si.totalram * si.mem_unit;
> > > > > -	num_pages = (num_pages * 50 / 100) >> PAGE_SHIFT;
> > > > > +	num_pages = (num_pages * 90 / 100) >> PAGE_SHIFT;
> > > > I don't think this is the design we want. As long as it was set at "half
> > > > of system memory" it was clear that a) it's a hack b) precision didn't
> > > > matter.
> > > Yeah, I'm also wondering what to do here exactly. In general I would
> > > completely nuke that limit if possible.
> > > 
> > > > But if you go to the limit and still want to keep the "we make sure
> > > > there's no OOM", then precision starts to matter:
> > > > - memory hotplug and hotunplug is a thing
> > > > - userspace can mlock, and it's configurable
> > > > - drivers can pin_user_pages for IO and random other stuff. Some of it is
> > > >     accounted as some subsystem specific mlock (like rdma does iirc), some
> > > >     is just yolo or short term enough (like)
> > > > - none of what we do here takes into consideration any interactions with
> > > >     core mm tracking (like cgroups or numa or anything like that)
> > > OOM is perfectly fine with me; we should just not run into an OOM killer
> > > situation because we call shmem_read_mapping_page_gfp() in the shrinker
> > > callback.
> > > 
> > > Any idea how we could guarantee that?
> > Uh ...
> > 
> > I just realized I missed something between how ttm works and how i915
> > works.  With i915 directly using shmem, all we do in the shrinker is unpin
> > the shmem pages. With ttm we have the page pool in the middle. And it's
> > needed for dma coherent and other things. This is rather fundamental.
> 
> Yeah, the WC/UC handling is otherwise way too slow. We could only do this
> for normal WB allocations.
> 
> > btw I don't think you'll get an OOM, you'll get a lockdep splat because
> > you're recursing on the fs_reclaim fake lock. We should probably put a
> > might_alloc() into shmem_read_mapping_page_gfp().
> 
> Yeah, one reason why I haven't enabled that yet.
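
Roughly this is what I mean, as an untested sketch on top of mm/shmem.c,
assuming the might_alloc() annotation helper I have in flight:

/* Sketch only: make lockdep complain whenever someone calls in here while
 * holding the fs_reclaim "lock", i.e. from a shrinker, even on runs where
 * no allocation actually happens.
 */
struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
					 pgoff_t index, gfp_t gfp)
{
	might_alloc(gfp);

	/* ... existing lookup/allocation code unchanged ... */
}

That way the recursion shows up as a lockdep splat in CI instead of as a
rare OOM on someone's desktop.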
> 
> > 
> > > > If we want to drop the "half of system ram" limit (and yes that makes
> > > > sense) I think the right design is:
> > > > 
> > > > - we give up on the "no OOM" guarantee.
> > > > 
> > > > - This means if you want real isolation of tasks, we need cgroups, and we
> > > >     need to integrate ttm cgroups with system memory cgroups somehow. Unlike
> > > >     randomly picked hardcoded limits this should work a lot more reliably
> > > >     and be a lot more useful in practical use in the field.
> > > > 
> > > > - This also means that drivers start to fail in interesting ways. I think
> > > >     locking headaches are covered with all the lockdep annotations I've
> > > >     pushed, plus some of the things I still have in-flight (I have a
> > > >     might_alloc() annotation somewhere). That leaves validation of error
> > > >     paths for when allocations fail. Ime a very effective way we used in
> > > >     i915 is (ab)using EINTR restarting, which per drmIoctl uapi spec is
> > > >     required. We could put a debug mode into ttm_tt which randomly fails
> > > >     with -EINTR to make sure it's all working correctly (plus anything else
> > > >     that allocates memory), and unlike real out-of-memory injection piglit
> > > >     and any other cts will complete without failure. Which gives us an
> > > >     excellent metric for "does it work". Actual OOM, even an injected one,
> > > >     tends to make stuff blow up in a way that's very hard to track and make
> > > >     sure you've got good coverage, since all your usual tests die pretty
> > > >     quickly.
> > > > 
> > > > - ttm_tt needs to play fair with every other system memory user. We need
> > > >     to register a shrinker for each ttm_tt (so usually one per device I
> > > >     guess), which walks the lru (in shrink_count) and uses dma_resv_trylock
> > > >     for actual shrinking. We probably want to move it to SYSTEM first for
> > > >     that shrinker to pick up, so that there's some global fairness across
> > > >     all ttm_tt.
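
To unpack the -EINTR injection idea above a bit: what I have in mind is
roughly the below. Completely untested, and the Kconfig symbol and function
name are made up:

/* Debug only: randomly fail with -EINTR in the ttm_tt allocation paths so
 * that the userspace restart handling required by the drmIoctl() contract
 * gets exercised on every CI run, without needing real memory pressure.
 */
static bool ttm_debug_inject_eintr(void)
{
	if (!IS_ENABLED(CONFIG_DRM_TTM_EINTR_INJECTION))
		return false;

	return (prandom_u32() % 128) == 0;	/* roughly 1 in 128 calls */
}

plus a check at the top of ttm_tt_populate() and anything else that
allocates memory:

	if (ttm_debug_inject_eintr())
		return -EINTR;

Since every ioctl gets restarted, piglit and the various CTS suites still
complete, and any test failure points straight at a broken error path.
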
> > > I already have patches for this (the shrinker).
> > > 
> > > What's still missing is teaching the OOM killer which task is using the
> > > memory, since memory referenced through file descriptors is currently
> > > not accounted towards any process resources.
> > Yeah I don't think we've fixed that problem for i915 either. It loves to
> > kill random other stuff. In igt we solve this by marking any igt testcase
> > as "pls kill this first". Well "solve".
> 
> Well that is a "creative" solution :)
> 
> I will rather be looking into whether we can somehow track which files_struct
> a file belongs to, and whether we can somehow use this in the OOM killer.
> 
> My last try at giving each file something wasn't received well.
> 
> The real bummer is that you can really nicely create a denial of service
> using memfd_create() because of this :)

Yeah proper accounting is the right fix here I agree.

> > > > - for GFP_DMA32 that means zone-aware shrinkers. We've never used those,
> > > >     because thus far i915 didn't have any big need for low memory, so this
> > > >     is untested in practice. But it's supposed to be a thing.
> > > I think we can mostly forget about GFP_DMA32; this should only be used for
> > > AGP and other ancient hardware.
> > Ok, that's good. So having an arbitrary "only half of GFP_DMA32" for these
> > should also be acceptable, since back then gpus didn't really have
> > gigabytes of vram yet. Biggest r500 seems to have topped out at 512MB.
> > 
> > How much do we require the dma page pool? Afaiui it's only needed when
> > there are bounce buffers or something nasty like that present. Are there
> > more use cases?
> 
> Not that I know of.
> 
> > 
> > For fixing the TT -> SYSTEM problem of calling shmem_read_mapping_page_gfp
> > from the shrinker I think there are 3 solutions:
> > 
> > 1) Call shmem_read_mapping_page_gfp with GFP_IO. This way we should not be
> >    calling back into our shrinkers, which are at GFP_FS level. This isn't
> >    great, but since page reclaim is a gigantic loop which repeats as long as
> >    we're making meaningful forward progress, we should be able to trickle
> >    TT pages to SYSTEM pages even under severe memory pressure.
> 
> This is most likely the way to go for us.

GFP_NOFS is the flag, I mixed them up.
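
Concretely, for the copy loop in ttm_tt_swapout() that would look roughly
like the below (untested, and the final mask probably wants to be derived
from mapping_gfp_mask() instead of hardcoding GFP_NOFS):

	for (i = 0; i < ttm->num_pages; ++i) {
		struct page *from_page = ttm->pages[i];
		struct page *to_page;

		if (!from_page)
			continue;

		/* GFP_NOFS: we may do IO, but never recurse into fs-level
		 * reclaim, i.e. never back into our own shrinker.
		 */
		to_page = shmem_read_mapping_page_gfp(swap_space, i, GFP_NOFS);
		if (IS_ERR(to_page))
			return PTR_ERR(to_page);

		copy_highpage(to_page, from_page);
		set_page_dirty(to_page);
		mark_page_accessed(to_page);
		put_page(to_page);
	}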

> > 2) For the normal page pool (i.e. neither GFP_DMA32 nor dma_alloc_coherent
> >    nor write-combine) we stop copying the pages for TT <-> SYSTEM and just
> >    directly pin the pages we get from shmem, like all the shmem-based SoC
> >    drivers do. No copying, no need for allocations, no problem in
> >    shrinkers.
> 
> Not really doable, WC is just too widely used and shmem doesn't support
> larger orders.

Hm, I thought you get large pages automatically if you reassemble them.
It's just the interface that sucks.
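
For the plain WB pool what I mean is essentially the shmem-helper pattern,
something like the sketch below. The two ttm-side function names are made up
for illustration; drm_gem_get_pages()/drm_gem_put_pages() are the existing
helpers:

/* Populate: pin the shmem pages directly instead of copying them into a
 * separate pool.
 */
static int ttm_tt_populate_wb(struct drm_gem_object *obj,
			      struct page ***pages)
{
	struct page **p = drm_gem_get_pages(obj);

	if (IS_ERR(p))
		return PTR_ERR(p);

	*pages = p;
	return 0;
}

/* Unpopulate: just drop the pin, shmem and swap take care of the rest. */
static void ttm_tt_unpopulate_wb(struct drm_gem_object *obj,
				 struct page **pages)
{
	drm_gem_put_pages(obj, pages, true, true);
}

No copy, no allocation anywhere, and all the shrinker has to do is drop the
pin.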

btw did you look at 3 below? It might break dma-api assumptions a bit too
badly since we're doing nice page flushing behind everyone's back.
-Daniel

> 
> Regards,
> Christian.
> 
> > 
> > 3) For the remaining page pools we'd need ttmfs, so that we can control the
> >    address_space_operations and swap our pages out directly, bypassing
> >    shmem and the copying. Iirc Chris Wilson had a prototype once somewhere
> >    under gemfs for this. Only needed if 1) isn't fast enough for these
> >    because there are too many such cases where we care.
> > 
> > At least on intel hw the direction is very much towards fully coherent,
> > so 1&2 should be good enough.
> > -Daniel
> > 
> > > Christian.
> > > 
> > > > It's a bit more code than the one-liner above, but I also think it's a lot
> > > >     more solid. Plus it would resolve the last big issue where i915 gem is
> > > >     fairly fundamentally different from ttm. On that front I think once
> > > >     Maarten's locking rework for i915 has landed and all the other ttm
> > > >     rework from you and Dave is in, we'll have resolved them all.
> > > > 
> > > > 
> > > > >    	/* But for DMA32 we limit ourself to only use 2GiB maximum. */
> > > > >    	num_dma32_pages = (u64)(si.totalram - si.totalhigh) * si.mem_unit;
> > > > > -- 
> > > > > 2.25.1
> > > > > 
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

