[rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)

Dave Airlie airlied at gmail.com
Fri May 16 20:25:02 UTC 2025


On Sat, 17 May 2025 at 06:04, Johannes Weiner <hannes at cmpxchg.org> wrote:
>
> On Fri, May 16, 2025 at 07:42:08PM +0200, Christian König wrote:
> > On 5/16/25 18:41, Johannes Weiner wrote:
> > >>> Listen, none of this is even remotely new. This isn't the first cache
> > >>> we're tracking, and it's not the first consumer that can outlive the
> > >>> controlling cgroup.
> > >>
> > >> Yes, I knew about all of that, and I find the existing handling
> > >> extremely questionable as well.
> > >
> > > This code handles billions of containers every day, but we'll be sure
> > > to consult you on the next redesign.
> >
> > Well yes, please do so. I've been working on Linux for around 30 years now and half of that on device memory management.
> >
> > And the subsystems I maintain are used by literally billions of Android devices and HPC datacenters.
> >
> > One of the reasons we don't have a good integration between device memory and cgroups is because specific requirements have been ignored while designing cgroups.
> >
> > That cgroups works for a lot of use cases doesn't mean that it does for all of them.
> >
> > >> Memory pools which are only used to improve allocation performance
> > >> are something the kernel handles transparently and are completely
> > >> outside of any cgroup tracking whatsoever.
> > >
> > > You're describing a cache. It doesn't matter whether it's caching CPU
> > > work, IO work or network packets.
> >
> > A cache description doesn't really fit this pool here.
> >
> > The memory properties are similar to what GFP_DMA or GFP_DMA32
> > provide.
> >
> > The reasons we haven't moved this into the core memory management is
> > because it is completely x86 specific and only used by a rather
> > specific group of devices.
>
> I fully understand that. It's about memory properties.
>
> What I think you're also saying is that the best solution would be
> that you could ask the core MM for pages with a specific property, and
> it would hand you pages that were previously freed with those same
> properties. Or, if no such pages are on the freelists, it would grab
> free pages with different properties and convert them on the fly.
>
> For all intents and purposes, this free memory would then be trivially
> fungible between drm use, non-drm use, and different cgroups - except
> for a few CPU cycles when converting but that's *probably* negligible?
> And now you could get rid of the "hack" in drm and wouldn't have to hang
> on to special-property pages or implement a shrinker at all.
>
> So far so good.
>
> But that just isn't the implementation of today. And the devil is very
> much in the details with this:
>
> Your memory attribute conversions are currently tied to a *shrinker*.
>
> This means the conversion doesn't trivially happen in the allocator,
> it happens from *reclaim context*.
>
> Now *your* shrinker is fairly cheap to run, so I do understand when
> you're saying in exasperation: We give this memory back if somebody
> needs it for other purposes. What *is* the big deal?
>
> The *reclaim context* is the big deal. The problem is *all the other
> shrinkers that run at this time as well*. Because you held onto those
> pages long enough that they contributed to a bonafide, general memory
> shortage situation. And *that* has consequences for other cgroups.

I think this is where we have 2 options:
(a) move this stuff into the core mm and out of shrinker context
(b) fix our shrinker to be cgroup-aware and solve that first (rough sketch below).
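
For (b), here's the rough shape of what I mean, built on the existing
memcg-aware shrinker + list_lru infrastructure. All the ttm_pool_*
names below are made up for illustration only, and a real version
would need per-order/per-caching-attribute lists plus __GFP_ACCOUNT
charging on the alloc side, but the structure would be roughly:

/*
 * Sketch only: track pooled pages on a memcg-aware list_lru and register
 * a SHRINKER_MEMCG_AWARE shrinker, so memory pressure inside a cgroup
 * only drains the pool pages that were charged to that cgroup.
 * Pages would enter the lru via list_lru_add() when returned to the pool.
 */
#include <linux/list_lru.h>
#include <linux/shrinker.h>
#include <linux/mm.h>

static struct list_lru ttm_pool_lru;		/* pooled pages, per-memcg */
static struct shrinker *ttm_pool_shrinker;

static enum lru_status ttm_pool_page_free(struct list_head *item,
					  struct list_lru_one *list,
					  void *cb_arg)
{
	struct page *p = list_entry(item, struct page, lru);

	/*
	 * A real version would move the page to a dispose list and do the
	 * set_memory_wb() conversion outside the lru lock before freeing.
	 */
	list_lru_isolate(list, item);
	__free_pages(p, 0);	/* real pool entries are higher-order + WC/UC */
	return LRU_REMOVED;
}

static unsigned long ttm_pool_shrink_count(struct shrinker *shrink,
					   struct shrink_control *sc)
{
	/* Only counts entries belonging to sc->memcg / sc->nid. */
	return list_lru_shrink_count(&ttm_pool_lru, sc);
}

static unsigned long ttm_pool_shrink_scan(struct shrinker *shrink,
					  struct shrink_control *sc)
{
	/* Walks only sc->memcg's sublist and frees pages back to the MM. */
	return list_lru_shrink_walk(&ttm_pool_lru, sc,
				    ttm_pool_page_free, NULL);
}

static int ttm_pool_shrinker_init(void)
{
	ttm_pool_shrinker = shrinker_alloc(SHRINKER_MEMCG_AWARE |
					   SHRINKER_NUMA_AWARE, "drm-ttm_pool");
	if (!ttm_pool_shrinker)
		return -ENOMEM;

	if (list_lru_init_memcg(&ttm_pool_lru, ttm_pool_shrinker)) {
		shrinker_free(ttm_pool_shrinker);
		return -ENOMEM;
	}

	ttm_pool_shrinker->count_objects = ttm_pool_shrink_count;
	ttm_pool_shrinker->scan_objects = ttm_pool_shrink_scan;
	shrinker_register(ttm_pool_shrinker);
	return 0;
}

With that in place, reclaim triggered by one cgroup hitting its limit
only drains the pool pages charged to that cgroup, which is the whole
point of (b).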

The main question I have for Christian is: can you give me a list of
use cases that this will seriously negatively affect if we proceed
with (b)?

From my naive desktop and HPC use-case scenarios, I'm not seeing a
massive hit; if anything, maybe more consistent application overheads
inside a cgroup.

Desktop use-case:
The user session and everything inside it (compositor, apps) are all
in a single cgroup, so any pooled memory will be reusable across the
whole of that user's active session. If there are multiple users,
they won't get the benefit of pages pooled by other users, but their
own pool will still be available.

HPC use-case:
One cgroup per application running in some sort of batch system.
There will be a downside at app launch if a bunch of other
applications have already run on the machine and filled the pool,
because a fresh cgroup won't see their pooled pages. But in the
cold-start case the app gets no worse behaviour than its current
worst case; it just gets that worst case consistently for its initial
allocations, instead of sometimes benefiting from memory pooled by
previously running cgroups. I'm not sure that benefit outweighs the
upside of containment here, since HPC users really want containers
contained.

Android? I've no idea.

So what can we live with here by default, vs what needs to be a
Kconfig option, vs what needs to be a kernel command line option?
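
For the command line side, as a purely illustrative sketch (the
parameter name is made up, nothing like it exists today), a single
module parameter would already give a boot-time opt-out via
ttm.memcg_pool=0, and a Kconfig symbol could pick the default:

/*
 * Hypothetical opt-out knob, so a regressed workload can boot with
 * ttm.memcg_pool=0 and get today's global-pool behaviour back.
 */
#include <linux/module.h>
#include <linux/moduleparam.h>

static bool ttm_memcg_pool = true;	/* default could come from Kconfig */
module_param_named(memcg_pool, ttm_memcg_pool, bool, 0444);
MODULE_PARM_DESC(memcg_pool,
		 "Track and reclaim TTM pool pages per cgroup (default: true)");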

I'm also happy to look at (a), but for (a) it's not just the uncached
pool that is the problem; the DMA pools will be harder to deal with.
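
To make that distinction concrete, here's the rough shape of the two
allocation styles (simplified, not the actual ttm_pool code): the
write-combined pool is plain system memory with a flipped caching
attribute, which is the part that could conceivably move behind a
core-MM interface; the coherent DMA pool is a device-visible buffer
from the DMA API, which can't ever look like a free page to the core
MM:

/*
 * Rough contrast only. The WC/UC pool is ordinary system memory whose
 * caching attribute has been changed, so in principle the page
 * allocator could own that conversion. The DMA pool comes out of the
 * DMA API and never sits on a core-MM freelist at all.
 */
#include <linux/gfp.h>
#include <linux/dma-mapping.h>
#include <asm/set_memory.h>

static struct page *wc_pool_style_alloc(unsigned int order)
{
	struct page *p = alloc_pages(GFP_USER | __GFP_ACCOUNT, order);

	if (p && set_memory_wc((unsigned long)page_address(p), 1 << order)) {
		__free_pages(p, order);
		return NULL;
	}
	return p;	/* must be set_memory_wb()'d again before freeing */
}

static void *dma_pool_style_alloc(struct device *dev, unsigned int order,
				  dma_addr_t *dma)
{
	/* Opaque to the page allocator: a (cpu addr, dma addr) pair. */
	return dma_alloc_coherent(dev, PAGE_SIZE << order, dma, GFP_KERNEL);
}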

Dave.

