[rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
Johannes Weiner
hannes at cmpxchg.org
Fri May 16 20:04:23 UTC 2025
On Fri, May 16, 2025 at 07:42:08PM +0200, Christian König wrote:
> On 5/16/25 18:41, Johannes Weiner wrote:
> >>> Listen, none of this is even remotely new. This isn't the first cache
> >>> we're tracking, and it's not the first consumer that can outlive the
> >>> controlling cgroup.
> >>
> >> Yes, I knew about all of that and I find that extremely questionable
> >> on existing handling as well.
> >
> > This code handles billions of containers every day, but we'll be sure
> > to consult you on the next redesign.
>
> Well yes, please do so. I've been working on Linux for around 30 years now and half of that on device memory management.
>
> And the subsystems I maintain are used by literally billions of Android devices and HPC datacenters.
>
> One of the reasons we don't have a good integration between device memory and cgroups is because specific requirements have been ignored while designing cgroups.
>
> That cgroups works for a lot of use cases doesn't mean that it does for all of them.
>
> >> Memory pools which are only used to improve allocation performance
> >> are something the kernel handles transparently and are completely
> >> outside of any cgroup tracking whatsoever.
> >
> > You're describing a cache. It doesn't matter whether it's caching CPU
> > work, IO work or network packets.
>
> A cache description doesn't really fit this pool here.
>
> The memory properties are similar to what GFP_DMA or GFP_DMA32
> provide.
>
> The reasons we haven't moved this into the core memory management is
> because it is completely x86 specific and only used by a rather
> specific group of devices.
I fully understand that. It's about memory properties.
What I think you're also saying is that the best solution would be
that you could ask the core MM for pages with a specific property, and
it would hand you pages that were previously freed with those same
properties. Or, if no such pages are on the freelists, it would grab
free pages with different properties and convert them on the fly.
For all intents and purposes, this free memory would then be trivially
fungible between drm use, non-drm use, and different cgroups - except
for a few CPU cycles when converting but that's *probably* negligible?
And now you could get rid of the "hack" in drm and wouldn't have to
hang on to special-property pages or implement a shrinker at all.
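Something along these lines, purely as a sketch: alloc_pages_wc() and
free_pages_wc() are names I'm making up here, while set_memory_wc()
and set_memory_wb() are the existing x86 helpers for flipping the
caching attribute of the linear map.

/* hypothetical: hand out write-combined pages straight from the buddy */
struct page *alloc_pages_wc(gfp_t gfp, unsigned int order)
{
        struct page *page = alloc_pages(gfp, order);

        if (!page)
                return NULL;

        /*
         * No WC freelist exists today, so convert on the fly; the
         * ideal allocator would check a matching freelist first.
         */
        if (set_memory_wc((unsigned long)page_address(page), 1 << order)) {
                __free_pages(page, order);
                return NULL;
        }
        return page;
}

/* hypothetical: convert back and give the page to the buddy on free */
void free_pages_wc(struct page *page, unsigned int order)
{
        set_memory_wb((unsigned long)page_address(page), 1 << order);
        __free_pages(page, order);
}

The important property is that the conversion happens in the
allocation and free paths themselves, so free memory stays on the
normal freelists between users and is charged through the usual
(memcg-aware) paths at allocation time.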
So far so good.
But that just isn't the implementation we have today. And the devil
is very much in the details here:
Your memory attribute conversions are currently tied to a *shrinker*.
This means the conversion doesn't trivially happen in the allocator,
it happens from *reclaim context*.
Now *your* shrinker is fairly cheap to run, so I do understand when
you're saying in exasperation: We give this memory back if somebody
needs it for other purposes. What *is* the big deal?
The *reclaim context* is the big deal. The problem is *all the other
shrinkers that run at this time as well*. Because you held onto those
pages long enough that they contributed to a bona fide, general memory
shortage situation. And *that* has consequences for other cgroups.
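To make the "tied to a *shrinker*" point concrete, this is roughly
the shape of such a pool shrinker with the current shrinker API. The
callback names and the pool bookkeeping (pool_page_count,
pool_release_pages) are placeholders for illustration, not the actual
ttm_pool.c code:

static atomic_long_t pool_page_count;   /* placeholder bookkeeping */

/* placeholder: convert up to nr pooled pages back and free them */
static unsigned long pool_release_pages(unsigned long nr);

static unsigned long pool_shrink_count(struct shrinker *shrink,
                                       struct shrink_control *sc)
{
        unsigned long count = atomic_long_read(&pool_page_count);

        return count ?: SHRINK_EMPTY;
}

static unsigned long pool_shrink_scan(struct shrinker *shrink,
                                      struct shrink_control *sc)
{
        return pool_release_pages(sc->nr_to_scan);
}

static int pool_shrinker_init(void)
{
        struct shrinker *s = shrinker_alloc(0, "example-pool");

        if (!s)
                return -ENOMEM;

        s->count_objects = pool_shrink_count;
        s->scan_objects = pool_shrink_scan;
        s->seeks = DEFAULT_SEEKS;
        shrinker_register(s);
        return 0;
}

pool_shrink_scan() is only ever invoked from reclaim, i.e. after
somebody is already struggling to get memory. That context is what
the rest of this mail is about.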
> > What matters is what it takes to recycle those pages for other
> > purposes - especially non-GPU purposes.
>
> Exactly that, yes. From the TTM pool pages can be given back to the
> core OS at any time. It's just a bunch of extra CPU cycles.
>
> > And more importantly, *what other memory in other cgroups they
> > displace in the meantime*.
>
> What do you mean with that?
>
> Other cgroups are not affected by anything the allocating cgroup
> does, except for the extra CPU overhead while giving pages back to
> the core OS, but that is negligible; we haven't even optimized this
> code path.
I hope the answer to this question is apparent now.
But to illustrate the problem better, let's consider the following
container setup.
A system has 10G of memory. You run two cgroups on it that each have
a limit of 5G:
          system (10G)
           /        \
       A (5G)      B (5G)
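(Concretely, that setup is nothing more than two memory.max limits.
A sketch, assuming cgroup2 is mounted at /sys/fs/cgroup - an echo
from a shell does the same thing:)

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* create /sys/fs/cgroup/<name> and set its memory.max */
static void make_limited_cgroup(const char *name, const char *limit)
{
        char path[128];
        int fd;

        snprintf(path, sizeof(path), "/sys/fs/cgroup/%s", name);
        mkdir(path, 0755);

        snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/memory.max", name);
        fd = open(path, O_WRONLY);
        if (fd < 0)
                return;
        if (write(fd, limit, strlen(limit)) < 0)
                perror("memory.max");
        close(fd);
}

int main(void)
{
        make_limited_cgroup("A", "5G");
        make_limited_cgroup("B", "5G");
        return 0;
}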
Let's say A is running some database and is using its full 5G.
B is first doing some buffered IO, instantiating up to 5G worth of
file cache. Since the file cache is cgroup-aware, those pages will go
onto the private LRU list of B. And they will remain there until those
cache pages are fully reclaimed.
B then malloc()s. Because it's at the cgroup limit, it's forced into
cgroup reclaim on its private LRUs, where it recycles some of its old
page cache to satisfy the heap request.
A was not affected by anything that occurred in B.
---
Now let's consider the same starting scenario, but instead B is
interacting with the gpu driver and creates 5G worth of ttm objects.
Once it's done with them, you put the pages into the pool and uncharge
the memory from B.
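The step that matters is this one, in pseudo-kernel-C (the helper
names are invented; this is just the shape of the pattern, not the
actual RFC or ttm_pool.c code):

/*
 * On buffer destruction the page is parked in the driver pool, but
 * the memcg charge taken at allocation time is dropped.  B's memory
 * counter goes back down while the page itself stays unavailable to
 * everybody else until a shrinker runs.
 */
static void pool_give_back(struct list_head *pool_list, struct page *page)
{
        uncharge_page_from_memcg(page);         /* invented name */
        list_add(&page->lru, pool_list);        /* page stays in the driver */
}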
Now B malloc()s again. The cgroup is not maxed out - in fact, it's
empty. So no cgroup reclaim happens.
However, at this point, A has 5G allocated, and there are still 5G in
the drm driver. The *system itself* is out of memory now.
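At that point the books look like this:

        charged to A:            5G  (the database)
        charged to B:            0G  (its pool pages were uncharged)
        sitting in the ttm pool: 5G
        free in the system:     ~0G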
So B enters *global* reclaim to find pages for its heap request.
It invokes all the shrinkers and runs reclaim on all cgroups.
In the process, it will claw back some pages from ttm; but it will
*also* reclaim all kinds of caches, maybe even swap stuff, from A!
Now *A* starts faulting due to the pages that were stolen and tries to
allocate. But memory is still full, because B backfilled it with heap.
So now *A* goes into global reclaim as well: it takes some pages from
the ttm pool, some from B, *and some from itself*.
It can take several iterations of this until the ttm pool has been
fully drained, and A has all its memory faulted back in and is running
at full speed again.
In this scenario, A was directly affected, potentially quite severely,
by B's actions.
This is the definition of containment failure.
Consider two aggravating additions to this scenario:
1. While A *could* in theory benefit from the pool pages too, let's
say it never interacts with the GPU at all. So it's paying the cost
for something that *only benefits B*.
2. B is malicious. It does the above sequence - interact with ttm, let
the pool escape its cgroup, then allocate heap - rapidly over and
over again. It effectively DoSes A.
---
So as long as the pool is implemented as it is today, it should very
much be per cgroup.
Reclaim lifetime means it can displace other memory with reclaim
lifetime.
You cannot assume other cgroups benefit from this memory.
You cannot trust other cgroups to play nice.