[rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)

Fri May 16 17:42:08 UTC 2025

On 5/16/25 18:41, Johannes Weiner wrote:
>>> Listen, none of this is even remotely new. This isn't the first cache
>>> we're tracking, and it's not the first consumer that can outlive the
>>> controlling cgroup.
>>
>> Yes, I knew about all of that and I find that extremely questionable
>> on existing handling as well.
> 
> This code handles billions of containers every day, but we'll be sure
> to consult you on the next redesign.

Well yes, please do so. I've been working on Linux for around 30 years now, and half of that on device memory management.

And the subsystems I maintain are used by literally billions of Android devices and HPC datacenters.

One of the reasons we don't have a good integration between device memory and cgroups is that specific requirements were ignored while designing cgroups.

That cgroups work for a lot of use cases doesn't mean that they work for all of them.

>> Memory pools which are only used to improve allocation performance
>> are something the kernel handles transparently and are completely
>> outside of any cgroup tracking whatsoever.
> 
> You're describing a cache. It doesn't matter whether it's caching CPU
> work, IO work or network packets.

A cache description doesn't really fit this pool here.

The memory properties are similar to what GFP_DMA or GFP_DMA32 provide.

The reason we haven't moved this into core memory management is that it is completely x86 specific and only used by a rather specific group of devices.

> What matters is what it takes to recycle those pages for other
> purposes - especially non-GPU purposes.

Exactly that, yes. Pages can be given back from the TTM pool to the core OS at any time. It just costs a bunch of extra CPU cycles.

> And more importantly, *what other memory in other cgroups they
> displace in the meantime*.

What do you mean by that?

Other cgroups are not affected by anything the allocating cgroup does, except for the extra CPU overhead while giving pages back to the core OS, but that is so negligible that we haven't even optimized this code path.

> It's really not that difficult to see an isolation issue here.
> 
> Anyway, it doesn't look like there is a lot of value in continuing
> this conversation, so I'm going to check out of this subthread.

Please don't!

I absolutely need to get this knowledge into somebody's head on the cgroup side or we will have this discussion over and over again.

Regards,
Christian.