[rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
Johannes Weiner
hannes at cmpxchg.org
Tue May 13 07:54:46 UTC 2025
On Thu, May 08, 2025 at 08:03:49AM +1000, Dave Airlie wrote:
> On Thu, 8 May 2025 at 03:52, Johannes Weiner <hannes at cmpxchg.org> wrote:
> >
> > Hello Dave,
> >
> > On Fri, May 02, 2025 at 01:35:59PM +1000, Dave Airlie wrote:
> > > Hey all,
> > >
> > > This is my second attempt at adding the initial simple memcg/ttm
> > > integration.
> > >
> > > This varies from the first attempt in two major ways:
> > >
> > > 1. Instead of using __GFP_ACCOUNT and directly calling kmem charges
> > > for pool memory, and directly hitting the GPU statistic, Waiman
> > > suggested I just do what the network socket stuff did, which looks
> > > simpler. So this adds two new memcg APIs that wrap accounting.
> > > The pages no longer get assigned to the memcg; the charge is owned
> > > by the larger BO object, which makes more sense.
> >
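Concretely, the wrapper approach described in point 1 boils down to a
pair of driver-facing entry points, by analogy with the existing
socket helpers mem_cgroup_charge_skmem()/mem_cgroup_uncharge_skmem().
A rough sketch; the *_gpu names, struct gpu_bo and its memcg field are
hypothetical, not an existing kernel API:

/* Hypothetical wrappers mirroring the socket-memory helpers
 * mem_cgroup_charge_skmem()/mem_cgroup_uncharge_skmem(). */
bool mem_cgroup_charge_gpu(struct mem_cgroup *memcg,
                           unsigned int nr_pages, gfp_t gfp_mask);
void mem_cgroup_uncharge_gpu(struct mem_cgroup *memcg,
                             unsigned int nr_pages);

/* Driver side: the BO, not the individual pages, owns the charge;
 * bo->memcg is an assumed field, holding a reference taken at BO
 * creation time. */
static int gpu_bo_populate(struct gpu_bo *bo, unsigned int nr_pages)
{
        if (!mem_cgroup_charge_gpu(bo->memcg, nr_pages, GFP_KERNEL))
                return -ENOMEM;
        return 0;
}

static void gpu_bo_unpopulate(struct gpu_bo *bo, unsigned int nr_pages)
{
        mem_cgroup_uncharge_gpu(bo->memcg, nr_pages);
}
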
> > Unfortunately, this was bad advice :(
> >
> > Naked-charging like this is quite awkward from the memcg side. It
> > requires consumer-specific charge paths in the memcg code, adds stat
> > counters that are memcg-only with no system-wide equivalent, and it's
> > difficult for the memcg maintainers to keep track of the link between
> > what's in the counter and the actual physical memory that is supposed
> > to be tracked.
> >
> > The network and a few others like it are rather begrudging exceptions
> > because they do not have a suitable page context or otherwise didn't
> > fit the charging scheme. They're not good examples to follow if it can
> > at all be avoided.
>
> I unfortunately feel GPU might fit in this category as well: we aren't
> allocating pages in the traditional manner, so we have a number of
> problems with doing it at the page level.
>
> I think the biggest barrier to doing page-level tracking is the page
> pools we have to keep. When we need CPU-uncached pages, we allocate a
> bunch of pages and do the work to change their CPU mappings to be
> uncached. This work is costly, so when we no longer need the uncached
> pages in an object, we return them to a pool rather than to the
> central allocator. We then use a shrinker to empty the pool and undo
> the CPU mapping change. We also keep equivalent pools for DMA-able
> and 32-bit DMA-able pages if they are used.

Thanks for explaining the background, this is quite helpful. And I can
see where you're coming from with this.
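
If I'm reading it right, the lifecycle is roughly the following. This
is only a sketch with made-up helper names, not the actual TTM pool
code (set_pages_uc()/set_pages_wb() are the x86 cache-attribute
calls), and the uncharge point marked below is my understanding of the
proposed scheme:

static struct page *gpu_alloc_uncached(unsigned int order)
{
        struct page *p = pool_take(order);      /* hypothetical pool lookup */

        if (p)
                return p;                       /* mapping is already uncached */

        p = alloc_pages(GFP_KERNEL | __GFP_ACCOUNT, order);
        if (p)
                set_pages_uc(p, 1 << order);    /* the costly CPU mapping change */
        return p;
}

static void gpu_free_uncached(struct page *p, unsigned int order)
{
        /* Proposed scheme: uncharge the memcg here, on release into the
         * pool, and re-charge on reuse. A shrinker later drains the
         * pool, calling set_pages_wb() and __free_pages(). */
        pool_put(p, order);                     /* hypothetical; pages stay uncached */
}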

But I'm actually not so sure it's a good idea to uncharge upon release
into the pool. Those pages might be unused from a GPU driver POV, but
they're not readily available for, say, a heap fault in another group.
For example, AFAIU a cgroup can allocate all of its memory allowance
in GPU memory, free it back into the pool, then fill the resulting
headroom inside the group with other allocations, like page cache or
heap. This lets a cgroup directly trigger allocations worth up to
twice its memory allowance. Do that on an ongoing basis, and you can
really tie up other containers in the shrinkers picking up after you -
just to get to the memory they've been rightfully assigned.

That's a clear containment problem.

System-wide, if I'm looking at the right code, the pool can be up to
50% of memory. That's a lot of used memory not attributed to any
cgroup, even though each allocation is directly triggered by one.
Container people are no fans of big, unattributed pools like this.

Imagine you start with a certain amount of memory in the system and
have a number of containers you would like to run. How much global
memory headroom do you need to set aside, not assigned to any
particular cgroup, to get a gpu cache that serves all cgroups well?

Caching itself is not an issue. A large part of cgroup-tracked memory
is some kind of cache. But what's usually done is to make the cache
itself per cgroup, and tie the size and the lifetime of entries to the
group's memory allowance. That gets you isolation and control.

E.g. if one cgroup in the system is more important than the others,
you can let it have a bigger cache by increasing its memory
allowance. Whereas if the pool is shared, you have no control over
that - less important groups could consume the cache and negatively
impact hit rates of the important group.

So in the GPU case, you'd charge on allocation, free objects into a
cgroup-specific pool, and shrink using a cgroup-specific LRU
list. Freed objects can be reused by this cgroup, but nobody else.
They're reclaimed through memory pressure inside the cgroup, not due
to the action of others. And all allocated memory is accounted for.
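
As a very rough sketch of that shape: the kernel's memcg-aware
list_lru infrastructure already keeps one sublist per cgroup and plugs
into SHRINKER_MEMCG_AWARE shrinkers. The names below are hypothetical,
and the *_obj() helpers are the newer object-based variants:

static struct list_lru gpu_pool_lru;    /* list_lru_init_memcg() at init */

struct gpu_pool_entry {                 /* hypothetical pooled allocation */
        struct list_head lru;
        struct page *pages;
        unsigned int order;
};

static void gpu_pool_release(struct gpu_pool_entry *entry)
{
        /* Keep the memcg charge. Assuming the entry is allocated from an
         * accounted slab (__GFP_ACCOUNT/SLAB_ACCOUNT), the LRU derives
         * the owning cgroup from it and files it on that cgroup's list. */
        list_lru_add_obj(&gpu_pool_lru, &entry->lru);
}

static void gpu_pool_reuse(struct gpu_pool_entry *entry)
{
        /* Reuse stays within the owning cgroup; no charge/uncharge churn. */
        list_lru_del_obj(&gpu_pool_lru, &entry->lru);
}

A SHRINKER_MEMCG_AWARE shrinker would then use list_lru_shrink_count()
and list_lru_shrink_walk() with sc->memcg, so pooled entries are only
reclaimed by memory pressure inside their own cgroup.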

I have to admit I'm pretty clueless about the gpu driver internals and
can't really judge how feasible this is. But from a cgroup POV, if you
want proper memory isolation between groups, it seems to me that's the
direction you'd have to take this in.

> This means that if userspace allocates a large uncached object and we
> account it to the current memcg with __GFP_ACCOUNT, then when it gets
> handed back to the pool we have to remove it from the memcg
> accounting. This is where I used the memcg kmem charge/uncharge
> calls: uncharge on adding to the pool and charge on reuse. However,
> kmem seems like the wrong place to charge, even though it's where the
> initial __GFP_ACCOUNT will put those pages.

Ah, no need to worry about it. The name is just a historical memcgism,
from back when we first started charging "kernel" allocations, as
opposed to the conventional, pageable userspace memory. It's no longer
a super meaningful distinction, tbh.

You can still add a separate counter for GPU memory.
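
For instance (hypothetical, a MEMCG_GPU item does not exist upstream):
a new entry in enum memcg_stat_item plus a { "gpu", MEMCG_GPU } line
in memcontrol.c's memory_stats[] would surface it in memory.stat, the
same way the socket counter shows up as "sock", with the charge path
doing something like:

        mod_memcg_state(memcg, MEMCG_GPU, nr_pages);    /* next to the charge */
        mod_memcg_state(memcg, MEMCG_GPU, -nr_pages);   /* next to the uncharge */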

> This is why the suggestions to use the network solution worked better
> for me: I can do all the work outside the pool code at a slightly
> higher level (the same level where we charge/uncharge dmem), and I
> don't have to worry about handling the different pool types etc. I
> also don't think we need page->memcg to be set for these pages, as
> all pages are part of a large object, and the object is what gets
> swapped or freed, not parts of it.

I agree this doesn't need to be a goal in itself. It would just be a
side effect of charging through __GFP_ACCOUNT and uncharging inside
__free_pages(). What's more important is that the charge lifetime is
correlated with the actual memory allocation.
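
I.e. something like the following, where the charge lifetime equals
the page lifetime (sketch only, the gpu_*_backing() names are made
up):

static struct page *gpu_alloc_backing(unsigned int order)
{
        /* charged to the allocating task's cgroup by the page allocator */
        return alloc_pages(GFP_KERNEL | __GFP_ACCOUNT, order);
}

static void gpu_free_backing(struct page *pages, unsigned int order)
{
        /* the uncharge happens in here, when the pages actually go back
         * to the system - not when they move in and out of a pool */
        __free_pages(pages, order);
}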