list_lru operation for new child memcg?

Dave Chinner david at fromorbit.com
Mon May 26 23:49:27 UTC 2025


On Tue, May 27, 2025 at 08:30:22AM +1000, Dave Airlie wrote:
> On Tue, 27 May 2025 at 08:08, Dave Chinner <david at fromorbit.com> wrote:
> >
> > On Tue, May 27, 2025 at 06:32:30AM +1000, Dave Airlie wrote:
> > > Hey all,
> > >
> > > Hope someone here can help me work this out, I've been studying
> > > list_lru a bit this week for possible use in the GPU driver memory
> > > pool code.
> > >
> > > I understand that when a cgroup goes away, its LRU resources get
> > > reparented into the parent's; however, I'm wondering about
> > > operation in the opposite direction and whether this is possible or
> > > something we'd like to add.
> >
> > It's possible, but you need to write the code yourself.
> >
> > You might want to look at the zswap code, it has a memcg-aware
> > global object LRU that charges individual entries to the memcg that
> > use space in the pool.
> >
> > > Scenario:
> > > 1. Toplevel cgroup - empty LRU
> > > 2. Child cgroup A created, adds a bunch of special pages to the LRU
> > > 3. Child cgroup A dies, pages in lru list get reparented to toplevel cgroup
> > > 4. Child cgroup B created. Now if B wants to get special pages from
> > > the pool, is it possible for B to get access to the LRU from the
> > > toplevel cgroup automatically?
> > >
> > > Ideally B would take pages from the
> > > parent LRU, put them back into its own LRU, and then reuse the ones
> > > from its LRU, only finally allocating new special pages once it has
> > > none and the parent cgroup has none as well.
> >
> > The list_lru has nothing to do with what context gets a new
> > reference to the objects on the LRU. This is something that your
> > pool object lookup/allocation interface would do.
> >
> > If your lookup interface is cgroup aware, it can look up the parent,
> > search its pool, and dequeue from the LRU via:
> >
> >         parent_memcg = parent_mem_cgroup(child_memcg);
> >         <lookup object>
> >         list_lru_del(<object> ..., parent_memcg);
> >
> > When the child is done with it, it can add it back to
> > its own LRU via:
> >
> >         list_lru_add(...., child_memcg).
> 
> Thanks Dave,
> 
> So this seems like something that would need to recurse up to the root
> cgroup, which makes me wonder if generic code could/should provide it.
> 
> list_lru_walk_node already does a bit of policy here, where it walks
> the non-memcg lru, then walks the per-memcg ones,

That's part of the generic "walk everything in the LRU" API
functionality for list_lru. It isn't policy at all - if a caller
wants to iterate the entire LRU (e.g. to purge it), we have to walk
all the memcgs to do that, i.e. the memcg walk is an API
implementation detail required for correct behaviour of memcg-aware
list_lrus.
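
As an aside, that "walk everything" API is all a consumer needs to
purge its LRU completely, no matter which memcgs the objects are
charged to. A rough, untested sketch - my_obj, my_purge_isolate()
and my_pool_purge() are made-up names, and the three-argument
isolate callback assumes a recent kernel (older kernels also pass
the per-list lock to the callback):

        #include <linux/list_lru.h>
        #include <linux/slab.h>

        struct my_obj {
                struct list_head lru;   /* linked into the list_lru */
                /* pool-specific fields ... */
        };

        static enum lru_status my_purge_isolate(struct list_head *item,
                                                struct list_lru_one *list,
                                                void *cb_arg)
        {
                struct list_head *freeable = cb_arg;

                /* Take the object off the LRU and collect it for freeing. */
                list_lru_isolate_move(list, item, freeable);
                return LRU_REMOVED;
        }

        static void my_pool_purge(struct list_lru *lru)
        {
                LIST_HEAD(freeable);
                struct my_obj *obj, *next;

                /*
                 * Walks the root ("non-memcg") list and then every
                 * per-memcg list on every node - the memcg iteration is
                 * internal to list_lru_walk().
                 */
                list_lru_walk(lru, my_purge_isolate, &freeable, ULONG_MAX);

                list_for_each_entry_safe(obj, next, &freeable, lru) {
                        list_del(&obj->lru);
                        kfree(obj);
                }
        }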

It also isn't an ordered scan at all - it iterates the memcgs by
increasing memcg ID and so should be considered a random-order scan
in terms of the cgroup hierarchy. The xarray index is maintained
internally by the list_lru infrastructure to optimise reparenting
and "walk everything" operations - it only tracks which memcgs
actually have objects stored in this list_lru. It is not intended
to be in any way ordered or visible externally.

Note that the "non-memcg" lru is actually the root memcg in a
list_lru that is configured with memcg support. Hence doing a
heirarchical top-down walk will walk the "non-memcg" LRU first.
e.g. see drop_slab_node() for the iteration, and how shrink_slab()
specifically handles the root memcg differently to redirect it at
the "non-memcg" shrinker control configuration that passes a NULL
memcg and hence operates on the "non-memcg" LRU.

This tight integration between the shrinkers and list_lru comes
about from two things: memcg support can be compiled out of the
kernel, and there are shrinkers and list_lrus that are not memcg
aware. In both of these cases we do not track objects in, or iterate,
memcgs. The code is written this way because it has to support both
static compile-time memcg disablement and dynamic runtime selection
of memcg-awareness in the list_lru.
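
The runtime selection is made when the consumer initialises its
list_lru - a minimal sketch, with my_pool_lru and my_pool_lru_setup()
as placeholder names:

        #include <linux/list_lru.h>

        static struct list_lru my_pool_lru;

        static int my_pool_lru_setup(struct shrinker *shrinker)
        {
                /*
                 * list_lru_init() sets up a plain per-node LRU;
                 * list_lru_init_memcg() makes the same structure
                 * memcg-aware (which collapses back to the plain case
                 * when memcg support is compiled out).
                 */
                return list_lru_init_memcg(&my_pool_lru, shrinker);
        }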

> I kinda need that but in reverse, where it walks the memcg, then its
> ancestors, then the non-memcg lru, just wondering if that makes sense
> in common code like list_lru_walk_node does?

Iterating cgroups in a specific order is not generic list_lru
functionality. Iterating cgroups is quite complex and requires
locking and reference counting to do correctly. e.g. look at
the top-down hierarchy walk implemented by mem_cgroup_iter().

That sort of complexity does not belong in list_lru - if you need to
walk memcgs in a specific order, you should do so externally by
following all the cgroup-specific rules for access and lifetimes.
Then you can use the node/memcg-aware list_lru APIs to perform
the manipulations you need on the specific internal list_lru list
you have already guaranteed will exist and be safe to access.
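
For your bottom-up case, that external walk can be as simple as
chasing parent_mem_cgroup() from the child towards the root. An
untested sketch: my_pool_try_take() is a hypothetical pool helper
that dequeues one object under the pool's own locking and does the
matching list_lru_del() for that {nid, memcg}, my_obj is as in the
sketch above, and the caller is assumed to hold a reference on the
child memcg (which pins its ancestors):

        static struct my_obj *my_pool_reuse(struct list_lru *lru, int nid,
                                            struct mem_cgroup *memcg)
        {
                /*
                 * Walk from the child memcg up to the root memcg looking
                 * for a {nid, memcg} LRU with something reusable on it.
                 * parent_mem_cgroup() returns NULL at the root, which
                 * terminates the walk.
                 */
                do {
                        if (list_lru_count_one(lru, nid, memcg)) {
                                struct my_obj *obj;

                                obj = my_pool_try_take(lru, nid, memcg);
                                if (obj)
                                        return obj;
                        }
                        memcg = parent_mem_cgroup(memcg);
                } while (memcg);

                return NULL;    /* nothing cached - allocate new pages */
        }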

> > > I'm just not seeing where the code for 4 happens, but I'm not fully
> > > across this all yet either,
> >
> > You won't find it, because it doesn't do 4) at all - that's consumer
> > side functionality, not generic functionality. If you want to have a
> > pool that is owned by a parent memcg and charge/track it to a child
> > memcg on allocation, then you need to write the pool management code
> > that performs this management. The APIs are there to build this sort
> > of thing, but it's not generic functionality the list_lru provides.
> 
> I have the pool bits, I just wasn't sure how generic the code to
> traverse the memcg LRUs from the child to the root, to see if any level
> has some pages in its LRU, could be.

Once you have your node/cgroup iteration sorted, you can call
list_lru_count_one(lru, nid, memcg) to quickly and safely check
whether the LRU for that {node, memcg} pair contains anything. If it
does, then you can traverse it.

This two-phase "low overhead count/costly scan" API is how
memcg-aware shrinkers efficiently skip empty LRUs. The actual
node/memcg iteration is completely external to list_lru, the
list-lru infrastructure simply provides efficient APIs to filter
which {node, memcg} tuples need to be worked on.
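
In shrinker terms that pattern looks roughly like this - a sketch
only, reusing the my_purge_isolate() callback and my_pool_lru
placeholders from the sketches above and omitting the shrinker
registration, which differs between kernel versions:

        static unsigned long my_pool_count(struct shrinker *shrink,
                                           struct shrink_control *sc)
        {
                /* Cheap: just read the count for sc->nid / sc->memcg. */
                return list_lru_shrink_count(&my_pool_lru, sc);
        }

        static unsigned long my_pool_scan(struct shrinker *shrink,
                                          struct shrink_control *sc)
        {
                LIST_HEAD(freeable);
                unsigned long freed;
                struct my_obj *obj, *next;

                /* Costly: only reached if the count above was non-zero. */
                freed = list_lru_shrink_walk(&my_pool_lru, sc,
                                             my_purge_isolate, &freeable);

                list_for_each_entry_safe(obj, next, &freeable, lru) {
                        list_del(&obj->lru);
                        kfree(obj);
                }
                return freed;
        }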

> I can write it in the consumer, but I do
> think it's quite like list_lru_walk_node just with a different
> allocation strategy.

I disagree - specifically ordered memcg traversal is not something
that the list_lru implementation currently does, nor something it
should be doing.

-Dave.
-- 
Dave Chinner
david at fromorbit.com

