[PATCH 12/17] ttm: add objcg pointer to bo and tt

David Airlie airlied at redhat.com
Thu Jul 3 05:53:37 UTC 2025


On Wed, Jul 2, 2025 at 6:24 PM Christian König <christian.koenig at amd.com> wrote:
>
> On 02.07.25 09:57, David Airlie wrote:
> >>>
> >>> It makes it easier now, but when we have to solve swapping, step one
> >>> will be moving all this code around to what I have now, and starting
> >>> from there.
> >>>
> >>> This just raises the bar to solving the next problem.
> >>>
> >>> We need to find incremental approaches to getting all the pieces of
> >>> the puzzle solved, or else we will still be here in 10 years.
> >>>
> >>> The steps I've formulated (none of them are perfect, but they all seem
> >>> better than status quo)
> >>>
> >>> 1. add global counters for pages - now we can at least see things in
> >>> vmstat and per-node
> >>> 2. add numa to the pool lru - we can remove our own numa code and
> >>> align with core kernel - probably doesn't help anything
> >>
> >> So far no objections from my side to that.
> >>
> >>> 3. add memcg awareness to the pool and pool shrinker.
> >>>     if you are on an APU with no swap configured - you have a much better time.
> >>>     if you are on a dGPU or APU with swap - you have a moderately
> >>> better time, but I can't see you having a worse time.
> >>
> >> Well that's what I'm strongly disagreeing on.
> >>
> >> Adding memcg to the pool has no value at all and complicates things massively when moving forward.
> >>
> >> What exactly should be the benefit of that?
> >
> > I'm already showing the benefit of the pool moving to memcg; we've
> > even talked about it multiple times on the list. It's not an
> > OMG-change-the-world benefit, but it definitely provides better
> > alignment between the pool and memcg allocations.
> >
> > We expose userspace API to allocate write combined memory, we do this
> > for all currently supported CPU/GPUs. We might think in the future we
> > don't want to continue to do this, but we do it now. My Fedora 42
> > desktop uses it, even if you say there is no need.
> >
> > If I allocate 100% of my memcg budget to WC memory, free it, then
> > allocate 100% of my budget to non-WC memory, we break container
> > containment as we can force other cgroups to run out of memory budget
> > and have to call the global shrinker.
>
> Yeah and that is perfectly intentional.

But it's wrong. We've been told by the mm/cgroup people that this
isn't correct behaviour and that we should fix it, and in order to
move forward with fixing our other problems, we should start with this
one. We are violating cgroup containment and we should endeavour to
stop doing so.

>
> > With this in place, the
> > container that allocates the WC memory also pays the price to switch
> > it back. Again this is just correctness, it's not going to fix any
> > major workloads, but I also don't think it should cause any
> > regressions, since it won't be worse than current worst case
> > expectation for most workloads.
>
> No, this is not correct behavior any more.
>
> Memory which is used by your cgroup is no longer available for allocations by another cgroup, nor given back to core memory management from the page pool. E.g. one cgroup can't steal memory from another cgroup any more.
>
> In other words, that reserves the memory for the cgroup and doesn't give it back to the global pool as soon as you free it.

But what is the big advantage of giving it back to the global pool
here? I'm pretty sure neither the worst case nor the steady state
behaviour will change, but the ability for one cgroup to help or
hinder another cgroup will be curtailed, which as far as I can see is
exactly what cgroup behaviour is meant to be. Each piece operates in
its own container and causes minimal disruption, good or bad, to other
containers.

> That would only be acceptable if we have per cgroup limit on the pool size which is *much* lower than the current global limit we have.

That is up to whoever configures the cgroup limits. If they say this
process should only have access to 1GB of RAM, then between normal RAM
and uncached/WC RAM they get 1GB, and if they need to move RAM between
those types via the ttm shrinker, it's all contained in that cgroup.
This isn't taking swapping into account, but we don't do that now
anyway.

>
> Maybe we could register a memcg aware shrinker, but not make the LRU memcg aware or something like that.
>
> As far as I can see that would give us the benefit of both approaches, the only problem is that we would have to do per cgroup counter tracking on our own.
>
> That's why I asked if we could have TTM pool specific variables in the cgroup.
>
> Another alternative would be to change the LRU so that we track per memcg, but allow stealing of pages between cgroups.

I just don't get why we'd want to steal pages; if you want sharing,
just put all the processes in the same cgroup. We leave it up to the
cgroup administrator to decide what they want to share between
processes. That policy shouldn't live in the driver/ttm layers; it
should be entirely configurable by the admin, and default to
reasonably sane behaviour.

If there is a system out there already using cgroups for containment
but relying on this cgroup bypass to share uncached/WC pages, then
clearly it's not a great system, and we should figure out how to fix
it. If we need a backwards-compat flag to turn this off, I'm fine with
that, but we've been told by the cgroup folks that it's not really
correct cgroup usage, and we should discourage it.

> > I understand we have to add more code to the tt level and that's fine,
> > I just don't see why you think starting at the bottom level is wrong?
> > it clearly has a use, and it's just cleaning up and preparing the
> > levels, so we can move up and solve the next problem.
>
> Because we don't have the necessary functionality to implement a memcg aware shrinker which moves BOs into swap there.

We need to have two levels of shrinker here. I'm not disputing that
the tt-level shrinker like xe has needs more work, but right now we
have two shrinkers that aren't aware of numa or memcg. I'd like to
start by reducing that to one for the corner case that nobody really
cares about but which would be good to get correct. Then we can work
on the swap/shrinker problem, which isn't this shrinker, and if after
we do that we find a single shrinker could take care of it all, then
we move towards that.

Dave.


