[PATCH 12/17] ttm: add objcg pointer to bo and tt

Christian König christian.koenig at amd.com
Wed Jul 2 08:24:05 UTC 2025


On 02.07.25 09:57, David Airlie wrote:
>>>
>>> It makes it easier now, but when we have to solve swapping, step one
>>> will be moving all this code around to what I have now, and starting
>>> from there.
>>>
>>> This just raises the bar to solving the next problem.
>>>
>>> We need to find incremental approaches to getting all the pieces of
>>> the puzzle solved, or else we will still be here in 10 years.
>>>
>>> The steps I've formulated (none of them are perfect, but they all seem
>>> better than the status quo):
>>>
>>> 1. add global counters for pages - now we can at least see things in
>>> vmstat and per-node
>>> 2. add numa to the pool lru - we can remove our own numa code and
>>> align with core kernel - probably doesn't help anything
>>
>> So far no objections from my side to that.
>>
>>> 3. add memcg awareness to the pool and pool shrinker.
>>>     if you are on an APU with no swap configured - you have a lot better time.
>>>     if you are on a dGPU or APU with swap - you have a moderately
>>> better time, but I can't see you having a worse time.
>>
>> Well that's what I'm strongly disagreeing on.
>>
>> Adding memcg to the pool has no value at all and complicates things massively when moving forward.
>>
>> What exactly should be the benefit of that?
> 
> I'm already showing the benefit of the pool moving to memcg, we've
> even talked about it multiple times on the list. It's not an OMG
> change-the-world benefit, but it definitely provides better alignment
> between the pool and memcg allocations.
> 
> We expose userspace API to allocate write combined memory, we do this
> for all currently supported CPU/GPUs. We might think in the future we
> don't want to continue to do this, but we do it now. My Fedora 42
> desktop uses it, even if you say there is no need.
> 
> If I allocate 100% of my memcg budget to WC memory, free it, then
> allocate 100% of my budget to non-WC memory, we break container
> containment as we can force other cgroups to run out of memory budget
> and have to call the global shrinker.

Yeah and that is perfectly intentional.

> With this in place, the
> container that allocates the WC memory also pays the price to switch
> it back. Again this is just correctness, it's not going to fix any
> major workloads, but I also don't think it should cause any
> regressions, since it won't be worse than current worst case
> expectation for most workloads.

No, this is not correct behavior any more.

Memory which is used by your cgroup is no longer available for allocations by another cgroup, nor is it given back to core memory management from the page pool. E.g. one cgroup can't steal memory from another cgroup any more.

In other words, that reserves the memory for the cgroup and doesn't give it back to the global pool as soon as you free it.

That would only be acceptable if we had a per-cgroup limit on the pool size which is *much* lower than the current global limit we have.

Maybe we could register a memcg-aware shrinker, but not make the LRU memcg aware, or something like that.

As far as I can see that would give us the benefit of both approaches; the only problem is that we would have to do per-cgroup counter tracking on our own.
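For reference, a rough sketch of what such a registration could look like against the current (post-6.7) shrinker API. The two ttm_pool_* helpers are hypothetical and stand in for exactly the per-cgroup counter tracking TTM would have to do itself; this is a non-compilable sketch, not a proposal.

```c
#include <linux/shrinker.h>
#include <linux/memcontrol.h>

static unsigned long ttm_pool_shrink_count(struct shrinker *shrink,
					   struct shrink_control *sc)
{
	/* sc->memcg identifies the cgroup reclaim is targeting;
	 * hypothetical helper backed by our own per-cgroup counters. */
	return ttm_pool_memcg_count(sc->memcg);
}

static unsigned long ttm_pool_shrink_scan(struct shrinker *shrink,
					  struct shrink_control *sc)
{
	/* Walk the (non-memcg-aware) pool LRU, but only release pages
	 * charged to sc->memcg, up to sc->nr_to_scan.  Hypothetical. */
	return ttm_pool_free_from_memcg(sc->memcg, sc->nr_to_scan);
}

static struct shrinker *ttm_pool_shrinker;

static int ttm_pool_shrinker_init(void)
{
	ttm_pool_shrinker = shrinker_alloc(SHRINKER_MEMCG_AWARE,
					   "ttm-pool");
	if (!ttm_pool_shrinker)
		return -ENOMEM;
	ttm_pool_shrinker->count_objects = ttm_pool_shrink_count;
	ttm_pool_shrinker->scan_objects = ttm_pool_shrink_scan;
	shrinker_register(ttm_pool_shrinker);
	return 0;
}
```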

That's why I asked if we could have TTM pool specific variables in the cgroup.

Another alternative would be to change the LRU so that we track per memcg, but allow stealing of pages between cgroups.

> I'm not just adding memcg awareness to the pool though, that is just
> completeness, I'm adding memcg awareness to all GPU system memory
> allocations, and making sure swapout works (which it does), swapin
> probably needs more work.
> 
> The future work is integrating ttm swap mechanisms with memcg to get it right.
>>>
>>> Accounting at the resource level makes stuff better, but I don't
>>> believe after implementing it that it is consistent with solving the
>>> overall problem.
>>
>> Exactly, that's my point. See, accounting is no problem at all; that can be done on any possible level.
>>
>> What is tricky is shrinking, e.g. either core MM or memcg asking to reduce the usage of memory and moving things into swap.
>>
>> And that can only be done either on the resource level or the tt object, but not the pool level.
> 
> I understand we have to add more code to the tt level and that's fine,
> I just don't see why you think starting at the bottom level is wrong?
> It clearly has a use, and it's just cleaning up and preparing the
> levels, so we can move up and solve the next problem.

Because we don't have the necessary functionality there to implement a memcg-aware shrinker which moves BOs into swap.

>> The whole TTM pool exists to aid a 28-year-old HW design which has no practical relevance on modern systems, and we should really avoid touching it in any way possible.
> 
> Modern systems are still using it, I'm still seeing WC allocations,
> they still seem to have some cost associated with them on x86-64, they
> certainly aren't free. I don't care if they aren't practical, but if
> they are a way to route around container containment, they need to be
> fixed.

Completely agree with that, but we can't regress anything existing while doing so.

And I'm pretty sure that we do have use cases which rely on that behavior, so those use cases will regress.

Christian.

> 
> Dave.
> 



More information about the dri-devel mailing list