[PATCH 13/18] ttm/pool: enable memcg tracking and shrinker. (v2)

Wed Aug 6 13:04:22 UTC 2025

On 06.08.25 04:43, Dave Airlie wrote:
>>>>>
>>>>> What we need is to reserve the memory on BO allocation and commit it when the TT backend is populated.
>>>>
>>>> I'm not sure what reserve vs. commit means here; mem cgroup is really
>>>> just "reserve until you can reserve no more". It's a single
>>>> charge/uncharge stage. If we try to charge and we are over the limit,
>>>> bad things will happen: either the allocation fails or reclaim runs
>>>> for the cgroup.
>>
>> Yeah, exactly that is what I think is highly problematic.
>>
>> When allocating a buffer on behalf of an application fails in the display server, you basically open up the possibility of a denial of service.
>>
>> E.g. imagine that an application allocates a 4GiB BO while its cgroup says it can only allocate 2GiB. That will work because the backing store is only allocated lazily. Now send that BO to the display server, and the command submission in the display server will fail with -ENOMEM because we exceed the application's cgroup limit.
>>
>> As far as I can see, we also need to limit how much an application can overcommit by creating BOs without backing store.
>>
>> Alternatively, we could disallow creating BOs without backing store, but that is a uAPI change and would break at least some use cases.
> 
> This is interesting, because I think the same DoS could exist today if
> the system is low on memory: I could allocate a giant unbacked BO and
> pass it to the display server, and when it goes to fill in the pages
> it could fail to allocate them and get -ENOMEM?

Yeah, that's perfectly possible. IIRC I already pointed out those problems when I first started working on radeon years ago.

See the patches for improving the OOM killer that I came up with ~10 years ago. They don't address exactly that problem, but they go in the same general direction.

The problem is that cgroups make those issues much worse, because you suddenly have not only the global limit of physically installed memory but also artificial limits set by the system administrator.

> Should we consider making buffer sharing cause population?

Good question, I haven't thought about that approach.

But we don't have a way to figure that out, do we? Except maybe when the BO is exported as a DMA-buf. Mhm, let me take a look at the code.
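To make the three options in this thread concrete, here is a minimal userspace model. It assumes memcg behaves as a single-stage charge that simply fails over the limit (the real kernel would try reclaim first). All names (cg_account, cg_try_charge, bo_create_lazy, bo_create_charged, bo_populate, bo_export) are hypothetical and are not the TTM or memcg API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-cgroup counter; NOT the kernel's struct mem_cgroup. */
struct cg_account {
    uint64_t limit;    /* limit set by the administrator, in bytes */
    uint64_t charged;  /* bytes currently charged; invariant: <= limit */
};

/* Single-stage charge: succeeds only if it stays within the limit.
 * The model fails immediately where real memcg would try reclaim first. */
static bool cg_try_charge(struct cg_account *cg, uint64_t bytes)
{
    if (bytes > cg->limit - cg->charged)
        return false;  /* the caller would see -ENOMEM */
    cg->charged += bytes;
    return true;
}

static void cg_uncharge(struct cg_account *cg, uint64_t bytes)
{
    assert(bytes <= cg->charged);
    cg->charged -= bytes;
}

/* Hypothetical BO whose backing pages may be allocated later. */
struct bo {
    uint64_t size;
    bool charged;  /* true once the size has been charged to a cgroup */
};

/* Option A: create without charging; the charge happens at population. */
static bool bo_create_lazy(struct bo *bo, uint64_t size)
{
    bo->size = size;
    bo->charged = false;
    return true;  /* always succeeds: nothing is charged yet */
}

/* Option B (reserve at creation): refuse overcommit up front. */
static bool bo_create_charged(struct cg_account *cg, struct bo *bo,
                              uint64_t size)
{
    bo->size = size;
    bo->charged = cg_try_charge(cg, size);
    return bo->charged;
}

/* Populating charges the owner's cgroup, possibly while running in a
 * different process (e.g. the display server's command submission). */
static bool bo_populate(struct cg_account *owner_cg, struct bo *bo)
{
    if (bo->charged)
        return true;  /* already charged: no double charge */
    bo->charged = cg_try_charge(owner_cg, bo->size);
    return bo->charged;
}

/* Option C (populate on sharing, e.g. DMA-buf export): any failure is
 * reported to the exporting application, not to the importer later. */
static bool bo_export(struct cg_account *exporter_cg, struct bo *bo)
{
    return bo_populate(exporter_cg, bo);
}
```

With a 2GiB limit, a lazily created 4GiB BO is accepted and only fails at population, which is the display-server scenario described above; option B rejects it at creation, and option C moves the failure to the export call in the application's own context.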

Christian.

> 
> Dave.