[PATCH 13/18] ttm/pool: enable memcg tracking and shrinker. (v2)
Christian König
christian.koenig at amd.com
Mon Aug 4 09:22:05 UTC 2025
Sorry for the delayed response, just back from vacation.
On 22.07.25 01:16, David Airlie wrote:
>>>> @@ -162,7 +164,10 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
>>>> p = alloc_pages_node(pool->nid, gfp_flags, order);
>>>> if (p) {
>>>> p->private = order;
>>>> - mod_node_page_state(NODE_DATA(page_to_nid(p)), NR_GPU_ACTIVE, (1 << order));
>>>> + if (!mem_cgroup_charge_gpu_page(objcg, p, order, gfp_flags, false)) {
>>>
>>> Thinking more about it, that is way too late. At this point we can't fail the allocation any more.
>>>
>>
>> I've tested that it at least works, but there is a bit of a problem
>> with it: if we fail an order-10 allocation, it tries to fall back
>> down the order hierarchy, even though there is no point since it
>> can't account the maximum size anyway.
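For reference, doing the cgroup charge once for the whole request before walking the orders would avoid exactly those pointless retries; a failed charge could then abort the allocation immediately. Rough sketch only, the bulk helper below is made up and not an existing API:

    /* in ttm_pool_alloc(), before the order fallback loop:
     * charge the full request once instead of per page, so hitting
     * the cgroup limit fails fast (hypothetical bulk helper) */
    if (!mem_cgroup_charge_gpu_pages(objcg, tt->num_pages, gfp_flags))
            return -ENOMEM;

With something like that, the per-page charge in ttm_pool_alloc_page() would not need to be able to fail at all.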
>>
>>> Otherwise we either completely break suspend or no longer account system allocations correctly after resume.
>>
>> When you say suspend here, do you mean for VRAM allocations? Normal
>> system RAM allocations, which are accounted here, shouldn't have any
>> effect on suspend/resume since they stay where they are. Currently it
>> also doesn't try to account for evictions at all.
Good point, I was not considering moves during suspend as evictions. But from the code flow that should indeed work for now.
What I meant is that after resume BOs are usually not moved back into VRAM immediately. Filling VRAM is rate-limited to allow desktop applications to respond quickly after resume.
So at least temporarily we hopelessly overcommit system memory after resume. But that problem potentially goes into the same bucket as general eviction.
> I've just traced the global swapin/out paths as well and those seem
> fine for memcg at this point, since they are called only after
> populate/unpopulate. Now I haven't addressed the new xe swap paths,
> because I don't have a test path, since amdgpu doesn't support those.
> I was thinking I'd leave it on the list for when amdgpu goes to that
> path, or I can spend some time on xe.
I would really prefer that, before we commit this, we have patches for both amdgpu and XE which at least demonstrate the functionality.
We are essentially defining uAPI here, and when that goes wrong we can't change it any more once people start depending on it.
>
> Dave.
>
>>>
>>> What we need is to reserve the memory on BO allocation and commit it when the TT backend is populated.
>>
>> I'm not sure what reserve vs commit means here; memcg is really just
>> reserve until you can reserve no more, it's a single charge/uncharge
>> stage. If we try to charge and we are over the limit, bad things will
>> happen: either the allocation fails or the cgroup gets reclaimed.
Yeah, exactly that is what I think is highly problematic.
When the allocation of a buffer for an application fails in the display server, you basically open up the possibility of a denial of service.
E.g. imagine that an application allocates a 4GiB BO while its cgroup says it can only allocate 2GiB. That will work because the backing store is only allocated lazily. Now send that BO to the display server, and the command submission in the display server will fail with -ENOMEM because we exceed the cgroup of the application.
As far as I can see we also need to limit how much an application can overcommit by creating BOs without backing store.
Alternatively, we could disallow creating BOs without backing store, but that is an uAPI change and will break at least some use cases.
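To make the reserve vs. commit idea a bit more concrete, here is a very rough sketch; none of these helpers exist today, the names are just made up for illustration:

    /* at BO creation: only reserve the size against the creator's
     * cgroup, so exceeding the limit fails in the application that
     * created the BO, not later in the display server */
    if (!mem_cgroup_try_reserve_gpu(objcg, bo->base.size))
            return -ENOMEM;

    /* in ttm_tt_populate(): turn the reservation into the real charge
     * once the backing store is actually allocated */
    mem_cgroup_commit_gpu_charge(objcg, tt->num_pages << PAGE_SHIFT);

    /* and drop the reservation again if the BO is destroyed without
     * ever having been populated */
    mem_cgroup_unreserve_gpu(objcg, bo->base.size);

That would also give us a natural place to limit how much an application can overcommit with unpopulated BOs.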
Regards,
Christian.
>>
>> Regards,
>> Dave.
>