[PATCH 12/17] ttm: add objcg pointer to bo and tt

David Airlie airlied at redhat.com
Tue Jul 1 22:11:10 UTC 2025


On Tue, Jul 1, 2025 at 6:16 PM Christian König <christian.koenig at amd.com> wrote:
>
> On 01.07.25 10:06, David Airlie wrote:
> > On Tue, Jul 1, 2025 at 5:22 PM Christian König <christian.koenig at amd.com> wrote:
> >>>>> diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
> >>>>> index 15d4019685f6..c13fea4c2915 100644
> >>>>> --- a/include/drm/ttm/ttm_tt.h
> >>>>> +++ b/include/drm/ttm/ttm_tt.h
> >>>>> @@ -126,6 +126,8 @@ struct ttm_tt {
> >>>>>       enum ttm_caching caching;
> >>>>>       /** @restore: Partial restoration from backup state. TTM private */
> >>>>>       struct ttm_pool_tt_restore *restore;
> >>>>> +     /** @objcg: Object cgroup for this TT allocation */
> >>>>> +     struct obj_cgroup *objcg;
> >>>>>  };
> >>>>
> >>>> We should probably keep that out of the pool and account the memory to the BO instead.
> >>>>
> >>>
> >>> I tried that two or three patch posting iterations ago, when you
> >>> suggested it, and it didn't work. It has to be done at the pool level;
> >>> I think that was due to swap handling.
> >>
> >> When you do it at the pool level the swap/shrink handling is broken as well, just not for amdgpu.
> >>
> >> See xe_bo_shrink() and drivers/gpu/drm/xe/xe_shrinker.c on how XE does it.
> >
> > I've read all of that, but I don't think it needs changing yet, though
> > I do think I probably need to do a bit more work on the ttm
> > backup/restore paths to account things, but again we suffer from the
> > what happens if your cgroup runs out of space on a restore path,
> > similar to eviction.
>
> My thinking was rather that because of this we do it at the resource level and keep memory accounted to whoever allocated it even if it's backed up or swapped out.
>
> > Blocking the problems we can solve now on the problems we've no idea
> > how to solve means nobody gets experience with solving anything.
>
> Well that's exactly the reason why I'm suggesting this. Ignoring swapping/backup for now seems to make things much easier.

It makes things easier now, but when we do have to solve swapping, step
one will be moving all of this code back to roughly what I have now and
starting from there.

This just raises the bar to solving the next problem.

We need to find incremental approaches to getting all the pieces of
the puzzle solved, or else we will still be here in 10 years.

The steps I've formulated are below (none of them is perfect, but they
all seem better than the status quo):

1. Add global counters for pages - at least then we can see things in
vmstat and in the per-node stats.
2. Add NUMA awareness to the pool LRU - we can remove our own NUMA code
and align with the core kernel; probably doesn't help anything on its own.
3. Add memcg awareness to the pool and the pool shrinker (rough sketch
after this list).
    If you are on an APU with no swap configured, you have a much better time.
    If you are on a dGPU, or an APU with swap, you have a moderately
better time, but I can't see you having a worse time.
4. Look into tt-level swapping and see how to integrate that LRU with
NUMA/memcg awareness.
    In theory we can do better than allocated_pages tracking (I'd like
to burn that down, since it seems at odds with memcg).
5. Look into xe swapping and see if we can integrate that with
NUMA/memcg better.
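
To make step 3 a bit more concrete, here is roughly the shape I have in
mind for pool-level charging. This is only an illustrative sketch: the
ttm_pool_charge_pages()/ttm_pool_uncharge_pages() helpers are made-up
names, and it leans on the generic obj_cgroup_charge()/
obj_cgroup_uncharge() byte APIs rather than whatever GPU-specific
accounting the final series ends up using.

#include <linux/gfp.h>
#include <linux/memcontrol.h>

/*
 * Illustrative sketch only: charge/uncharge a pool allocation against
 * the obj_cgroup stored in the ttm_tt.  tt->objcg would be looked up
 * from the allocating task when the TT is created (e.g. via
 * get_obj_cgroup_from_current()) and dropped when the TT is destroyed.
 */
static int ttm_pool_charge_pages(struct obj_cgroup *objcg, unsigned int order)
{
	if (!objcg)
		return 0;

	/* obj_cgroup_charge() accounts in bytes. */
	return obj_cgroup_charge(objcg, GFP_KERNEL, PAGE_SIZE << order);
}

static void ttm_pool_uncharge_pages(struct obj_cgroup *objcg, unsigned int order)
{
	if (!objcg)
		return;

	obj_cgroup_uncharge(objcg, PAGE_SIZE << order);
}

The idea is that the pool allocation path charges right next to the
actual page allocation (and fails the allocation if the cgroup is over
its limit), while the shrinker/free path uncharges, so the accounting
stays tied to the pool, which is where the pages actually live.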

So the question I really want answered when I'm submitting patches
isn't "what does this not fix or not make better?", but "does this
actively make anything worse than the status quo, and is it heading in
a consistent direction towards solving the problem?"

Accounting at the resource level makes some things better, but having
implemented it, I don't believe it is consistent with solving the
overall problem.

Dave.


