[rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)

Thu May 22 19:51:10 UTC 2025

Hello,

On Sat, May 17, 2025 at 06:25:02AM +1000, Dave Airlie wrote:
> I think this is where we have 2 options:
> (a) moving this stuff into core mm and out of shrinker context
> (b) fix our shrinker to be cgroup aware and solve that first.
> 
> The main question I have for Christian, is can you give me a list of
> use cases that this will seriously negatively effect if we proceed
> with (b).

This thread seems to have gone a bit haywire and we may be losing some
context. I'm not sure that skipping (b) is an option if we want acceptable
isolation. I think Johannes already raised the issue but please consider the
following scenario:

- There's a GPU workload which uses a sizable amount of system memory for
  the pool being discussed in this thread. This GPU workload is very
  important, so we want to make sure that other activities in the system
  don't bother it. We give it plenty of isolated CPUs and protect its memory
  with a high enough memory.low.

- Because most CPUs are largely idling while the GPU is busy, there are plenty
  of CPU cycles which can be used without impacting the GPU workload, so we
  decide to do some data preprocessing. This involves scanning a large data
  set, which creates memory usage that is mostly streaming but still has
  enough look-backs to get its pages promoted in the LRU lists.

IIUC, in the shared pool model, the GPU memory which isn't currently being
used would sit outside the cgroup, and thus outside the protection of
memory.low. Again, IIUC, you want this pool to be reclaimed with priority
because reclaiming it is nearly free and you don't want to create undue
pressure on other reclaimable resources.

However, what would happen in the above scenario under such an implementation
is that the GPU workload would keep losing its memory pool to the background
memory pressure created by the streaming memory usage. It's also easy to
expand on scenarios like this with other GPU workloads with differing
priorities and memory allotments and so on.
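
To make the scenario concrete, here's a toy model in Python. This is not
kernel code. The function name and all the numbers are made up for
illustration, and it assumes that memory.low only protects pages charged to
the cgroup and that there's enough unprotected memory elsewhere to satisfy
the pressure.

```python
# Toy model, not kernel code. All names and numbers are hypothetical.

def idle_pool_after_reclaim(idle_pool_pages, charged_to_cgroup,
                            usage, memory_low, pressure_pages):
    """Return how many idle pool pages survive one round of reclaim.

    idle_pool_pages:   pool pages the GPU workload isn't using right now
    charged_to_cgroup: True if the pool is charged to the workload's cgroup
    usage:             the cgroup's charged usage in pages
    memory_low:        the cgroup's memory.low in pages
    pressure_pages:    pages the background pressure wants to reclaim
    """
    if charged_to_cgroup and usage <= memory_low:
        # Within memory.low: assume reclaim is satisfied from elsewhere.
        return idle_pool_pages
    # Unprotected: either the pool sits outside the cgroup or the cgroup
    # is above memory.low, so the pool is reclaimed first.
    return max(0, idle_pool_pages - pressure_pages)


shared = idle_pool_after_reclaim(1000, False, 5000, 8000, 400)
charged = idle_pool_after_reclaim(1000, True, 5000, 8000, 400)
print(shared, charged)
```

With the same memory.low and the same background pressure, the only thing
that changes the outcome is whether the idle pool is charged to the cgroup.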

There may be some basic misunderstanding here. If a resource is worth
caching, that usually indicates that there's some significant cost
associated with un-caching the resource. It doesn't matter whether that cost
is on the creation or destruction path. Here, the alloc path is expensive
and the free path is nearly free. However, this doesn't mean that we can get
free isolation while bunching them together for immediate reclaim, as others
would then be able to force you into alloc operations that you wouldn't
otherwise need. If someone else can make you pay for something that you
otherwise wouldn't, that resource is not isolated.
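
The cost asymmetry can be sketched the same way. Again, this is a toy model
with hypothetical names and made-up per-page costs, assuming the workload
refills its pool to the same target every round:

```python
# Toy cost model, not kernel code. Costs are hypothetical, in microseconds.
ALLOC_COST_US = 50  # assumed cost of allocating one pool page
FREE_COST_US = 0    # freeing a pool page is modeled as free


def induced_cost(rounds, pool_target, stolen_per_round):
    """Return (victim_cost, reclaimer_cost) over the given number of rounds.

    Each round, outside pressure takes up to stolen_per_round pages from
    the pool and the workload allocates them again to get back to
    pool_target.
    """
    victim_cost = 0
    reclaimer_cost = 0
    for _ in range(rounds):
        stolen = min(pool_target, stolen_per_round)
        reclaimer_cost += stolen * FREE_COST_US  # reclaim side pays nothing
        victim_cost += stolen * ALLOC_COST_US    # workload pays to re-alloc
    return victim_cost, reclaimer_cost


print(induced_cost(10, 1000, 400))
print(induced_cost(10, 1000, 0))
```

The reclaimer's cost stays at zero no matter how much it takes, while the
workload's cost scales with how much others take from it.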

Thanks.

-- 
tejun