[rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
Christian König
christian.koenig at amd.com
Fri May 23 07:58:58 UTC 2025
Hi Tejun,
First of all, thanks to Johannes and you for the input; it took me quite some time to actually get a grip on your concern here.
On 5/22/25 21:51, Tejun Heo wrote:
> Hello,
>
> On Sat, May 17, 2025 at 06:25:02AM +1000, Dave Airlie wrote:
>> I think this is where we have 2 options:
>> (a) moving this stuff into core mm and out of shrinker context
>> (b) fix our shrinker to be cgroup aware and solve that first.
>>
>> The main question I have for Christian, is can you give me a list of
>> use cases that this will seriously negatively effect if we proceed
>> with (b).
>
> This thread seems to have gone a bit haywire and we may be losing some
> context. I'm not sure not doing (b) is an option for acceptable isolation. I
> think Johannes already raised the issue but please consider the following
> scenario:
>
> - There's a GPU workload which uses a sizable amount of system memory for
> the pool being discussed in this thread. This GPU workload is very
> important, so we want to make sure that other activities in the system
> don't bother it. We give it plenty of isolated CPUs and protect its memory
> with high enough memory.low.
That situation simply doesn't happen. You see, isolation is *not* a requirement for the pool.
It's basically the opposite: we use the pool to break the isolation between processes.
> - Because most CPUs are largely idling while GPU is busy, there are plenty
> of CPU cycles which can be used without impacting the GPU workload, so we
> decide to do some data preprocessing which involves scanning large data
> set creating memory usage which is mostly streaming but still has enough
> look backs to promote them in the LRU lists.
>
> IIUC, in the shared pool model, the GPU memory which isn't currently being
> used would sit outside the cgroup, and thus outside the protection of
> memory.low. Again, IIUC, you want to make this pool priority reclaimed
> because reclaiming is nearly free and you don't want to create undue
> pressure on other reclaimable resources.
>
> However, what would happen in the above scenario under such implementation
> is that the GPU workload would keep losing its memory pool to the background
> memory pressure created by the streaming memory usage. It's also easy to
> expand on scenarios like this with other GPU workloads with differing
> priorities and memory allotments and so on.
Yes, and that is absolutely intentional.
That the GPU loses its memory because of pressure on the CPU side is exactly what should happen.
There is no requirement whatsoever to avoid that.
> There may be some basic misunderstanding here. If a resource is worth
> caching, that usually indicates that there's some significant cost
> associated with un-caching the resource. It doesn't matter whether that cost
> is on the creation or destruction path. Here, the alloc path is expensive
> and free path is nearly free. However, this doesn't mean that we can get
> free isolation while bunching them together for immediate reclaim as others
> would be able to force you into alloc operations that you wouldn't need
> otherwise. If someone else can make you pay for something that you otherwise
> wouldn't, that resource is not isolated.
Correct, but exactly that is intentional.
You see, the submission model of GPUs is best effort, i.e. we don't guarantee any performance isolation between processes whatsoever. If we wanted to start doing that we would need to re-design the HW.
We don't even have standard scheduling functionality; for example, we can't say that process X has used N cycles, because that isn't even precisely known to the kernel.
If we made the pool per cgroup we would lose the ability to quickly allocate those temporary throw-away buffers which are mainly used for special HW use cases.
The important use cases are:
1. Sharing freed-up memory between otherwise unrelated applications.
Imagine you alt+tab or swipe between applications (which can be in different cgroups): the old application frees up its write-combined buffer and the new application re-allocates the same pages for its own write-combined buffer (see the sketch below).
2. Switching use cases.
E.g. a developer first runs some benchmarks which use a bunch of write-combined memory and then starts gcc in the background (which of course can be in a different cgroup than the desktop), and that needs this additional memory.
That it is in theory possible that, during the few milliseconds of such a switch, some CPU load could start to steal pages from the pool is completely irrelevant. For the second use case it is actually desirable.
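To make the cost asymmetry concrete, here is a minimal, self-contained C sketch of the idea (this is not the real TTM pool code, all names are made up for illustration): the first allocation pays the expensive caching-attribute change, every later allocation just pops an already write-combined page off the shared free list no matter which process or cgroup freed it, and freeing is nearly free.

#include <stdio.h>
#include <stdlib.h>

/* Conceptual model only, not the real TTM pool. A "wc_page" stands
 * for a page whose caching attribute was already changed to
 * write-combined (the expensive part). */
struct wc_page {
	struct wc_page *next;
	void *mem;
};

static struct wc_page *global_pool;	/* shared by all processes/cgroups */

/* Stand-in for the caching attribute change, which flushes caches/TLBs
 * on real hardware and is the reason the alloc path is expensive. */
static void make_write_combined(void *mem)
{
	(void)mem;	/* expensive on real HW, a no-op in this sketch */
}

static struct wc_page *wc_page_alloc(void)
{
	struct wc_page *p = global_pool;

	if (p) {			/* fast path: reuse a prepared page */
		global_pool = p->next;
		return p;
	}

	/* slow path: pay for the attribute change */
	p = malloc(sizeof(*p));
	if (!p)
		return NULL;
	p->mem = malloc(4096);
	make_write_combined(p->mem);
	return p;
}

static void wc_page_free(struct wc_page *p)
{
	/* nearly free: just put the page back, keeping the WC attribute */
	p->next = global_pool;
	global_pool = p;
}

int main(void)
{
	/* application A frees its buffer, application B (possibly in a
	 * different cgroup) immediately reuses the same prepared page */
	struct wc_page *a = wc_page_alloc();
	wc_page_free(a);
	struct wc_page *b = wc_page_alloc();	/* hits the fast path */
	printf("reused: %s\n", a == b ? "yes" : "no");
	wc_page_free(b);
	return 0;
}

The real pool is of course more involved (per-order lists, shrinker integration, etc.), but the fast/slow path split above is the whole point: making global_pool per cgroup would turn application B's fast path back into the slow path.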
If we made this pool per cgroup we would actually need to add counter-measures, e.g. a separate cgroup controller which says how much memory each cgroup can hog and how much in total, and especially timeouts after which the hogging cgroup would be forced to give its pool memory back to the core memory management.
But I clearly don't see a use case for that.
Regards,
Christian.
>
> Thanks.
>