[rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
Christian König
christian.koenig at amd.com
Mon May 26 08:19:43 UTC 2025
Hi Tejun,
On 5/23/25 19:06, Tejun Heo wrote:
> Hello, Christian.
>
> On Fri, May 23, 2025 at 09:58:58AM +0200, Christian König wrote:
> ...
>>> - There's a GPU workload which uses a sizable amount of system memory for
>>> the pool being discussed in this thread. This GPU workload is very
>>> important, so we want to make sure that other activities in the system
>>> don't bother it. We give it plenty of isolated CPUs and protect its memory
>>> with high enough memory.low.
>>
>> That situation simply doesn't happen. See isolation is *not* a requirement
>> for the pool.
> ...
>> See the submission model of GPUs is best effort. E.g. you don't guarantee
>> any performance isolation between processes whatsoever. If we would start
>> to do this we would need to start re-designing the HW.
>
> This is a radical claim. Let's table the rest of the discussion for now. I
> don't know enough to tell whether this claim is true or not, but for this to
> be true, the following should be true:
>
>     Whether the GPU memory pool is reclaimed or not doesn't have noticeable
>     implications on GPU performance.
>
> Is this true?
Yes, that is true. Today GPUs need this memory for correctness, not for performance any more.
The performance improvements we saw with this approach 15 or 20 years ago are negligible by today's standards.
It's just that Windows still offers the functionality today, and when you bring up hardware on Linux you sometimes run into problems and find that the engineers who designed the hardware/firmware relied on having this.
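For context, here is a minimal sketch (my illustration, not code from this series) of how a TTM-based driver typically ends up in the pool under discussion: the caching mode passed to ttm_tt_init() decides whether the backing pages come from the cached, write-combined or uncached pool, and only the latter two need the page attribute changes the pool exists to amortize. The helper name and error handling are hypothetical; the ttm_tt_init() signature is the one from recent kernels and may differ on older ones.

/*
 * Hypothetical helper, for illustration only: asking for write-combined
 * backing pages routes the allocation through the TTM WC page pool
 * discussed in this thread.
 */
#include <linux/slab.h>
#include <drm/ttm/ttm_bo.h>
#include <drm/ttm/ttm_tt.h>

static struct ttm_tt *example_create_wc_tt(struct ttm_buffer_object *bo)
{
	struct ttm_tt *tt = kzalloc(sizeof(*tt), GFP_KERNEL);

	if (!tt)
		return NULL;

	/* ttm_write_combined (or ttm_uncached) selects the pooled path */
	if (ttm_tt_init(tt, bo, 0, ttm_write_combined, 0)) {
		kfree(tt);
		return NULL;
	}

	return tt;
}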
> As for the scenario that I described above, I didn't just come up with it.
> I'm only supporting from the system side, but that's based on what our ML folks
> are doing right now. We have a bunch of large machines with multiple GPUs
> running ML workloads. The workloads can run for a long time spread across
> many machines and they synchronize frequently, so any performance drop on
> one GPU lowers utilization on all involved GPUs, which can number in the three
> digits. For example, any scheduling disturbance on the submitting thread
> propagates through the whole cluster and slows down all involved GPUs.
For the HPC/ML use case this feature is completely irrelevant. ROCm, CUDA, OpenCL, OpenMP etc. don't even expose something like this in their higher-level APIs as far as I know.
Where this matters is things like scanout on certain laptops, digital rights management in cloud gaming, and hacks for getting high-end GPUs to work on ARM boards (e.g. the Raspberry Pi).
> Also, because these machines are large on the CPU and memory sides too and
> aren't doing a whole lot other than managing the GPUs, people want to put a
> significant amount of CPU work on them, which can easily create at least
> moderate memory pressure. Is the claim that the write-combined memory pool
> doesn't have any meaningful impact on the GPU workload performance?
When the memory pool is active on such systems, I would strongly advise questioning why it is used in the first place.
The main business reason we still need it today is cloud gaming, and for that particular use case you absolutely do want to share the pool between cgroups, otherwise the whole use case breaks.
Regards,
Christian.
>
> Thanks.
>