[rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)

Fri May 23 17:06:53 UTC 2025

Hello, Christian.

On Fri, May 23, 2025 at 09:58:58AM +0200, Christian König wrote:
...
> > - There's a GPU workload which uses a sizable amount of system memory for
> >   the pool being discussed in this thread. This GPU workload is very
> >   important, so we want to make sure that other activities in the system
> >   don't bother it. We give it plenty of isolated CPUs and protect its memory
> >   with high enough memory.low.
> 
> That situation simply doesn't happen. See isolation is *not* a requirement
> for the pool.
...
> See the submission model of GPUs is best effort. E.g. you don't guarantee
> any performance isolation between processes whatsoever. If we would start
> to do this we would need to start re-designing the HW.

This is a radical claim. Let's table the rest of the discussion for now. I
don't know enough to tell whether this claim is true or not, but for this to
be true, the following should be true:

 Whether the GPU memory pool is reclaimed or not doesn't have noticeable
 performance implications on the GPU performance.

Is this true?

As for the scenario that I described above, I didn't just come up with it.
I'm only supporting from system side but that's based on what our ML folks
are doing right now. We have a bunch of lage machines with multiple GPUs
running ML workloads. The workloads can run for a long time spread across
many machines and they synchronize frequently, so any performance drop on
one GPU lowers utiliization on all involved GPUs which can go up to three
digits. For example, any scheduling disturbances on the submitting thread
propagates through the whole cluster and slows down all involved GPUs.

Also, because these machines are large on the CPU and memory sides too and
aren't doing whole lot other than managing the GPUs, people want to put on a
significant amount of CPU work on them which can easily create at least
moderate memory pressure. Is the claim that the combined write memory pool
doesn't have any meaningful impact on the GPU workload performance?

Thanks.

-- 
tejun