[PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

Tue Sep 3 19:48:22 UTC 2019

On Tue, Sep 3, 2019 at 8:50 PM Tejun Heo <tj at kernel.org> wrote:
>
> Hello, Daniel.
>
> On Tue, Sep 03, 2019 at 09:55:50AM +0200, Daniel Vetter wrote:
> > > * While breaking up and applying control to different types of
> > >   internal objects may seem attractive to folks who work day in and
> > >   day out with the subsystem, they aren't all that useful to users and
> > >   the siloed controls are likely to make the whole mechanism a lot
> > >   less useful.  We had the same problem with cgroup1 memcg - putting
> > >   control of different uses of memory under separate knobs.  It made
> > >   the whole thing pretty useless.  e.g. if you constrain all knobs
> > >   tight enough to control the overall usage, overall utilization
> > >   suffers, but if you don't, you really don't have control over actual
> > >   usage.  For memcg, what has to be allocated and controlled is
> > >   physical memory, no matter how they're used.  It's not like you can
> > >   go buy more "socket" memory.  At least from the looks of it, I'm
> > >   afraid gpu controller is repeating the same mistakes.
> >
> > We do have quite a pile of different memories and ranges, so I don't
> > thinkt we're doing the same mistake here. But it is maybe a bit too
>
> I see.  One thing which caught my eyes was the system memory control.
> Shouldn't that be controlled by memcg?  Is there something special
> about system memory used by gpus?

I think system memory separate from vram makes sense. For one, vram is
like 10x+ faster than system memory, so we definitely want to have
good control on that. But maybe we only want one vram bucket overall
for the entire system?

The trouble with system memory is that gpu tasks pin that memory to
prep execution. There's two solutions:
- i915 has a shrinker. Lots (and I really mean lots) of pain with
direct reclaim recursion, which often means we can't free memory, and
we're angering the oom killer a lot. Plus it introduces real bad
latency spikes everywhere (gpu workloads are occasionally really slow,
think "worse than pageout to spinning rust" to get memory freed).
- ttm just has a global limit, set to 50% of system memory.

I do think a global system memory limit to tame the shrinker, without
the ttm approach of possible just wasting half your memory, could be
useful.

> > complicated, and exposes stuff that most users really don't care about.
>
> Could be from me not knowing much about gpus but definitely looks too
> complex to me.  I don't see how users would be able to alloate, vram,
> system memory and GART with reasonable accuracy.  memcg on cgroup2
> deals with just single number and that's already plenty challenging.

Yeah, especially wrt GART and some of the other more specialized
things I don't think there's any modern gpu were you can actually run
out of that stuff. At least not before you run out of every other kind
of memory (GART is just a remapping table to make system memory
visible to the gpu).

I'm also not sure of the bw limits, given all the fun we have on the
block io cgroups side. Aside from that the current bw limit only
controls the bw the kernel uses, userspace can submit unlimited
amounts of copying commands that use the same pcie links directly to
the gpu, bypassing this cg knob. Also, controlling execution time for
gpus is very tricky, since they work a lot more like a block io device
or maybe a network controller with packet scheduling, than a cpu.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch