[PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

Daniel Vetter daniel at ffwll.ch
Fri Sep 6 15:34:16 UTC 2019


On Fri, Sep 6, 2019 at 5:23 PM Tejun Heo <tj at kernel.org> wrote:
>
> Hello, Daniel.
>
> On Tue, Sep 03, 2019 at 09:48:22PM +0200, Daniel Vetter wrote:
> > I think system memory separate from vram makes sense. For one, vram is
> > like 10x+ faster than system memory, so we definitely want to have
> > good control on that. But maybe we only want one vram bucket overall
> > for the entire system?
> >
> > The trouble with system memory is that gpu tasks pin that memory to
> > prep execution. There are two solutions:
> > - i915 has a shrinker. Lots (and I really mean lots) of pain with
> > direct reclaim recursion, which often means we can't free memory, and
> > we're angering the oom killer a lot. Plus it introduces really bad
> > latency spikes everywhere (gpu workloads are occasionally really slow,
> > think "worse than pageout to spinning rust" to get memory freed).
> > - ttm just has a global limit, set to 50% of system memory.
> >
> > I do think a global system memory limit to tame the shrinker, without
> > the ttm approach of possibly just wasting half your memory, could be
> > useful.
>
> Hmm... what'd be the fundamental difference from slab or socket memory
> which are handled through memcg?  Does system memory used by GPUs have
> further global restrictions in addition to the amount of physical
> memory used?

Sometimes, but those would be specific resources (kinda like vram),
e.g. CMA regions used by a gpu. But probably not something you'd run
in a datacenter and want cgroups for ...

I guess we could try to integrate with the memcg controller. One
trouble is that aside from i915 most gpu drivers do not really have a
full shrinker, so I'm not sure how that would all integrate.

The overall gpu memory controller would still be outside of memcg I
think, since that would include swapped-out gpu objects, and stuff in
special memory regions like vram.
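
To make the memory side a bit more concrete, here's a very rough
sketch of what the charge/uncharge path of such a gpu memory cgroup
could look like - all the names and the "gpu.memory.max" knob are made
up for illustration, this is not Kenny's actual series. The idea is to
charge at pin/allocation time against a per-cgroup limit, instead of
relying on a shrinker or on ttm's fixed 50% cap:

/* Hypothetical sketch only, not the RFC's actual structures or uAPI. */
#include <linux/atomic.h>
#include <linux/errno.h>
#include <linux/types.h>

struct gpucg_pool {
	atomic64_t usage;	/* bytes currently pinned/resident */
	u64 limit;		/* set via a cgroupfs knob, e.g. gpu.memory.max */
};

/* Charge @size bytes before pinning the backing storage. */
static int gpucg_try_charge(struct gpucg_pool *pool, u64 size)
{
	u64 new_usage = atomic64_add_return(size, &pool->usage);

	if (new_usage > pool->limit) {
		atomic64_sub(size, &pool->usage);
		return -ENOMEM;	/* caller may evict/shrink and retry */
	}
	return 0;
}

/* Undo the charge when the buffer is unpinned or destroyed. */
static void gpucg_uncharge(struct gpucg_pool *pool, u64 size)
{
	atomic64_sub(size, &pool->usage);
}

A hard charge at allocation time like this is one way to tame the
shrinker, as discussed above; what exactly gets charged for
swapped-out objects is the open question.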

> > I'm also not sure of the bw limits, given all the fun we have on the
> > block io cgroups side. Aside from that, the current bw limit only
> > controls the bw the kernel uses, userspace can submit unlimited
> > amounts of copying commands that use the same pcie links directly to
> > the gpu, bypassing this cg knob. Also, controlling execution time for
> > gpus is very tricky, since they work a lot more like a block io device
> > or maybe a network controller with packet scheduling, than a cpu.
>
> At the system level, it just gets folded into cpu time, which isn't
> perfect but is usually a good enough approximation of compute related
> dynamic resources.  Can gpu do something similar or at least start with
> that?

So generally there's a pile of engines, often of different types (e.g.
amd hw has an entire pile of copy engines), with some ill-defined
sharing characteristics for some (often compute/render engines use the
same shader cores underneath), kinda like hyperthreading. At that
level of detail it's all extremely hw specific, and probably too hard
to control in a useful way for users. And I'm not sure we can really
do a reasonable knob for overall gpu usage: e.g. if we include all the
copy engines, but the workloads are only running on compute engines,
then you might only get 10% overall utilization by engine-time, while
the shaders (which are most of the chip area/power consumption) are
actually at 100%. On top of that, with many userspace apis those
engines are an internal implementation detail of a more abstract gpu
device (e.g. opengl), but with others (like vulkan) this is all fully
exposed.
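
To spell out that mismatch with made-up numbers (the engine count and
the sampling window are purely illustrative, not any specific hw or
existing interface):

/* Toy model of the accounting problem, not a real driver interface. */
#include <stdio.h>

int main(void)
{
	const double window_ms = 1000.0;	/* sampling window */
	const int num_engines = 10;		/* 1 compute + 9 copy/other */
	/* busy time per engine: compute engine flat out, rest idle */
	double busy_ms[10] = { 1000.0 };

	double total_busy = 0.0;
	for (int i = 0; i < num_engines; i++)
		total_busy += busy_ms[i];

	/* naive knob: average busy time across all engines -> 10% */
	printf("by engine-time: %.0f%%\n",
	       100.0 * total_busy / (window_ms * num_engines));

	/* but the shader cores behind that compute engine are at 100% */
	printf("shader cores:   %.0f%%\n", 100.0 * busy_ms[0] / window_ms);
	return 0;
}

So a single "gpu usage" number can be wildly off depending on which
engines you fold into it.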

Plus the kernel needs to use at least the copy engines for vram
management itself, and you really can't take that away, although Kenny
here has a proposal for a separate cgroup resource just for that.

I just think it's all a bit too ill-defined, and we might be better
off nailing the memory side first and getting some real-world
experience with this stuff. For context, there's not even a
cross-driver standard for how priorities are handled; that's all
driver-specific interfaces.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
