[PATCH v2 00/11] new cgroup controller for gpu/drm subsystem

Tue Apr 14 13:26:45 UTC 2020

On Tue, Apr 14, 2020 at 3:14 PM Kenny Ho <y2kenny at gmail.com> wrote:
>
> Ok.  I was hoping you can clarify the contradiction between the
> existance of the spec below and your "not something any other gpu can
> reasonably support" statement.  I mean, OneAPI is Intel's spec and
> doesn't that at least make SubDevice support "reasonable" for one more
> vendor?
>
> Partisanship aside, as a drm co-maintainer, do you really not see the
> need for non-work-conserving way of distributing GPU as a resource?
> You recognized the latencies involved (although that's really just
> part of the story... time sharing is never going to be good enough
> even if your switching cost is zero.)  As a drm co-maintainer, are you
> suggesting GPU has no place in the HPC use case?

 So I did chat with people and my understanding for how this subdevice
stuff works is roughly, from least to most fine grained support:
- Not possible at all, hw doesn't have any such support
- The hw is actually not a single gpu, but a bunch of chips behind a
magic bridge/interconnect, and there's a scheduler load-balancing
stuff and you can't actually run on all "cores" in parallel with one
compute/3d job. So subdevices just give you some of these cores, but
from client api pov they're exactly as powerful as the full device. So
this kinda works like assigning an entire NUMA node, including all the
cpu cores and memory bandwidth and everything.
- Hw has multiple "engines" which share resources (like compute cores
or whatever) behind the scenes. There's no control over how this
sharing works really, and whether you have guarantees about minimal
execution resources or not. This kinda works like hyperthreading.
- Then finally we have the CU mask thing amdgpu has. Which works like
what you're proposing, works on amd.

So this isn't something that I think we should standardize in a
resource management framework like cgroups. Because it's a complete
mess. Note that _all_ the above things (including the "no subdevices"
one) are valid implementations of "subdevices" in the various specs.

Now on your question on "why was this added to various standards?"
because opencl has that too (and the rocm thing, and everything else
it seems). What I heard is that a few people pushed really hard, and
no one objected hard enough (because not having subdevices is a
standards compliant implementation), so that's why it happened. Just
because it's in various standards doesn't mean that a) it's actually
standardized in a useful fashion and b) something we should just
blindly adopt.

Also like where exactly did you understand that I'm against gpus in
HPC uses cases. Approaching this in a slightly less tribal way would
really, really help to get something landed (which I'd like to see
happen, personally). Always spinning this as an Intel vs AMD thing
like you do here with every reply really doesn't help moving this in.

So yeah stricter isolation is something customers want, it's just not
something we can really give out right now at a level below the
device.
-Daniel

>
> Regards,
> Kenny
>
> On Tue, Apr 14, 2020 at 8:52 AM Daniel Vetter <daniel at ffwll.ch> wrote:
> >
> > On Tue, Apr 14, 2020 at 2:47 PM Kenny Ho <y2kenny at gmail.com> wrote:
> > > On Tue, Apr 14, 2020 at 8:20 AM Daniel Vetter <daniel at ffwll.ch> wrote:
> > > > My understanding from talking with a few other folks is that
> > > > the cpumask-style CU-weight thing is not something any other gpu can
> > > > reasonably support (and we have about 6+ of those in-tree)
> > >
> > > How does Intel plan to support the SubDevice API as described in your
> > > own spec here:
> > > https://spec.oneapi.com/versions/0.7/oneL0/core/INTRO.html#subdevice-support
> >
> > I can't talk about whether future products might or might not support
> > stuff and in what form exactly they might support stuff or not support
> > stuff. Or why exactly that's even in the spec there or not.
> >
> > Geez
> > -Daniel
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > +41 (0) 79 365 57 48 - http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch