[RFC] Generic cgroup controller for the gpu/drm subsystem

Matt Roper matthew.d.roper at intel.com
Thu Nov 1 23:02:02 UTC 2018


+dri-devel list since a lot of the relevant audience is on that list.

On Mon, Oct 29, 2018 at 07:49:13PM -0400, Kenny Ho wrote:
> (Resending in plain text)
> 
> Hi,
> 
> We are thinking of using cgroup to manage resources in GPUs.  I
> believe Matt Roper from Intel has been trying to do something similar.
> From previous discussions
> (https://www.spinics.net/lists/cgroups/msg18687.html), the cgroup
> framework maintainers appear not to want a full-fledged cgroup
> controller for Matt's use case, but I am not sure I understood the
> rationale.
> It's also unclear to me if our use case matches Matt's.  We are hoping
> to have a better understanding of the situation before embarking on a
> path that may ultimately be unacceptable to upstream.  To that end, I
> will outline our (AMD) use case at a high level and perhaps folks on
> this list can give some suggestions?
> 
> Our use case comes from the world of data centers and clusters.
> Currently we have a rudimentary mechanism to expose GPUs to a
> container cluster running Kubernetes
> (https://github.com/RadeonOpenCompute/k8s-device-plugin) but it only
> exposes whole GPUs.  That means multiple containers cannot share the
> same GPU.  A well-established way to share a GPU is to use
> SR-IOV/virtualization, but that shares the GPU in time slices.
> 
> An alternative is to share the GPU by its constituents.  Perhaps a
> good way to think about this is to treat the GPU like a mini-computer.
> A GPU has memory (VRAM) and it also has compute units (but instead of
> 10s of cores, it has 100s to 1000s of shaders/CUs).  So we can potentially
> share a GPU by those two dimensions.  Similar to a computer, a GPU
> also has specialized hardware so we can potentially share those
> separately as well.
> 
> Unlike a computer, however, GPUs are not as well "standardized" as a
> desktop or a server.  For the gpu/drm subsystem, there are some things
> that are common (such as buffer sharing and buffer lifetime
> management), some things that are shared by only some vendors (a
> software scheduler), and some things that are very much vendor
> specific.  Due to this, a generic cgroup controller for drm may need to
> be more pluggable than other cgroup controllers.  We took a look at the
> rdma cgroup as part of our research, but rdma appears to have resources
> that are more abstracted and standardized.
> 
> What do you think?  Does drm/gpu warrant its own full-fledged cgroup controller?
> 
> Regards,
> Kenny Ho

Hi Kenny.  My drm+cgroups work from earlier this year has been on pause
since I got pulled away to focus on some other, higher-priority tasks.
What I was working on previously still has value to
various parts of Intel, so I do plan to return to it eventually if
nobody else jumps in first; I'm just not sure exactly when I'll have
time to get back to it.

In general, there are several areas where gpu and drm subsystem behavior
could interact in some way with cgroup membership.  Some aspects of
graphics behavior would be a good match for controlling via a true
cgroup controller, whereas others probably make more sense to add as
driver or drm core interfaces that just pay attention to the cgroup
membership of a process.

A real cgroup controller is probably what we'd want to use for concepts
that map well to the hierarchical structure of cgroups and that can be
handled via one of the four models described in the "Resource
Distribution Models" section of Documentation/admin-guide/cgroup-v2.rst.
Off the top of my head, the graphics concepts that seem like a good
match for this are listed below (a rough sketch of the controller
plumbing follows the list):

 * GPU memory management - At a high level, memory management fits the
   cgroup controller model well, but there are a lot of implementation
   details that would need to be agreed upon before someone starts
   writing a controller for this.  The way GPU memory gets allocated and
   shared between processes adds complexity to how you do the
   accounting, as does the diversity in types and levels of GPU memory
   supported by different vendors' GPUs (especially differences between
   what "GPU memory" even means on discrete vs. integrated graphics).
   
 * GPU time (fair scheduler) - If you want to partition execution time
   on a GPU, a cgroup controller is a good match for that.

 * GPU engine/EU partitioning - I'm not familiar with the details of the
   specific hardware you're focusing on, but based on your description
   above, it sounds like it gives you a lot of flexibility to slice up
   your GPU execution units and submit independent workloads to
   arbitrary subsets of them?  If that's true, a cgroup controller could
   be used to balance how many EUs various cgroups have access to, or to
   reserve dedicated subsets of EUs for the processes in specific
   cgroups to help provide QoS guarantees.  I don't think most of the
   hardware I work with is nearly that flexible at the EU level, but
   even on simpler hardware designs, a cgroup controller could probably
   partition access to the higher-level execution engines (e.g., it
   would be possible to specify that processes from a specific part of
   the cgroup hierarchy are the only ones with any access to the media
   engine).
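
To make the controller side a bit more concrete, here's a very rough
sketch of what the cgroup-v2 plumbing for something like this could
look like.  To be clear, this is purely illustrative: the gpucg_* names,
the "gpu.weight" interface file, and the assumption of a SUBSYS(gpu)
entry in include/linux/cgroup_subsys.h are all made up for the example,
and none of the hard questions (memory accounting, shared buffers,
vendor differences) are addressed by it.

/*
 * Hypothetical skeleton only.  Assumes a SUBSYS(gpu) entry has been
 * added to include/linux/cgroup_subsys.h; all gpucg_* names and the
 * interface files are invented for illustration.
 */
#include <linux/cgroup.h>
#include <linux/err.h>
#include <linux/slab.h>

struct gpucg {
	struct cgroup_subsys_state	css;
	u64				weight;	/* shows up as gpu.weight */
};

static inline struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css)
{
	return container_of(css, struct gpucg, css);
}

static struct cgroup_subsys_state *
gpucg_css_alloc(struct cgroup_subsys_state *parent_css)
{
	struct gpucg *gpucg;

	gpucg = kzalloc(sizeof(*gpucg), GFP_KERNEL);
	if (!gpucg)
		return ERR_PTR(-ENOMEM);

	gpucg->weight = 100;	/* default weight, as cpu.weight does */
	return &gpucg->css;
}

static void gpucg_css_free(struct cgroup_subsys_state *css)
{
	kfree(css_to_gpucg(css));
}

static u64 gpucg_weight_read(struct cgroup_subsys_state *css,
			     struct cftype *cft)
{
	return css_to_gpucg(css)->weight;
}

static int gpucg_weight_write(struct cgroup_subsys_state *css,
			      struct cftype *cft, u64 val)
{
	if (!val || val > 10000)
		return -EINVAL;
	css_to_gpucg(css)->weight = val;
	return 0;
}

static struct cftype gpucg_files[] = {
	{
		.name		= "weight",
		.flags		= CFTYPE_NOT_ON_ROOT,
		.read_u64	= gpucg_weight_read,
		.write_u64	= gpucg_weight_write,
	},
	/* a "memory.max"-style limit file would follow the same pattern */
	{ }	/* terminator */
};

struct cgroup_subsys gpu_cgrp_subsys = {
	.css_alloc	= gpucg_css_alloc,
	.css_free	= gpucg_css_free,
	.dfl_cftypes	= gpucg_files,
};

The boilerplate above is the easy part; the real work is deciding what
gets charged or scheduled against a given css and when (e.g., which
process a shared buffer object gets accounted to), which is where the
implementation details mentioned above come in.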

On the other hand, some graphics concepts don't really care about the
overall cgroup hierarchy, but would like to make decisions based on
traits that have been assigned to the specific, individual cgroup a
process belongs to.  The specific use case I was working on before was
an example of this --- GPU priority in a system with a strictly
priority-based (non-fair, starvation allowed) scheduler.  While GPU
priority shares some similarity with the "GPU time" example I gave
above, priority itself isn't a resource that gets distributed the same
way that "GPU time" is.  The priority for any individual cgroup is
completely unrelated to the priority for any other cgroup, and the
cgroup's position in the hierarchy isn't interesting.  The consensus
when we discussed this before was that concepts like GPU priority (which
are more just about tagging groups of processes with a setting/value)
are better handled in the DRM subsystem itself.
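
For what it's worth, the driver-side lookup for that kind of
tag-a-cgroup-with-a-value interface can stay pretty simple.  Purely as
an illustrative sketch (the drm_cgroup_get_priority() helper and the
driver-maintained cgroup-to-priority table below are assumptions, not
an existing interface; how the value gets attached to a cgroup in the
first place is exactly the part that was under discussion):

/*
 * Illustrative only: look up a GPU priority previously associated with
 * a cgroup by some hypothetical DRM core or driver interface.  No such
 * upstream interface exists today.
 */
#include <linux/cgroup.h>
#include <linux/rculist.h>
#include <linux/sched.h>

struct drm_cgroup_tag {
	struct list_head	node;
	struct cgroup		*cgrp;		/* key: cgroup on the default hierarchy */
	int			gpu_priority;	/* value set via the (hypothetical) interface */
};

static LIST_HEAD(drm_cgroup_tags);	/* updates serialized elsewhere; readers use RCU */

/* Called from the driver's context-creation or submission path. */
static int drm_cgroup_get_priority(struct task_struct *task, int default_prio)
{
	struct drm_cgroup_tag *tag;
	struct cgroup *cgrp;
	int prio = default_prio;

	rcu_read_lock();
	cgrp = task_dfl_cgroup(task);	/* task's cgroup on the v2 hierarchy */
	list_for_each_entry_rcu(tag, &drm_cgroup_tags, node) {
		if (tag->cgrp == cgrp) {
			prio = tag->gpu_priority;
			break;
		}
	}
	rcu_read_unlock();

	return prio;
}

Since the cgroup's position in the hierarchy doesn't matter here, the
whole thing reduces to a per-cgroup key/value lookup, which is why it
fits better as a DRM interface than as a controller.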


Matt

-- 
Matt Roper
Graphics Software Engineer
IoTG Platform Enabling & Development
Intel Corporation
(916) 356-2795

