[PATCH RFC v4 14/16] drm, cgroup: Introduce lgpu as DRM cgroup resource

Wed Oct 9 15:23:31 UTC 2019

On Wed, Oct 09, 2019 at 11:08:45AM -0400, Kenny Ho wrote:
> Hi Daniel,
> 
> Can you elaborate what you mean in more details?  The goal of lgpu is
> to provide the ability to subdivide a GPU device and give those slices
> to different users as needed.  I don't think there is anything
> controversial or vendor specific here as requests for this are well
> documented.  The underlying representation is just a bitmap, which is
> neither unprecedented nor vendor specific (bitmap is used in cpuset
> for instance.)
> 
> An implementation of this abstraction is not hardware specific either.
> For example, one can associate a virtual function in SRIOV as a lgpu.
> Alternatively, a device can also declare to have 100 lgpus and treat
> the lgpu quantity as a percentage representation of GPU subdivision.
> The fact that an abstraction works well with a vendor implementation
> does not make it a "prettification" of a vendor feature (by this
> logic, I hope you are not implying an abstraction is only valid if it
> does not work with amd CU masking because that seems fairly partisan.)
> 
> Did I misread your characterization of this patch?

Scenario: I'm a gpgpu customer, and I type some gpgpu program (probably in
cude, transpiled for amd using rocm).

How does the stuff I'm seeing in cuda (or vk compute, or whatever) map to
the bitmasks I can set in this cgroup controller?

That's the stuff which this spec needs to explain. Currently the answer is
"amd CU masking", and that's not going to work on e.g. nvidia hw. We need
to come up with end-user relevant resources/meanings for these bits which
works across vendors.

On cpu a "cpu core" is rather well-defined, and customers actually know
what it means on intel, amd, ibm powerpc or arm. Both on the program side
(e.g. what do I need to stuff into relevant system calls to run on a
specific "cpu core") and on the admin side.

We need to achieve the same for gpus. "it's a bitmask" is not even close
enough imo.
-Daniel

> 
> Regards,
> Kenny
> 
> 
> On Wed, Oct 9, 2019 at 6:31 AM Daniel Vetter <daniel at ffwll.ch> wrote:
> >
> > On Tue, Oct 08, 2019 at 06:53:18PM +0000, Kuehling, Felix wrote:
> > > On 2019-08-29 2:05 a.m., Kenny Ho wrote:
> > > > drm.lgpu
> > > >          A read-write nested-keyed file which exists on all cgroups.
> > > >          Each entry is keyed by the DRM device's major:minor.
> > > >
> > > >          lgpu stands for logical GPU, it is an abstraction used to
> > > >          subdivide a physical DRM device for the purpose of resource
> > > >          management.
> > > >
> > > >          The lgpu is a discrete quantity that is device specific (i.e.
> > > >          some DRM devices may have 64 lgpus while others may have 100
> > > >          lgpus.)  The lgpu is a single quantity with two representations
> > > >          denoted by the following nested keys.
> > > >
> > > >            =====     ========================================
> > > >            count     Representing lgpu as anonymous resource
> > > >            list      Representing lgpu as named resource
> > > >            =====     ========================================
> > > >
> > > >          For example:
> > > >          226:0 count=256 list=0-255
> > > >          226:1 count=4 list=0,2,4,6
> > > >          226:2 count=32 list=32-63
> > > >
> > > >          lgpu is represented by a bitmap and uses the bitmap_parselist
> > > >          kernel function so the list key input format is a
> > > >          comma-separated list of decimal numbers and ranges.
> > > >
> > > >          Consecutively set bits are shown as two hyphen-separated decimal
> > > >          numbers, the smallest and largest bit numbers set in the range.
> > > >          Optionally each range can be postfixed to denote that only parts
> > > >          of it should be set.  The range will divided to groups of
> > > >          specific size.
> > > >          Syntax: range:used_size/group_size
> > > >          Example: 0-1023:2/256 ==> 0,1,256,257,512,513,768,769
> > > >
> > > >          The count key is the hamming weight / hweight of the bitmap.
> > > >
> > > >          Both count and list accept the max and default keywords.
> > > >
> > > >          Some DRM devices may only support lgpu as anonymous resources.
> > > >          In such case, the significance of the position of the set bits
> > > >          in list will be ignored.
> > > >
> > > >          This lgpu resource supports the 'allocation' resource
> > > >          distribution model.
> > > >
> > > > Change-Id: I1afcacf356770930c7f925df043e51ad06ceb98e
> > > > Signed-off-by: Kenny Ho <Kenny.Ho at amd.com>
> > >
> > > The description sounds reasonable to me and maps well to the CU masking
> > > feature in our GPUs.
> > >
> > > It would also allow us to do more coarse-grained masking for example to
> > > guarantee balanced allocation of CUs across shader engines or
> > > partitioning of memory bandwidth or CP pipes (if that is supported by
> > > the hardware/firmware).
> >
> > Hm, so this sounds like the definition for how this cgroup is supposed to
> > work is "amd CU masking" (whatever that exactly is). And the abstract
> > description is just prettification on top, but not actually the real
> > definition you guys want.
> >
> > I think adding a cgroup which is that much depending upon the hw
> > implementation of the first driver supporting it is not a good idea.
> > -Daniel
> >
> > >
> > > I can't comment on the code as I'm unfamiliar with the details of the
> > > cgroup code.
> > >
> > > Acked-by: Felix Kuehling <Felix.Kuehling at amd.com>
> > >
> > >
> > > > ---
> > > >   Documentation/admin-guide/cgroup-v2.rst |  46 ++++++++
> > > >   include/drm/drm_cgroup.h                |   4 +
> > > >   include/linux/cgroup_drm.h              |   6 ++
> > > >   kernel/cgroup/drm.c                     | 135 ++++++++++++++++++++++++
> > > >   4 files changed, 191 insertions(+)
> > > >
> > > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > > > index 87a195133eaa..57f18469bd76 100644
> > > > --- a/Documentation/admin-guide/cgroup-v2.rst
> > > > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > > > @@ -1958,6 +1958,52 @@ DRM Interface Files
> > > >     Set largest allocation for /dev/dri/card1 to 4MB
> > > >     echo "226:1 4m" > drm.buffer.peak.max
> > > >
> > > > +  drm.lgpu
> > > > +   A read-write nested-keyed file which exists on all cgroups.
> > > > +   Each entry is keyed by the DRM device's major:minor.
> > > > +
> > > > +   lgpu stands for logical GPU, it is an abstraction used to
> > > > +   subdivide a physical DRM device for the purpose of resource
> > > > +   management.
> > > > +
> > > > +   The lgpu is a discrete quantity that is device specific (i.e.
> > > > +   some DRM devices may have 64 lgpus while others may have 100
> > > > +   lgpus.)  The lgpu is a single quantity with two representations
> > > > +   denoted by the following nested keys.
> > > > +
> > > > +     =====     ========================================
> > > > +     count     Representing lgpu as anonymous resource
> > > > +     list      Representing lgpu as named resource
> > > > +     =====     ========================================
> > > > +
> > > > +   For example:
> > > > +   226:0 count=256 list=0-255
> > > > +   226:1 count=4 list=0,2,4,6
> > > > +   226:2 count=32 list=32-63
> > > > +
> > > > +   lgpu is represented by a bitmap and uses the bitmap_parselist
> > > > +   kernel function so the list key input format is a
> > > > +   comma-separated list of decimal numbers and ranges.
> > > > +
> > > > +   Consecutively set bits are shown as two hyphen-separated decimal
> > > > +   numbers, the smallest and largest bit numbers set in the range.
> > > > +   Optionally each range can be postfixed to denote that only parts
> > > > +   of it should be set.  The range will divided to groups of
> > > > +   specific size.
> > > > +   Syntax: range:used_size/group_size
> > > > +   Example: 0-1023:2/256 ==> 0,1,256,257,512,513,768,769
> > > > +
> > > > +   The count key is the hamming weight / hweight of the bitmap.
> > > > +
> > > > +   Both count and list accept the max and default keywords.
> > > > +
> > > > +   Some DRM devices may only support lgpu as anonymous resources.
> > > > +   In such case, the significance of the position of the set bits
> > > > +   in list will be ignored.
> > > > +
> > > > +   This lgpu resource supports the 'allocation' resource
> > > > +   distribution model.
> > > > +
> > > >   GEM Buffer Ownership
> > > >   ~~~~~~~~~~~~~~~~~~~~
> > > >
> > > > diff --git a/include/drm/drm_cgroup.h b/include/drm/drm_cgroup.h
> > > > index 6d9707e1eb72..a8d6be0b075b 100644
> > > > --- a/include/drm/drm_cgroup.h
> > > > +++ b/include/drm/drm_cgroup.h
> > > > @@ -6,6 +6,7 @@
> > > >
> > > >   #include <linux/cgroup_drm.h>
> > > >   #include <linux/workqueue.h>
> > > > +#include <linux/types.h>
> > > >   #include <drm/ttm/ttm_bo_api.h>
> > > >   #include <drm/ttm/ttm_bo_driver.h>
> > > >
> > > > @@ -28,6 +29,9 @@ struct drmcg_props {
> > > >     s64                     mem_highs_default[TTM_PL_PRIV+1];
> > > >
> > > >     struct work_struct      *mem_reclaim_wq[TTM_PL_PRIV];
> > > > +
> > > > +   int                     lgpu_capacity;
> > > > +        DECLARE_BITMAP(lgpu_slots, MAX_DRMCG_LGPU_CAPACITY);
> > > >   };
> > > >
> > > >   #ifdef CONFIG_CGROUP_DRM
> > > > diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h
> > > > index c56cfe74d1a6..7b1cfc4ce4c3 100644
> > > > --- a/include/linux/cgroup_drm.h
> > > > +++ b/include/linux/cgroup_drm.h
> > > > @@ -14,6 +14,8 @@
> > > >   /* limit defined per the way drm_minor_alloc operates */
> > > >   #define MAX_DRM_DEV (64 * DRM_MINOR_RENDER)
> > > >
> > > > +#define MAX_DRMCG_LGPU_CAPACITY 256
> > > > +
> > > >   enum drmcg_mem_bw_attr {
> > > >     DRMCG_MEM_BW_ATTR_BYTE_MOVED, /* for calulating 'instantaneous' bw */
> > > >     DRMCG_MEM_BW_ATTR_ACCUM_US,  /* for calulating 'instantaneous' bw */
> > > > @@ -32,6 +34,7 @@ enum drmcg_res_type {
> > > >     DRMCG_TYPE_MEM_PEAK,
> > > >     DRMCG_TYPE_BANDWIDTH,
> > > >     DRMCG_TYPE_BANDWIDTH_PERIOD_BURST,
> > > > +   DRMCG_TYPE_LGPU,
> > > >     __DRMCG_TYPE_LAST,
> > > >   };
> > > >
> > > > @@ -58,6 +61,9 @@ struct drmcg_device_resource {
> > > >     s64                     mem_bw_stats[__DRMCG_MEM_BW_ATTR_LAST];
> > > >     s64                     mem_bw_limits_bytes_in_period;
> > > >     s64                     mem_bw_limits_avg_bytes_per_us;
> > > > +
> > > > +   s64                     lgpu_used;
> > > > +   DECLARE_BITMAP(lgpu_allocated, MAX_DRMCG_LGPU_CAPACITY);
> > > >   };
> > > >
> > > >   /**
> > > > diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
> > > > index 0ea7f0619e25..18c4368e2c29 100644
> > > > --- a/kernel/cgroup/drm.c
> > > > +++ b/kernel/cgroup/drm.c
> > > > @@ -9,6 +9,7 @@
> > > >   #include <linux/cgroup_drm.h>
> > > >   #include <linux/ktime.h>
> > > >   #include <linux/kernel.h>
> > > > +#include <linux/bitmap.h>
> > > >   #include <drm/drm_file.h>
> > > >   #include <drm/drm_drv.h>
> > > >   #include <drm/ttm/ttm_bo_api.h>
> > > > @@ -52,6 +53,9 @@ static char const *mem_bw_attr_names[] = {
> > > >   #define MEM_BW_LIMITS_NAME_AVG "avg_bytes_per_us"
> > > >   #define MEM_BW_LIMITS_NAME_BURST "bytes_in_period"
> > > >
> > > > +#define LGPU_LIMITS_NAME_LIST "list"
> > > > +#define LGPU_LIMITS_NAME_COUNT "count"
> > > > +
> > > >   static struct drmcg *root_drmcg __read_mostly;
> > > >
> > > >   static int drmcg_css_free_fn(int id, void *ptr, void *data)
> > > > @@ -115,6 +119,10 @@ static inline int init_drmcg_single(struct drmcg *drmcg, struct drm_device *dev)
> > > >     for (i = 0; i <= TTM_PL_PRIV; i++)
> > > >             ddr->mem_highs[i] = dev->drmcg_props.mem_highs_default[i];
> > > >
> > > > +   bitmap_copy(ddr->lgpu_allocated, dev->drmcg_props.lgpu_slots,
> > > > +                   MAX_DRMCG_LGPU_CAPACITY);
> > > > +   ddr->lgpu_used = bitmap_weight(ddr->lgpu_allocated, MAX_DRMCG_LGPU_CAPACITY);
> > > > +
> > > >     mutex_unlock(&dev->drmcg_mutex);
> > > >     return 0;
> > > >   }
> > > > @@ -280,6 +288,14 @@ static void drmcg_print_limits(struct drmcg_device_resource *ddr,
> > > >                             MEM_BW_LIMITS_NAME_AVG,
> > > >                             ddr->mem_bw_limits_avg_bytes_per_us);
> > > >             break;
> > > > +   case DRMCG_TYPE_LGPU:
> > > > +           seq_printf(sf, "%s=%lld %s=%*pbl\n",
> > > > +                           LGPU_LIMITS_NAME_COUNT,
> > > > +                           ddr->lgpu_used,
> > > > +                           LGPU_LIMITS_NAME_LIST,
> > > > +                           dev->drmcg_props.lgpu_capacity,
> > > > +                           ddr->lgpu_allocated);
> > > > +           break;
> > > >     default:
> > > >             seq_puts(sf, "\n");
> > > >             break;
> > > > @@ -314,6 +330,15 @@ static void drmcg_print_default(struct drmcg_props *props,
> > > >                             MEM_BW_LIMITS_NAME_AVG,
> > > >                             props->mem_bw_avg_bytes_per_us_default);
> > > >             break;
> > > > +   case DRMCG_TYPE_LGPU:
> > > > +           seq_printf(sf, "%s=%d %s=%*pbl\n",
> > > > +                           LGPU_LIMITS_NAME_COUNT,
> > > > +                           bitmap_weight(props->lgpu_slots,
> > > > +                                   props->lgpu_capacity),
> > > > +                           LGPU_LIMITS_NAME_LIST,
> > > > +                           props->lgpu_capacity,
> > > > +                           props->lgpu_slots);
> > > > +           break;
> > > >     default:
> > > >             seq_puts(sf, "\n");
> > > >             break;
> > > > @@ -407,9 +432,21 @@ static void drmcg_value_apply(struct drm_device *dev, s64 *dst, s64 val)
> > > >     mutex_unlock(&dev->drmcg_mutex);
> > > >   }
> > > >
> > > > +static void drmcg_lgpu_values_apply(struct drm_device *dev,
> > > > +           struct drmcg_device_resource *ddr, unsigned long *val)
> > > > +{
> > > > +
> > > > +   mutex_lock(&dev->drmcg_mutex);
> > > > +   bitmap_copy(ddr->lgpu_allocated, val, MAX_DRMCG_LGPU_CAPACITY);
> > > > +   ddr->lgpu_used = bitmap_weight(ddr->lgpu_allocated, MAX_DRMCG_LGPU_CAPACITY);
> > > > +   mutex_unlock(&dev->drmcg_mutex);
> > > > +}
> > > > +
> > > >   static void drmcg_nested_limit_parse(struct kernfs_open_file *of,
> > > >             struct drm_device *dev, char *attrs)
> > > >   {
> > > > +   DECLARE_BITMAP(tmp_bitmap, MAX_DRMCG_LGPU_CAPACITY);
> > > > +   DECLARE_BITMAP(chk_bitmap, MAX_DRMCG_LGPU_CAPACITY);
> > > >     enum drmcg_res_type type =
> > > >             DRMCG_CTF_PRIV2RESTYPE(of_cft(of)->private);
> > > >     struct drmcg *drmcg = css_to_drmcg(of_css(of));
> > > > @@ -501,6 +538,83 @@ static void drmcg_nested_limit_parse(struct kernfs_open_file *of,
> > > >                             continue;
> > > >                     }
> > > >                     break; /* DRMCG_TYPE_MEM */
> > > > +           case DRMCG_TYPE_LGPU:
> > > > +                   if (strncmp(sname, LGPU_LIMITS_NAME_LIST, 256) &&
> > > > +                           strncmp(sname, LGPU_LIMITS_NAME_COUNT, 256) )
> > > > +                           continue;
> > > > +
> > > > +                        if (!strcmp("max", sval) ||
> > > > +                                   !strcmp("default", sval)) {
> > > > +                           if (parent != NULL)
> > > > +                                   drmcg_lgpu_values_apply(dev, ddr,
> > > > +                                           parent->dev_resources[minor]->
> > > > +                                           lgpu_allocated);
> > > > +                           else
> > > > +                                   drmcg_lgpu_values_apply(dev, ddr,
> > > > +                                           props->lgpu_slots);
> > > > +
> > > > +                           continue;
> > > > +                   }
> > > > +
> > > > +                   if (strncmp(sname, LGPU_LIMITS_NAME_COUNT, 256) == 0) {
> > > > +                           p_max = parent == NULL ? props->lgpu_capacity:
> > > > +                                   bitmap_weight(
> > > > +                                   parent->dev_resources[minor]->
> > > > +                                   lgpu_allocated, props->lgpu_capacity);
> > > > +
> > > > +                           rc = drmcg_process_limit_s64_val(sval,
> > > > +                                   false, p_max, p_max, &val);
> > > > +
> > > > +                           if (rc || val < 0) {
> > > > +                                   drmcg_pr_cft_err(drmcg, rc, cft_name,
> > > > +                                                   minor);
> > > > +                                   continue;
> > > > +                           }
> > > > +
> > > > +                           bitmap_zero(tmp_bitmap,
> > > > +                                           MAX_DRMCG_LGPU_CAPACITY);
> > > > +                           bitmap_set(tmp_bitmap, 0, val);
> > > > +                   }
> > > > +
> > > > +                   if (strncmp(sname, LGPU_LIMITS_NAME_LIST, 256) == 0) {
> > > > +                           rc = bitmap_parselist(sval, tmp_bitmap,
> > > > +                                           MAX_DRMCG_LGPU_CAPACITY);
> > > > +
> > > > +                           if (rc) {
> > > > +                                   drmcg_pr_cft_err(drmcg, rc, cft_name,
> > > > +                                                   minor);
> > > > +                                   continue;
> > > > +                           }
> > > > +
> > > > +                           bitmap_andnot(chk_bitmap, tmp_bitmap,
> > > > +                                   props->lgpu_slots,
> > > > +                                   MAX_DRMCG_LGPU_CAPACITY);
> > > > +
> > > > +                           if (!bitmap_empty(chk_bitmap,
> > > > +                                           MAX_DRMCG_LGPU_CAPACITY)) {
> > > > +                                   drmcg_pr_cft_err(drmcg, 0, cft_name,
> > > > +                                                   minor);
> > > > +                                   continue;
> > > > +                           }
> > > > +                   }
> > > > +
> > > > +
> > > > +                        if (parent != NULL) {
> > > > +                           bitmap_and(chk_bitmap, tmp_bitmap,
> > > > +                           parent->dev_resources[minor]->lgpu_allocated,
> > > > +                           props->lgpu_capacity);
> > > > +
> > > > +                           if (bitmap_empty(chk_bitmap,
> > > > +                                           props->lgpu_capacity)) {
> > > > +                                   drmcg_pr_cft_err(drmcg, 0,
> > > > +                                                   cft_name, minor);
> > > > +                                   continue;
> > > > +                           }
> > > > +                   }
> > > > +
> > > > +                   drmcg_lgpu_values_apply(dev, ddr, tmp_bitmap);
> > > > +
> > > > +                   break; /* DRMCG_TYPE_LGPU */
> > > >             default:
> > > >                     break;
> > > >             } /* switch (type) */
> > > > @@ -606,6 +720,7 @@ static ssize_t drmcg_limit_write(struct kernfs_open_file *of, char *buf,
> > > >                     break;
> > > >             case DRMCG_TYPE_BANDWIDTH:
> > > >             case DRMCG_TYPE_MEM:
> > > > +           case DRMCG_TYPE_LGPU:
> > > >                     drmcg_nested_limit_parse(of, dm->dev, sattr);
> > > >                     break;
> > > >             default:
> > > > @@ -731,6 +846,20 @@ struct cftype files[] = {
> > > >             .private = DRMCG_CTF_PRIV(DRMCG_TYPE_BANDWIDTH,
> > > >                                             DRMCG_FTYPE_DEFAULT),
> > > >     },
> > > > +   {
> > > > +           .name = "lgpu",
> > > > +           .seq_show = drmcg_seq_show,
> > > > +           .write = drmcg_limit_write,
> > > > +           .private = DRMCG_CTF_PRIV(DRMCG_TYPE_LGPU,
> > > > +                                           DRMCG_FTYPE_LIMIT),
> > > > +   },
> > > > +   {
> > > > +           .name = "lgpu.default",
> > > > +           .seq_show = drmcg_seq_show,
> > > > +           .flags = CFTYPE_ONLY_ON_ROOT,
> > > > +           .private = DRMCG_CTF_PRIV(DRMCG_TYPE_LGPU,
> > > > +                                           DRMCG_FTYPE_DEFAULT),
> > > > +   },
> > > >     { }     /* terminate */
> > > >   };
> > > >
> > > > @@ -744,6 +873,10 @@ struct cgroup_subsys drm_cgrp_subsys = {
> > > >
> > > >   static inline void drmcg_update_cg_tree(struct drm_device *dev)
> > > >   {
> > > > +        bitmap_zero(dev->drmcg_props.lgpu_slots, MAX_DRMCG_LGPU_CAPACITY);
> > > > +        bitmap_fill(dev->drmcg_props.lgpu_slots,
> > > > +                   dev->drmcg_props.lgpu_capacity);
> > > > +
> > > >     /* init cgroups created before registration (i.e. root cgroup) */
> > > >     if (root_drmcg != NULL) {
> > > >             struct cgroup_subsys_state *pos;
> > > > @@ -800,6 +933,8 @@ void drmcg_device_early_init(struct drm_device *dev)
> > > >     for (i = 0; i <= TTM_PL_PRIV; i++)
> > > >             dev->drmcg_props.mem_highs_default[i] = S64_MAX;
> > > >
> > > > +   dev->drmcg_props.lgpu_capacity = MAX_DRMCG_LGPU_CAPACITY;
> > > > +
> > > >     drmcg_update_cg_tree(dev);
> > > >   }
> > > >   EXPORT_SYMBOL(drmcg_device_early_init);
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch