[PATCH RFC v4 16/16] drm/amdgpu: Integrate with DRM cgroup

Mon Dec 2 22:05:41 UTC 2019

> -----Original Message-----
> From: Kenny Ho <y2kenny at gmail.com>
> Sent: Friday, November 29, 2019 12:00 AM
> 
> Reducing audience since this is AMD specific.
> 
> On Tue, Oct 8, 2019 at 3:11 PM Kuehling, Felix <Felix.Kuehling at amd.com> wrote:
> >
> > On 2019-08-29 2:05 a.m., Kenny Ho wrote:
> > > The number of logical gpu (lgpu) is defined to be the number of
> > > compute unit (CU) for a device.  The lgpu allocation limit only
> > > applies to compute workload for the moment (enforced via kfd queue
> > > creation.)  Any cu_mask update is validated against the availability
> > > of the compute unit as defined by the drmcg the kfd process belongs to.
> >
> > There is something missing here. There is an API for the application
> > to specify a CU mask. Right now it looks like the
> > application-specified and CGroup-specified CU masks would clobber each
> > other. Instead the two should be merged.
> >
> > The CGroup-specified mask should specify a subset of CUs available for
> > application-specified CU masks. When the cgroup CU mask changes, you'd
> > need to take any application-specified CU masks into account before
> > updating the hardware.
> The idea behind the current implementation is to give the sysadmin priority over the user application (as that is the definition of a control
> group.)  A mask specified by the application/user is validated by pqm_drmcg_lgpu_validate and rejected with EACCES if it is not
> compatible.  The alternative is to ignore the difference and have the kernel guess/redistribute the assignment, but I am not sure this
> is a good approach since there is not enough information to allow the kernel to guess the user's intention correctly and consistently.  (This
> is based on multiple conversations with you and Joe that led me to believe there are situations where spreading CU assignment across
> multiple SEs is a good thing, but not always.)
> 
> If the cgroup-specified mask is changed after the application has set the mask, the intersection of the two masks will be set instead.  It
> is possible to have no intersection and in this case no CU is made available to the application (just like the possibility for memcgroup to
> starve the amount of memory needed by an application.)

I don't disagree with forcing a user to work within an lgpu's allocation. But there are two minor problems here:

1) We will need a way for the process to query what the lgpu's bitmap looks like. You and Felix are somewhat discussing this below, but I don't think the KFD's "number of CUs" topology information is sufficient. I can know I have 32 CUs, but I don't know which 32 bits in the bitmask are turned on. Yet your code in pqm_drmcg_lgpu_validate() requires the requested CU mask to be a subset of the lgpu's bitmap. A user needs to know which bits are on in the lgpu for this to work.
2) Even if we have a query API, do we have an easy way to prevent a data race? Do we care? For instance, if I query the existing lgpu bitmap and then try to set a CU mask on a subset of that, it's possible that the lgpu will change between the query and the set. That would make the set fail with EACCES; maybe that's good enough (you can just retry in a loop until it succeeds?)

Do empty CU masks actually work? This seems like something we would want to avoid. It could happen fairly often if someone does something like:
* lgpu with half the CUs enabled
* User sets a mask to use half of those CUs
* lgpu is changed to enable only the other half of the CUs --> the intersection is now empty, so the user's mask is fully destroyed and everything dies. :\
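
The scenario above, worked through on a toy 8-CU device. effective_mask() is the same AND the patch performs with bitmap_and(). resolve_mask() is purely my assumption of one possible policy (fall back to the whole lgpu when the intersection is empty) and is not something the patch does.

```c
#include <assert.h>
#include <stdint.h>

/* What the patch computes when the cgroup changes: user mask AND lgpu. */
static uint8_t effective_mask(uint8_t user_mask, uint8_t lgpu_mask)
{
	return (uint8_t)(user_mask & lgpu_mask);
}

/* Hypothetical alternative policy (not in the patch): never hand the
 * hardware an empty mask; fall back to everything the lgpu allows. */
static uint8_t resolve_mask(uint8_t user_mask, uint8_t lgpu_mask)
{
	uint8_t m = effective_mask(user_mask, lgpu_mask);

	/* The fallback only helps if the cgroup grants at least one CU. */
	assert(lgpu_mask != 0);

	return m ? m : lgpu_mask;
}
```

With lgpu = 0x0f and a user mask of 0x03, the effective mask is 0x03. Once the lgpu flips to 0xf0, the plain intersection is 0x00, while the fallback policy would give the queue 0xf0. The fallback silently discards the user's placement, so it has its own downside.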

> > The KFD topology APIs report the number of available CUs to the
> > application. CGroups would change that number at runtime and
> > applications would not expect that. I think the best way to deal with
> > that would be to have multiple bits in the application-specified CU
> > mask map to the same CU. How to do that in a fair way is not obvious.
> > I guess a more coarse-grain division of the GPU into LGPUs would make
> > this somewhat easier.
> Another possibility is to add namespace to the topology sysfs such that the correct number of CUs changes accordingly.  Although that
> wouldn't give the user the available mask that is made available by this implementation via the cgroup sysfs.  Another possibility is to
> modify the thunk similar to what was done for device cgroup (device
> re-mapping.)

I'd vote for a set of mask query APIs in the Thunk: one for the process's current CU mask, and one for a queue's current CU mask. We have a setter API already. Since the KFD topology information is also mirrored in sysfs, I would worry that a process would see different KFD topology information depending on whether it queries the Thunk (which would show the lgpu's number of CUs) or reads sysfs (which would show the GPU's number of CUs).

As mentioned above, the KFD "num CUs" is insufficient for knowing how to set the CU bitmask, so I don't think we should rely on it in this case. IMO, KFD topology should describe the real hardware regardless of how cgroups are limiting things. I'm willing to be told this is a bad idea, though.
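
Roughly what I have in mind for the query side, as a sketch only. None of these names exist in the Thunk; the sketch_ prefix marks them all as hypothetical, and the two getters are declarations only. The helpers show why a count alone is not enough: two masks can both have 32 CUs enabled and still disagree on which bits a caller is allowed to set.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_CU_WORDS 4	/* arbitrary cap for this sketch */

/* Hypothetical result type for a mask query. */
struct sketch_cu_mask {
	uint32_t num_bits;			/* valid bits in words[] */
	uint32_t words[SKETCH_MAX_CU_WORDS];	/* bit i = CU i usable */
};

/* Hypothetical getters (declarations only, nothing implements them). */
int sketch_get_process_cu_mask(struct sketch_cu_mask *out);
int sketch_get_queue_cu_mask(uint64_t queue_id, struct sketch_cu_mask *out);

/* Is CU `bit` enabled in the mask? */
static int sketch_cu_mask_test(const struct sketch_cu_mask *m, uint32_t bit)
{
	if (bit >= m->num_bits)
		return 0;
	return (m->words[bit / 32] >> (bit % 32)) & 1u;
}

/* Number of enabled CUs: the only thing a "num CUs" value can tell you. */
static unsigned int sketch_cu_mask_weight(const struct sketch_cu_mask *m)
{
	unsigned int n = 0;

	assert(m->num_bits <= SKETCH_MAX_CU_WORDS * 32);

	for (uint32_t bit = 0; bit < m->num_bits; bit++)
		n += (unsigned int)sketch_cu_mask_test(m, bit);
	return n;
}
```

For example, a 64-bit mask with words {0xffff0000, 0x0000ffff} and one with words {0xffffffff, 0} both have a weight of 32, but CU 0 is only usable in the second.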

> > How is this problem handled for CPU cores and the interaction with CPU
> > pthread_setaffinity_np?
> Per the documentation of pthread_setaffinity_np, "If the call is successful, and the thread is not currently running on one of the CPUs
> in cpuset, then it is migrated to one of those CPUs."
> http://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html
>
> Regards,
> Kenny
> 
> 
> 
> > Regards,
> >    Felix
> >
> >
> > >
> > > Change-Id: I69a57452c549173a1cd623c30dc57195b3b6563e
> > > Signed-off-by: Kenny Ho <Kenny.Ho at amd.com>
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |   4 +
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  21 +++
> > >   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      |   6 +
> > >   drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |   3 +
> > >   .../amd/amdkfd/kfd_process_queue_manager.c    | 140 ++++++++++++++++++
> > >   5 files changed, 174 insertions(+)
> > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> > > index 55cb1b2094fd..369915337213 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> > > @@ -198,6 +198,10 @@ uint8_t amdgpu_amdkfd_get_xgmi_hops_count(struct kgd_dev *dst, struct kgd_dev *s
> > >               valid;                                                  \
> > >       })
> > >
> > > +int amdgpu_amdkfd_update_cu_mask_for_process(struct task_struct *task,
> > > +             struct amdgpu_device *adev, unsigned long *lgpu_bitmap,
> > > +             unsigned int nbits);
> > > +
> > >   /* GPUVM API */
> > >   int amdgpu_amdkfd_gpuvm_create_process_vm(struct kgd_dev *kgd, unsigned int pasid,
> > > >                                       void **vm, void **process_info,
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > index 163a4fbf0611..8abeffdd2e5b 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > @@ -1398,9 +1398,29 @@ amdgpu_get_crtc_scanout_position(struct drm_device *dev, unsigned int pipe,
> > >   static void amdgpu_drmcg_custom_init(struct drm_device *dev,
> > >       struct drmcg_props *props)
> > >   {
> > > +     struct amdgpu_device *adev = dev->dev_private;
> > > +
> > > +     props->lgpu_capacity = adev->gfx.cu_info.number;
> > > +
> > >       props->limit_enforced = true;
> > >   }
> > >
> > > +static void amdgpu_drmcg_limit_updated(struct drm_device *dev,
> > > +             struct task_struct *task, struct drmcg_device_resource *ddr,
> > > +             enum drmcg_res_type res_type) {
> > > +     struct amdgpu_device *adev = dev->dev_private;
> > > +
> > > +     switch (res_type) {
> > > +     case DRMCG_TYPE_LGPU:
> > > +             amdgpu_amdkfd_update_cu_mask_for_process(task, adev,
> > > +                        ddr->lgpu_allocated, dev->drmcg_props.lgpu_capacity);
> > > +             break;
> > > +     default:
> > > +             break;
> > > +     }
> > > +}
> > > +
> > >   static struct drm_driver kms_driver = {
> > >       .driver_features =
> > > >           DRIVER_USE_AGP | DRIVER_ATOMIC |
> > > > @@ -1438,6 +1458,7 @@ static struct drm_driver kms_driver = {
> > >       .gem_prime_mmap = amdgpu_gem_prime_mmap,
> > >
> > >       .drmcg_custom_init = amdgpu_drmcg_custom_init,
> > > +     .drmcg_limit_updated = amdgpu_drmcg_limit_updated,
> > >
> > >       .name = DRIVER_NAME,
> > >       .desc = DRIVER_DESC,
> > > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> > > index 138c70454e2b..fa765b803f97 100644
> > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> > > @@ -450,6 +450,12 @@ static int kfd_ioctl_set_cu_mask(struct file *filp, struct kfd_process *p,
> > >               return -EFAULT;
> > >       }
> > >
> > > +     if (!pqm_drmcg_lgpu_validate(p, args->queue_id, properties.cu_mask, cu_mask_size)) {
> > > +             pr_debug("CU mask not permitted by DRM Cgroup");
> > > +             kfree(properties.cu_mask);
> > > +             return -EACCES;
> > > +     }
> > > +
> > >       mutex_lock(&p->mutex);
> > >
> > > >       retval = pqm_set_cu_mask(&p->pqm, args->queue_id, &properties);
> > > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > > index 8b0eee5b3521..88881bec7550 100644
> > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > > @@ -1038,6 +1038,9 @@ int pqm_get_wave_state(struct process_queue_manager *pqm,
> > >                      u32 *ctl_stack_used_size,
> > >                      u32 *save_area_used_size);
> > >
> > > +bool pqm_drmcg_lgpu_validate(struct kfd_process *p, int qid, u32 *cu_mask,
> > > +             unsigned int cu_mask_size);
> > > +
> > >   int amdkfd_fence_wait_timeout(unsigned int *fence_addr,
> > >                               unsigned int fence_value,
> > > >                               unsigned int timeout_ms);
> > > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> > > index 7e6c3ee82f5b..a896de290307 100644
> > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> > > @@ -23,9 +23,11 @@
> > >
> > >   #include <linux/slab.h>
> > >   #include <linux/list.h>
> > > +#include <linux/cgroup_drm.h>
> > >   #include "kfd_device_queue_manager.h"
> > >   #include "kfd_priv.h"
> > >   #include "kfd_kernel_queue.h"
> > > +#include "amdgpu.h"
> > >   #include "amdgpu_amdkfd.h"
> > >
> > > >   static inline struct process_queue_node *get_queue_by_qid(
> > > > @@ -167,6 +169,7 @@ static int create_cp_queue(struct process_queue_manager *pqm,
> > >                               struct queue_properties *q_properties,
> > >                               struct file *f, unsigned int qid)
> > >   {
> > > +     struct drmcg *drmcg;
> > >       int retval;
> > >
> > > >       /* Doorbell initialized in user space*/
> > > > @@ -180,6 +183,36 @@ static int create_cp_queue(struct process_queue_manager *pqm,
> > >       if (retval != 0)
> > >               return retval;
> > >
> > > +
> > > +     drmcg = drmcg_get(pqm->process->lead_thread);
> > > +     if (drmcg) {
> > > +             struct amdgpu_device *adev;
> > > +             struct drmcg_device_resource *ddr;
> > > +             int mask_size;
> > > +             u32 *mask;
> > > +
> > > +             adev = (struct amdgpu_device *) dev->kgd;
> > > +
> > > +             mask_size = adev->ddev->drmcg_props.lgpu_capacity;
> > > +             mask = kzalloc(sizeof(u32) * round_up(mask_size, 32),
> > > +                             GFP_KERNEL);
> > > +
> > > +             if (!mask) {
> > > +                     drmcg_put(drmcg);
> > > +                     uninit_queue(*q);
> > > +                     return -ENOMEM;
> > > +             }
> > > +
> > > > +             ddr = drmcg->dev_resources[adev->ddev->primary->index];
> > > +
> > > +             bitmap_to_arr32(mask, ddr->lgpu_allocated, mask_size);
> > > +
> > > +             (*q)->properties.cu_mask_count = mask_size;
> > > +             (*q)->properties.cu_mask = mask;
> > > +
> > > +             drmcg_put(drmcg);
> > > +     }
> > > +
> > >       (*q)->device = dev;
> > >       (*q)->process = pqm->process;
> > >
> > > @@ -495,6 +528,113 @@ int pqm_get_wave_state(struct process_queue_manager *pqm,
> > >                                                      save_area_used_size);
> > >   }
> > >
> > > +bool pqm_drmcg_lgpu_validate(struct kfd_process *p, int qid, u32 *cu_mask,
> > > +             unsigned int cu_mask_size) {
> > > +     DECLARE_BITMAP(curr_mask, MAX_DRMCG_LGPU_CAPACITY);
> > > +     struct drmcg_device_resource *ddr;
> > > +     struct process_queue_node *pqn;
> > > +     struct amdgpu_device *adev;
> > > +     struct drmcg *drmcg;
> > > +     bool result;
> > > +
> > > +     if (cu_mask_size > MAX_DRMCG_LGPU_CAPACITY)
> > > +             return false;
> > > +
> > > +     bitmap_from_arr32(curr_mask, cu_mask, cu_mask_size);
> > > +
> > > +     pqn = get_queue_by_qid(&p->pqm, qid);
> > > +     if (!pqn)
> > > +             return false;
> > > +
> > > +     adev = (struct amdgpu_device *)pqn->q->device->kgd;
> > > +
> > > +     drmcg = drmcg_get(p->lead_thread);
> > > +     ddr = drmcg->dev_resources[adev->ddev->primary->index];
> > > +
> > > +     if (bitmap_subset(curr_mask, ddr->lgpu_allocated,
> > > +                             MAX_DRMCG_LGPU_CAPACITY))
> > > +             result = true;
> > > +     else
> > > +             result = false;
> > > +
> > > +     drmcg_put(drmcg);
> > > +
> > > +     return result;
> > > +}
> > > +
> > > +int amdgpu_amdkfd_update_cu_mask_for_process(struct task_struct *task,
> > > +             struct amdgpu_device *adev, unsigned long *lgpu_bm,
> > > +             unsigned int lgpu_bm_size) {
> > > +     struct kfd_dev *kdev = adev->kfd.dev;
> > > +     struct process_queue_node *pqn;
> > > +     struct kfd_process *kfdproc;
> > > +     size_t size_in_bytes;
> > > +     u32 *cu_mask;
> > > +     int rc = 0;
> > > +
> > > +     if ((lgpu_bm_size % 32) != 0) {
> > > +             pr_warn("lgpu_bm_size %d must be a multiple of 32",
> > > +                             lgpu_bm_size);
> > > +             return -EINVAL;
> > > +     }
> > > +
> > > +     kfdproc = kfd_get_process(task);
> > > +
> > > +     if (IS_ERR(kfdproc))
> > > +             return -ESRCH;
> > > +
> > > +     size_in_bytes = sizeof(u32) * round_up(lgpu_bm_size, 32);
> > > +
> > > +     mutex_lock(&kfdproc->mutex);
> > > +     list_for_each_entry(pqn, &kfdproc->pqm.queues, process_queue_list) {
> > > +             if (pqn->q && pqn->q->device == kdev) {
> > > +                     /* update cu_mask accordingly */
> > > +                     cu_mask = kzalloc(size_in_bytes, GFP_KERNEL);
> > > +                     if (!cu_mask) {
> > > +                             rc = -ENOMEM;
> > > +                             break;
> > > +                     }
> > > +
> > > +                     if (pqn->q->properties.cu_mask) {
> > > > +                             DECLARE_BITMAP(curr_mask,
> > > > +                                             MAX_DRMCG_LGPU_CAPACITY);
> > > +
> > > +                             if (pqn->q->properties.cu_mask_count >
> > > +                                             lgpu_bm_size) {
> > > +                                     rc = -EINVAL;
> > > +                                     kfree(cu_mask);
> > > +                                     break;
> > > +                             }
> > > +
> > > > +                             bitmap_from_arr32(curr_mask,
> > > > +                                             pqn->q->properties.cu_mask,
> > > > +                                             pqn->q->properties.cu_mask_count);
> > > +
> > > +                             bitmap_and(curr_mask, curr_mask, lgpu_bm,
> > > +                                             lgpu_bm_size);
> > > +
> > > +                             bitmap_to_arr32(cu_mask, curr_mask,
> > > +                                             lgpu_bm_size);
> > > +
> > > +                             kfree(curr_mask);
> > > +                     } else
> > > +                             bitmap_to_arr32(cu_mask, lgpu_bm,
> > > +                                             lgpu_bm_size);
> > > +
> > > +                     pqn->q->properties.cu_mask = cu_mask;
> > > > +                     pqn->q->properties.cu_mask_count = lgpu_bm_size;
> > > +
> > > +                     rc = pqn->q->device->dqm->ops.update_queue(
> > > +                                     pqn->q->device->dqm, pqn->q);
> > > +             }
> > > +     }
> > > +     mutex_unlock(&kfdproc->mutex);
> > > +
> > > +     return rc;
> > > +}
> > > +
> > >   #if defined(CONFIG_DEBUG_FS)
> > >
> > >   int pqm_debugfs_mqds(struct seq_file *m, void *data)