[RFC v5 1/6] gpu: rfc: Proposal for a GPU cgroup controller

Wed Apr 20 23:52:19 UTC 2022

From: Hridya Valsaraju <hridya at google.com>

This patch adds a proposal for a new GPU cgroup controller for
accounting/limiting GPU and GPU-related memory allocations.
The proposed controller is based on the DRM cgroup controller[1] and
follows the design of the RDMA cgroup controller.

The new cgroup controller would:
* Allow setting per-device limits on the total size of buffers
  allocated by a device/allocator within a cgroup.
* Expose a per-device/allocator breakdown of the buffers charged to a
  cgroup.

The prototype in the following patches is only for memory accounting
using the GPU cgroup controller and does not implement limit setting.

[1]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.com/

Signed-off-by: Hridya Valsaraju <hridya at google.com>
Signed-off-by: T.J. Mercier <tjmercier at google.com>

---
v5 changes
Drop the global GPU cgroup "total" (sum of all device totals) portion
of the design since, per Tejun Heo, there is no currently known use
for it.

Update for renamed functions/variables.

v3 changes
Remove Upstreaming Plan from gpu-cgroup.rst per John Stultz.

Use more common dual author commit message format per John Stultz.
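
Illustration only, not part of this patch: gpu-cgroup.rst proposes that
gpu.memory.current report one "<name> <bytes>" pair per line. A minimal
userspace sketch of parsing one such line might look like the code below.
parse_gpu_mem_line(), struct gpu_mem_entry and EXAMPLE_NAME_MAX are
hypothetical names that appear nowhere in this series, and the sketch
assumes that bucket names contain no spaces.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical bound for this example only; the real limit would be
 * whatever GPUCG_BUCKET_NAME_MAX_LEN is defined as in the series.
 */
#define EXAMPLE_NAME_MAX 64

/* One parsed "<name> <bytes>" entry (hypothetical userspace type). */
struct gpu_mem_entry {
        char name[EXAMPLE_NAME_MAX];
        uint64_t bytes;
};

/*
 * Parse a single line such as "dev1 4194304\n".
 * Returns 0 on success and -EINVAL if the line is malformed.
 */
static int parse_gpu_mem_line(const char *line, struct gpu_mem_entry *out)
{
        const char *sep = strchr(line, ' ');
        unsigned long long val;
        size_t len;
        char *end;

        /* A name followed by a single space is required. */
        if (!sep || sep == line)
                return -EINVAL;

        len = (size_t)(sep - line);
        if (len >= sizeof(out->name))
                return -EINVAL;

        /* strtoull() accepts signs and whitespace; insist on a digit. */
        if (sep[1] < '0' || sep[1] > '9')
                return -EINVAL;

        errno = 0;
        val = strtoull(sep + 1, &end, 10);
        if (errno == ERANGE)
                return -EINVAL;

        /* Only a trailing newline may follow the number. */
        if (*end != '\0' && *end != '\n')
                return -EINVAL;

        memcpy(out->name, line, len);
        out->name[len] = '\0';
        out->bytes = val;
        return 0;
}
```

A caller would read the file line by line (for example with fgets()) and
pass each line to parse_gpu_mem_line(), skipping or reporting any line that
returns -EINVAL.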
---
 Documentation/gpu/rfc/gpu-cgroup.rst | 190 +++++++++++++++++++++++++++
 Documentation/gpu/rfc/index.rst      |   4 +
 2 files changed, 194 insertions(+)
 create mode 100644 Documentation/gpu/rfc/gpu-cgroup.rst

diff --git a/Documentation/gpu/rfc/gpu-cgroup.rst b/Documentation/gpu/rfc/gpu-cgroup.rst
new file mode 100644
index 000000000000..0be2a3a9f641
--- /dev/null
+++ b/Documentation/gpu/rfc/gpu-cgroup.rst
@@ -0,0 +1,190 @@
+===================================
+GPU cgroup controller
+===================================
+
+Goals
+=====
+This document outlines a plan to create a cgroup v2 controller subsystem
+for the per-cgroup accounting of device and system memory allocated by the GPU
+and related subsystems.
+
+The new cgroup controller would:
+
+* Allow setting per-device limits on the total size of buffers allocated by a
+  device/allocator within a cgroup.
+
+* Expose a per-device/allocator breakdown of the buffers charged to a cgroup.
+
+Alternatives Considered
+=======================
+
+The following alternatives were considered:
+
+The memory cgroup controller
+____________________________
+
+1. As was noted in [1], memory accounting provided by the GPU cgroup
+controller is not a good fit for integration into memcg due to the
+differences in how accounting is performed. It implements a mechanism
+for the allocator attribution of GPU and GPU-related memory by
+charging each buffer to the cgroup of the process on behalf of which
+the memory was allocated. The buffer stays charged to the cgroup until
+it is freed regardless of whether the process retains any references
+to it. On the other hand, the memory cgroup controller offers a more
+fine-grained charging and uncharging behavior depending on the kind of
+page being accounted.
+
+2. Memcg performs accounting in units of pages. In the DMA-BUF buffer sharing model,
+a process takes a reference to the entire buffer (hence keeping it alive) even if
+it is only accessing parts of it. Therefore, per-page memory tracking for DMA-BUF
+memory accounting would only introduce additional overhead without any benefits.
+
+[1]: https://patchwork.kernel.org/project/dri-devel/cover/20190501140438.9506-1-brian.welty@intel.com/#22624705
+
+Userspace service to keep track of buffer allocations and releases
+__________________________________________________________________
+
+1. There is no way for a userspace service to intercept all allocations and releases.
+2. If the service is killed or restarted, all accounting up to that point is lost.
+
+UAPI
+====
+When enabled, the new cgroup controller would create the following files in every cgroup.
+
+::
+
+        gpu.memory.current (R)
+        gpu.memory.max (R/W)
+
+gpu.memory.current is a read-only file and would contain per-device memory allocations
+in a key-value format, where the key is a string representing the device name and the
+value is the size, in bytes, of the memory charged to that device in the cgroup. The
+device name should be globally unique.
+
+For example:
+
+::
+
+        cat /sys/kernel/fs/cgroup1/gpu.memory.current
+        dev1 4194304
+        dev2 4194304
+
+The string key for each device is set by the device driver when the device registers
+with the GPU cgroup controller to participate in resource accounting (see section
+'Design and Implementation' for more details).
+
+gpu.memory.max is a read/write file. It would show the current size limits on
+memory usage for each allocator/device.
+
+Setting a limit for a particular device/allocator can be done as follows:
+
+::
+
+        echo "dev1 4194304" > /sys/kernel/fs/cgroup1/gpu.memory.max
+
+In this example, 'dev1' is the string key set by the device driver during
+registration.
+
+Design and Implementation
+=========================
+
+The cgroup controller would closely follow the design of the RDMA cgroup controller
+subsystem where each cgroup maintains a list of resource pools.
+Each resource pool is associated with a device name via a pointer to a struct gpucg_bucket
+and contains a counter that tracks the current usage and the maximum limit set for the device.
+
+The code block below is a preliminary sketch of what the core kernel data structures
+and APIs would look like.
+
+.. code-block:: c
+
+        /* The GPU cgroup controller data structure */
+        struct gpucg {
+                struct cgroup_subsys_state css;
+
+                /* list of all resource pools that belong to this cgroup */
+                struct list_head rpools;
+        };
+
+        /* A named entity representing a bucket of tracked memory. */
+        struct gpucg_bucket {
+                /* list of various resource pools in various cgroups that the bucket is part of */
+                struct list_head rpools;
+
+                /* list of all buckets registered for GPU cgroup accounting */
+                struct list_head bucket_node;
+
+                /* string to be used as identifier for accounting and limit setting */
+                const char *name;
+        };
+
+        struct gpucg_resource_pool {
+                /* The bucket whose resource usage is tracked by this resource pool */
+                struct gpucg_bucket *bucket;
+
+                /* list of all resource pools for the cgroup */
+                struct list_head cg_node;
+
+                /* list maintained by the gpucg_bucket to keep track of its resource pools */
+                struct list_head bucket_node;
+
+                /* tracks memory usage of the resource pool */
+                struct page_counter total;
+        };
+
+        /**
+         * gpucg_register_bucket - Registers a bucket for memory accounting using the
+         * GPU cgroup controller.
+         *
+         * @bucket: The bucket to register for memory accounting.
+         * @name: Pointer to a null-terminated string to denote the name of the bucket. This name
+         *        should be globally unique, and should not exceed @GPUCG_BUCKET_NAME_MAX_LEN bytes.
+         *
+         * @bucket must remain valid. @name will be copied.
+         */
+        void gpucg_register_bucket(struct gpucg_bucket *bucket, const char *name);
+
+        /**
+         * gpucg_charge - charge memory to the specified gpucg and gpucg_bucket.
+         *
+         * @gpucg: The gpu cgroup to charge the memory to.
+         * @bucket: The bucket to charge the memory to.
+         * @size: The size of memory to charge in bytes.
+         *        This size will be rounded up to the nearest page size.
+         *
+         * Return: returns 0 if the charging is successful and otherwise returns an
+         * error code.
+         */
+        int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size);
+
+        /**
+         * gpucg_uncharge - uncharge memory from the specified gpucg and gpucg_bucket.
+         * The caller must hold a reference to @gpucg obtained through gpucg_get().
+         *
+         * @gpucg: The gpu cgroup to uncharge the memory from.
+         * @bucket: The bucket to uncharge the memory from.
+         * @size: The size of memory to uncharge in bytes.
+         *        This size will be rounded up to the nearest page size.
+         */
+        void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size);
+
+        /**
+         * gpucg_transfer_charge - Transfer a GPU charge from one cgroup to another.
+         *
+         * @source:	[in]	The GPU cgroup the charge will be transferred from.
+         * @dest:	[in]	The GPU cgroup the charge will be transferred to.
+         * @bucket:	[in]	The GPU cgroup bucket corresponding to the charge.
+         * @size:	[in]	The size of the memory in bytes.
+         *                      This size will be rounded up to the nearest page size.
+         *
+         * Returns 0 on success, or a negative errno code otherwise.
+         */
+        int gpucg_transfer_charge(struct gpucg *source,
+                                  struct gpucg *dest,
+                                  struct gpucg_bucket *bucket,
+                                  u64 size);
+
+
+Future Work
+===========
+Additional GPU resources can be supported by adding new controller files.
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index 91e93a705230..0a9bcd94e95d 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -23,3 +23,7 @@ host such documentation:
 .. toctree::
 
     i915_scheduler.rst
+
+.. toctree::
+
+    gpu-cgroup.rst
-- 
2.36.0.rc0.470.gd361397f0d-goog