[PATCH] procfs/dmabuf: Add /proc/<pid>/task/<tid>/dmabuf_fds

Michal Hocko mhocko at suse.com
Wed Jan 27 09:05:26 UTC 2021


[Cc linux-api as this is a new user interface]

On Tue 26-01-21 22:51:28, Kalesh Singh wrote:
> In order to measure how much memory a process actually consumes, it is
> necessary to include the DMA buffer sizes for that process in the memory
> accounting. Since the handle to DMA buffers are raw FDs, it is important
> to be able to identify which processes have FD references to a DMA buffer.
> 
> Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and
> /proc/<pid>/fdinfo -- both of which are only root readable, as follows:
>   1. Do a readlink on each FD.
>   2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
>   3. stat the file to get the dmabuf inode number.
>   4. Read/ proc/<pid>/fdinfo/<fd>, to get the DMA buffer size.
> 
> Android captures per-process system memory state when certain low memory
> events (e.g a foreground app kill) occur, to identify potential memory
> hoggers. To include a process’s dmabuf usage as part of its memory state,
> the data collection needs to be fast enough to reflect the memory state at
> the time of such events.
> 
> Since reading /proc/<pid>/fd/ and /proc/<pid>/fdinfo/ requires root
> privileges, this approach is not suitable for production builds. Granting
> root privileges even to a system process increases the attack surface and
> is highly undesirable. Additionally this is slow as it requires many
> context switches for searching and getting the dma-buf info.
> 
> With the addition of per-buffer dmabuf stats in sysfs [1], the DMA buffer
> details can be queried using their unique inode numbers.
> 
> This patch proposes adding a /proc/<pid>/task/<tid>/dmabuf_fds interface.
> 
> /proc/<pid>/task/<tid>/dmabuf_fds contains a list of inode numbers for
> every DMA buffer FD that the task has. Entries with the same inode
> number can appear more than once, indicating the total FD references
> for the associated DMA buffer.
> 
> If a thread shares the same files as the group leader then its
> dmabuf_fds file will be empty, as these dmabufs are reported by the
> group leader.
> 
> The interface requires PTRACE_MODE_READ_FSCRED (same as /proc/<pid>/maps)
> and allows the efficient accounting of per-process DMA buffer usage without
> requiring root privileges. (See data below)
> 
> Performance Comparison:
> -----------------------
> 
> The following data compares the time to capture the sizes of all DMA
> buffers referenced by FDs for all processes on an arm64 android device.
> 
> -------------------------------------------------------
>                    |  Core 0 (Little)  |  Core 7 (Big) |
> -------------------------------------------------------
> >From <pid>/fdinfo  |      318 ms       |     145 ms    |
> -------------------------------------------------------
> Inodes from        |      114 ms       |      27 ms    |
> dmabuf_fds;        |    (2.8x  ^)      |   (5.4x  ^)   |
> data from sysfs    |                   |               |
> -------------------------------------------------------
> 
> It can be inferred that in the worst case there is a 2.8x speedup for
> determining per-process DMA buffer FD sizes, when using the proposed
> interfaces.
> 
> [1] https://lore.kernel.org/dri-devel/20210119225723.388883-1-hridya@google.com/
> 
> Signed-off-by: Kalesh Singh <kaleshsingh at google.com>
> ---
>  Documentation/filesystems/proc.rst |  30 ++++++
>  drivers/dma-buf/dma-buf.c          |   7 +-
>  fs/proc/Makefile                   |   1 +
>  fs/proc/base.c                     |   1 +
>  fs/proc/dma_bufs.c                 | 159 +++++++++++++++++++++++++++++
>  fs/proc/internal.h                 |   1 +
>  include/linux/dma-buf.h            |   5 +
>  7 files changed, 198 insertions(+), 6 deletions(-)
>  create mode 100644 fs/proc/dma_bufs.c
> 
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 2fa69f710e2a..757dd47ab679 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -47,6 +47,7 @@ fixes/update part 1.1  Stefani Seibold <stefani at seibold.net>    June 9 2009
>    3.10  /proc/<pid>/timerslack_ns - Task timerslack value
>    3.11	/proc/<pid>/patch_state - Livepatch patch operation state
>    3.12	/proc/<pid>/arch_status - Task architecture specific information
> +  3.13	/proc/<pid>/task/<tid>/dmabuf_fds - DMA buffers referenced by an FD
>  
>    4	Configuring procfs
>    4.1	Mount options
> @@ -2131,6 +2132,35 @@ AVX512_elapsed_ms
>    the task is unlikely an AVX512 user, but depends on the workload and the
>    scheduling scenario, it also could be a false negative mentioned above.
>  
> +3.13 /proc/<pid>/task/<tid>/dmabuf_fds - DMA buffers referenced by an FD
> +-------------------------------------------------------------------------
> +This file  exposes a list of the inode numbers for every DMA buffer
> +FD that the task has.
> +
> +The same inode number can appear more than once, indicating the total
> +FD references for the associated DMA buffer.
> +
> +The inode number can be used to lookup the DMA buffer information in
> +the sysfs interface /sys/kernel/dmabuf/buffers/<inode-no>/.
> +
> +Example Output
> +~~~~~~~~~~~~~~
> +$ cat /proc/612/task/612/dmabuf_fds
> +30972 30973 45678 49326
> +
> +Permission to access this file is governed by a ptrace access mode
> +PTRACE_MODE_READ_FSCREDS.
> +
> +Threads can have different files when created without specifying
> +the CLONE_FILES flag. For this reason the interface is presented as
> +/proc/<pid>/task/<tid>/dmabuf_fds and not /proc/<pid>/dmabuf_fds.
> +This simplifies kernel code and aggregation can be handled in
> +userspace.
> +
> +If a thread has the same files as its group leader, then its dmabuf_fds
> +file will be empty as these dmabufs are already reported by the
> +group leader.
> +
>  Chapter 4: Configuring procfs
>  =============================
>  
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 9ad6397aaa97..0660c06be4c6 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -29,8 +29,6 @@
>  #include <uapi/linux/dma-buf.h>
>  #include <uapi/linux/magic.h>
>  
> -static inline int is_dma_buf_file(struct file *);
> -
>  struct dma_buf_list {
>  	struct list_head head;
>  	struct mutex lock;
> @@ -434,10 +432,7 @@ static const struct file_operations dma_buf_fops = {
>  	.show_fdinfo	= dma_buf_show_fdinfo,
>  };
>  
> -/*
> - * is_dma_buf_file - Check if struct file* is associated with dma_buf
> - */
> -static inline int is_dma_buf_file(struct file *file)
> +int is_dma_buf_file(struct file *file)
>  {
>  	return file->f_op == &dma_buf_fops;
>  }
> diff --git a/fs/proc/Makefile b/fs/proc/Makefile
> index bd08616ed8ba..91a67f43ddf4 100644
> --- a/fs/proc/Makefile
> +++ b/fs/proc/Makefile
> @@ -16,6 +16,7 @@ proc-y	+= cmdline.o
>  proc-y	+= consoles.o
>  proc-y	+= cpuinfo.o
>  proc-y	+= devices.o
> +proc-y	+= dma_bufs.o
>  proc-y	+= interrupts.o
>  proc-y	+= loadavg.o
>  proc-y	+= meminfo.o
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index b3422cda2a91..af15a60b9831 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3598,6 +3598,7 @@ static const struct pid_entry tid_base_stuff[] = {
>  #ifdef CONFIG_SECCOMP_CACHE_DEBUG
>  	ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
>  #endif
> +	REG("dmabuf_fds", 0444, proc_tid_dmabuf_fds_operations),
>  };
>  
>  static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
> diff --git a/fs/proc/dma_bufs.c b/fs/proc/dma_bufs.c
> new file mode 100644
> index 000000000000..46ea9cf968ed
> --- /dev/null
> +++ b/fs/proc/dma_bufs.c
> @@ -0,0 +1,159 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Per-process DMA-BUF Stats
> + *
> + * Copyright (C) 2021 Google LLC.
> + */
> +
> +#include <linux/dma-buf.h>
> +#include <linux/fdtable.h>
> +#include <linux/ptrace.h>
> +#include <linux/seq_file.h>
> +
> +#include "internal.h"
> +
> +struct dmabuf_fds_private {
> +	struct inode *inode;
> +	struct task_struct *task;
> +	struct file *dmabuf_file;
> +};
> +
> +static loff_t *next_dmabuf(struct dmabuf_fds_private *priv,
> +		loff_t *pos)
> +{
> +	struct fdtable *fdt;
> +	struct file *file;
> +
> +	rcu_read_lock();
> +	fdt = files_fdtable(priv->task->files);
> +	for (; *pos < fdt->max_fds; ++*pos) {
> +		file = files_lookup_fd_rcu(priv->task->files, (unsigned int) *pos);
> +		if (file && is_dma_buf_file(file) && get_file_rcu(file)) {
> +			priv->dmabuf_file = file;
> +			break;
> +		}
> +	}
> +	if (*pos >= fdt->max_fds)
> +		pos = NULL;
> +	rcu_read_unlock();
> +
> +	return pos;
> +}
> +
> +static void *dmabuf_fds_seq_start(struct seq_file *s, loff_t *pos)
> +{
> +	struct dmabuf_fds_private *priv = s->private;
> +	struct files_struct *group_leader_files;
> +
> +	priv->task = get_proc_task(priv->inode);
> +
> +	if (!priv->task)
> +		return ERR_PTR(-ESRCH);
> +
> +	/* Hold task lock for duration that files need to be stable */
> +	task_lock(priv->task);
> +
> +	/*
> +	 * If this task is not the group leader but shares the same files, leave file empty.
> +	 * These dmabufs are already reported in the group leader's dmabuf_fds.
> +	 */
> +	group_leader_files = priv->task->group_leader->files;
> +	if (priv->task != priv->task->group_leader && priv->task->files == group_leader_files) {
> +		task_unlock(priv->task);
> +		put_task_struct(priv->task);
> +		priv->task = NULL;
> +		return NULL;
> +	}
> +
> +	return next_dmabuf(priv, pos);
> +}
> +
> +static void *dmabuf_fds_seq_next(struct seq_file *s, void *v, loff_t *pos)
> +{
> +	++*pos;
> +	return next_dmabuf(s->private, pos);
> +}
> +
> +static void dmabuf_fds_seq_stop(struct seq_file *s, void *v)
> +{
> +	struct dmabuf_fds_private *priv = s->private;
> +
> +	if (priv->task) {
> +		task_unlock(priv->task);
> +		put_task_struct(priv->task);
> +
> +	}
> +	if (priv->dmabuf_file)
> +		fput(priv->dmabuf_file);
> +}
> +
> +static int dmabuf_fds_seq_show(struct seq_file *s, void *v)
> +{
> +	struct dmabuf_fds_private *priv = s->private;
> +	struct file *file = priv->dmabuf_file;
> +	struct dma_buf *dmabuf = file->private_data;
> +
> +	if (!dmabuf)
> +		return -ESRCH;
> +
> +	seq_printf(s, "%8lu ", file_inode(file)->i_ino);
> +
> +	fput(priv->dmabuf_file);
> +	priv->dmabuf_file = NULL;
> +
> +	return 0;
> +}
> +
> +static const struct seq_operations proc_tid_dmabuf_fds_seq_ops = {
> +	.start = dmabuf_fds_seq_start,
> +	.next  = dmabuf_fds_seq_next,
> +	.stop  = dmabuf_fds_seq_stop,
> +	.show  = dmabuf_fds_seq_show
> +};
> +
> +static int proc_dmabuf_fds_open(struct inode *inode, struct file *file,
> +		     const struct seq_operations *ops)
> +{
> +	struct dmabuf_fds_private *priv;
> +	struct task_struct *task;
> +	bool allowed = false;
> +
> +	task = get_proc_task(inode);
> +	if (!task)
> +		return -ESRCH;
> +
> +	allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
> +	put_task_struct(task);
> +
> +	if (!allowed)
> +		return -EACCES;
> +
> +	priv = __seq_open_private(file, ops, sizeof(*priv));
> +	if (!priv)
> +		return -ENOMEM;
> +
> +	priv->inode = inode;
> +	priv->task = NULL;
> +	priv->dmabuf_file = NULL;
> +
> +	return 0;
> +}
> +
> +static int proc_dmabuf_fds_release(struct inode *inode, struct file *file)
> +{
> +	return seq_release_private(inode, file);
> +}
> +
> +static int tid_dmabuf_fds_open(struct inode *inode, struct file *file)
> +{
> +	return proc_dmabuf_fds_open(inode, file,
> +			&proc_tid_dmabuf_fds_seq_ops);
> +}
> +
> +const struct file_operations proc_tid_dmabuf_fds_operations = {
> +	.open		= tid_dmabuf_fds_open,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= proc_dmabuf_fds_release,
> +};
> +
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index f60b379dcdc7..4ca74220db9c 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -303,6 +303,7 @@ extern const struct file_operations proc_pid_smaps_operations;
>  extern const struct file_operations proc_pid_smaps_rollup_operations;
>  extern const struct file_operations proc_clear_refs_operations;
>  extern const struct file_operations proc_pagemap_operations;
> +extern const struct file_operations proc_tid_dmabuf_fds_operations;
>  
>  extern unsigned long task_vsize(struct mm_struct *);
>  extern unsigned long task_statm(struct mm_struct *,
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index cf72699cb2bc..087e11f7f193 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -27,6 +27,11 @@ struct device;
>  struct dma_buf;
>  struct dma_buf_attachment;
>  
> +/**
> + * Check if struct file* is associated with dma_buf.
> + */
> +int is_dma_buf_file(struct file *file);
> +
>  /**
>   * struct dma_buf_ops - operations possible on struct dma_buf
>   * @vmap: [optional] creates a virtual mapping for the buffer into kernel
> -- 
> 2.30.0.280.ga3ce27912f-goog

-- 
Michal Hocko
SUSE Labs


More information about the dri-devel mailing list