[PATCH] drm/xe/debugfs: Display gpuva info in debugfs
Zeng, Oak
oak.zeng at intel.com
Fri Sep 20 15:57:19 UTC 2024
> -----Original Message-----
> From: Brost, Matthew <matthew.brost at intel.com>
> Sent: Thursday, September 19, 2024 3:02 PM
> To: Zeng, Oak <oak.zeng at intel.com>
> Cc: intel-xe at lists.freedesktop.org
> Subject: Re: [PATCH] drm/xe/debugfs: Display gpuva info in debugfs
>
> On Wed, Sep 18, 2024 at 12:45:45PM -0400, Oak Zeng wrote:
> > Add a debugfs entry to display all the gpuvas that are bound to the
> > device. This is done by walking all the VMs created for each device
> > process and displaying the VA information of each VM. This is useful
> > for gpuvm debugging. With this change, the user can display gpuva
> > info as below:
> >
> > root@DUT7279PVC:/home/gta# cat /sys/kernel/debug/dri/0/gpuvas
> > Process "your process name" GPU VA space
> > DRM GPU VA space (Xe VM) [0x0000000000000000;0x0200000000000000]
> > Kernel reserved node [0x0000000000000000;0x0000000000000000]
> >
> > VAs | start | range | end | object | object offset
> > -------------------------------------------------------------------------------------------------------------
> > | 0x0000000000000000 | 0x00007ffff5db7000 | 0x00007ffff5db7000 | 0x0000000000000000 | 0x0000000000000000
> > | 0x00007ffff5db7000 | 0x0000000000001000 | 0x00007ffff5db8000 | 0x0000000000000000 | 0x00007ffff5db7000
> > | 0x00007ffff5db8000 | 0x00ff80000a248000 | 0x0100000000000000 | 0x0000000000000000 | 0x0000000000000000
> >
>
> This looks useful. We may even want to extend wedged mode to switch
> the VM to a state where this is available after a hang which causes a
> wedge, as this could be useful. That can be done in a follow-up later.
>
> > Signed-off-by: Oak Zeng <oak.zeng at intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_debugfs.c | 44 +++++++++++++++++++++++++++++++++
> >  1 file changed, 44 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/xe_debugfs.c
> > index 1011e5d281fa..0c7bee1c2a8d 100644
> > --- a/drivers/gpu/drm/xe/xe_debugfs.c
> > +++ b/drivers/gpu/drm/xe/xe_debugfs.c
> > @@ -9,6 +9,7 @@
> > #include <linux/string_helpers.h>
> >
> > #include <drm/drm_debugfs.h>
> > +#include <drm/drm_drv.h>
> >
> > #include "xe_bo.h"
> > #include "xe_device.h"
> > @@ -84,9 +85,52 @@ static int sriov_info(struct seq_file *m, void *data)
> > return 0;
> > }
> >
> > +static int show_vm_gpuvas(struct xe_vm *vm, struct seq_file *m)
> > +{
> > + int ret;
> > +
> > + mutex_lock(&vm->snap_mutex);
> > + ret = drm_debugfs_gpuva_info(m, &vm->gpuvm);
> > + mutex_unlock(&vm->snap_mutex);
> > +
> > + return ret;
> > +}
> > +
> > +static int show_each_vm(struct seq_file *m, void *arg)
> > +{
> > + struct drm_info_node *node = (struct drm_info_node *)m->private;
> > + struct xe_device *xe = node_to_xe(node);
> > + struct drm_device *dev = &xe->drm;
> > + struct drm_file *file;
> > + struct xe_file *xef;
> > + int (*show)(struct xe_vm *, struct seq_file *) = node->info_ent->data;
> > + struct xe_vm *vm;
> > + unsigned long idx;
> > + int ret = 0;
> > +
> > + mutex_lock(&dev->filelist_mutex);
> > + list_for_each_entry(file, &dev->filelist, lhead) {
> > + xef = (struct xe_file *)file->driver_priv;
> > + seq_printf(m, "Process %s GPU VA space\n", xef->process_name);
> > + mutex_lock(&xef->vm.lock);
> > + xa_for_each(&xef->vm.xa, idx, vm) {
>
> Kind of a nit, but I've said this in a few other places - the xef->vm.lock
> usage isn't right here. This lock protects the xarray lookup and the
> ref to the VM and nothing else. It is not designed to do anything else -
> e.g. holding the lock and then calling another function.
I don't see a problem with calling a function while holding xef->vm.lock. We
require the VM to be alive while calling show_vm_gpuvas, and xef->vm.lock
guarantees the VM is alive, because it guarantees the VM is on the xarray,
and as long as it is on the xarray we hold a VM ref.
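To spell out the invariant I'm relying on, here is a rough sketch (from
memory, not the exact xe code - the function names are made up for
illustration): the xarray entry owns a VM reference, and both insert and
erase of that entry happen under xef->vm.lock, so a VM observed under that
lock cannot lose its last reference.

static int vm_add_sketch(struct xe_file *xef, struct xe_vm *vm, u32 *id)
{
	int err;

	mutex_lock(&xef->vm.lock);
	/* the entry in xef->vm.xa keeps a reference to vm */
	err = xa_alloc(&xef->vm.xa, id, vm, xa_limit_32b, GFP_KERNEL);
	mutex_unlock(&xef->vm.lock);

	return err;
}

static void vm_remove_sketch(struct xe_file *xef, u32 id)
{
	struct xe_vm *vm;

	mutex_lock(&xef->vm.lock);
	vm = xa_erase(&xef->vm.xa, id);
	mutex_unlock(&xef->vm.lock);

	/* the reference is dropped only after the entry is gone */
	xe_vm_put(vm);
}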
>
> With that this loop should look like this.
>
> mutex_lock(&xef->vm.lock);
> xa_for_each(&xef->vm.xa, idx, vm) {
I don't think xa_for_each is safe against removing items from the xarray,
at least that is my understanding.
> xe_vm_get(vm);
> mutex_unlock(&xef->vm.lock);
>
> /* Do something */
>
> mutex_lock(&xef->vm.lock);
> xe_vm_put(vm);
> }
> mutex_unlock(&xef->vm.lock);
>
> This avoids unintentionally creating locking chains, in this case
> xef->vm.lock -> snap_mutex. We should strive to only create well-defined
> chains of locking. Once we start unintentionally creating locking
> chains, it is hard to unwind them in a driver.
Can we annotate the xef->vm.lock -> snap_mutex locking order? It seems
reasonable to me from the locking hierarchy here: xef->vm.lock is at the xe
file level, and snap_mutex is at the VM level. We can create multiple VMs for
one xe file, so this locking order seems natural to me. The key point is not
to be afraid of creating a locking chain, but rather to create the locking
chain in a reasonable order. Fortunately, the locking order in this case is
very obvious.
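For example, something like the sketch below would both document the nesting
and let lockdep check it. It assumes the VM keeps a back-pointer to its
xe_file (vm->xef here, which this sketch introduces); otherwise the xef would
need to be passed in:

static int show_vm_gpuvas(struct xe_vm *vm, struct seq_file *m)
{
	int ret;

	/*
	 * The caller iterates xef->vm.xa and must hold xef->vm.lock; assert
	 * that here so the xef->vm.lock -> vm->snap_mutex nesting is
	 * explicit. Lockdep records the chain itself the first time the two
	 * locks are actually taken nested.
	 *
	 * NOTE: vm->xef is an assumption of this sketch.
	 */
	lockdep_assert_held(&vm->xef->vm.lock);

	mutex_lock(&vm->snap_mutex);
	ret = drm_debugfs_gpuva_info(m, &vm->gpuvm);
	mutex_unlock(&vm->snap_mutex);

	return ret;
}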
>
> Likewise, I suspect we really shouldn't be holding dev->filelist_mutex
> and doing things underneath it. Fixing that would likely require
> converting &dev->filelist to an xarray or writing an iterator with a
> hitch, though, so it is a bit of a larger change. Looking at the usage of
> this lock, I'd say this is kind of abusing this lock too, and I don't
> think anyone in the community would be all that happy about its usage
> here. I'd lean towards properly fixing this, or alternatively maybe just
> maintaining a per-device storage of all the VMs.
Take a look at the drm_client_info function. Holding filelist_mutex and doing
things underneath it, including holding another lock, seems fine. That code
has been there for many years without a problem. It is natural to hold
filelist_mutex to walk the device's file list, as this lock is designed to
protect the file list. I don't think it is an abuse of this lock.
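From memory, the shape of that code is roughly the following (a simplified
sketch, not the exact upstream source): dev->filelist is walked under
filelist_mutex and another lock is taken underneath it per client.

static int clients_info_sketch(struct seq_file *m, struct drm_device *dev)
{
	struct drm_file *priv;

	mutex_lock(&dev->filelist_mutex);
	list_for_each_entry(priv, &dev->filelist, lhead) {
		struct task_struct *task;

		/* another lock taken underneath filelist_mutex */
		rcu_read_lock();
		task = pid_task(rcu_dereference(priv->pid), PIDTYPE_TGID);
		seq_printf(m, "command %s\n", task ? task->comm : "<unknown>");
		rcu_read_unlock();
	}
	mutex_unlock(&dev->filelist_mutex);

	return 0;
}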
So I think it is reasonable to annotate a dev->filelist_mutex -> xef->vm.lock
-> vm->snap_mutex locking order. This is the same order in which the objects
are created: device -> device file descriptor for a device -> gpuvm for a
device file descriptor. Any reverse-order locking is not reasonable and is
thus a kernel driver bug.
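Concretely, that order could be documented once, e.g. as a comment next to
the debugfs dumper (just a sketch restating the above):

/*
 * Lock ordering for the gpuvas debugfs dump, outermost first - the same
 * order in which the objects are created:
 *
 *   dev->filelist_mutex    device: list of open device file descriptors
 *     -> xef->vm.lock      device fd: xarray of VMs created on that fd
 *       -> vm->snap_mutex  VM: state walked by drm_debugfs_gpuva_info()
 *
 * Taking any of these in the reverse order is a driver bug.
 */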
Oak
>
> Matt
>
> > + ret = show(vm, m);
> > + if (ret < 0)
> > + break;
> > +
> > + seq_puts(m, "\n");
> > + }
> > + mutex_unlock(&xef->vm.lock);
> > + }
> > + mutex_unlock(&dev->filelist_mutex);
> > +
> > + return ret;
> > +}
> > +
> > static const struct drm_info_list debugfs_list[] = {
> > {"info", info, 0},
> > { .name = "sriov_info", .show = sriov_info, },
> > + DRM_DEBUGFS_GPUVA_INFO(show_each_vm, show_vm_gpuvas),
> > };
> >
> > static int forcewake_open(struct inode *inode, struct file *file)
> > --
> > 2.26.3
> >