[Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset fails

Rodrigo Vivi rodrigo.vivi at intel.com
Thu Jul 20 16:32:39 UTC 2023


On Wed, Jul 19, 2023 at 01:51:37AM +0000, Ghimiray, Himal Prasad wrote:
> Hi Matt,
> 
> > -----Original Message-----
> > From: Roper, Matthew D <matthew.d.roper at intel.com>
> > Sent: 19 July 2023 05:23
> > To: Ghimiray, Himal Prasad <himal.prasad.ghimiray at intel.com>
> > Cc: intel-xe at lists.freedesktop.org; Vivi, Rodrigo <rodrigo.vivi at intel.com>
> > Subject: Re: [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset
> > fails
> > 
> > On Tue, Jul 18, 2023 at 07:02:15PM +0530, Himal Prasad Ghimiray wrote:
> > > Send uevent in case of gt reset failure. This intimation can be used
> > > by userspace monitoring tool to do the device level reset/reboot when
> > > GT reset fails. udevadm can be used to monitor the uevents.
> > 
> > This seems a bit questionable.  In theory a GT reset shouldn't fail; if it does it
> > either means we've either got a major software bug or a really catastrophic
> > hardware failure.  Letting the kernel driver just blindly move forward after
> > reporting a uevent doesn't seem like a proper response to a huge problem
> > like that.  Relying on userspace to pick up the pieces seems insufficient.
> > 
> > Right now the return code from gt_reset() is just ignored and no meaningful
> > action is taken within the driver.  It seems like we should at least be putting
> > the device into some kind of 'unusable' state (similar to i915's 'wedged') so
> > that other parts of the driver know that everything is completely broken and
> > that they should stop trying to use the device (or some subset of the device).
> > We could also potentially attempt recovery ourselves by escalating to a
> > DriverFLR.
> 
> Agreed. These feature needs to be there in the driver.

indeed. one possibility is to remove the drm card but keep the pci up waiting
for the reset. Another is to 'wedge' like i915.

> 
> > 
> > I'm not sure why any userspace would ever care about the specific condition
> > of "gt reset failed" --- it seems like it would pretty much only care about the
> > device being in one of three states and wouldn't care about the driver-
> > internal details about how/why it's in one of
> > those:
> > 
> >  * Everything is good; can use device as normal
> >  * Device partially broken; it's no longer possible to use these
> >    specific GTs, engines, etc., but other units should still function as
> >    expected.
> >  * Absolutely everything is broken and nothing works.  You _might_ be
> >    able to unload the driver, issue a PCI FLR, and reload the driver,
> >    but otherwise all hope is lost and the device cannot be used anymore.
> > 
> > Even if we decide that some uevent is appropriate here, this is new uapi and
> > will need a real userspace consumer too.  The fact that we can watch any
> > uevents raised by the device with udevadm doesn't seem like it would be
> > sufficient for that purpose; that doesn't feel much different than claiming
> > 'cat' as the userspace consumer of a sysfs node or something.
> > What we really need for new uapi is a meaningful top-to-bottom solution
> > that gives us confidence that we have an appropriate and maintainable
> > interface defined.
> 
> https://spec.oneapi.io/level-zero/latest/sysman/api.html#_CPPv441ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED is the consumer for this uapi.
> The application can register for events ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED with Sysman and start listening to this event.
> Then Sysman would send ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED event to application, incase sysman recieves uevent "RESET_FAILED=1" or "RESET_REQUIRED=1" or if Sysman detects a repair have occurred.
> 
> Then after getting ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED event application could query zesDeviceGetState  to get the reason of RESET. And Sysman would tell the reason of reset as :
> ZES_RESET_REASON_FLAG_WEDGED: Sysman would try to create a context and if context creation fails due to EIO error, then device is determined to be wedged and reset reason is declared as WEDGED
> 
> And sysman triggers a SBR to reset the device.

yeap, this is already good by itself, even independent of that mode.

But we need to think about the possibility of removing the drm card above.
Maybe it would be better to add the uevent at the pci level rather then
the drm.
And maybe with a general device_status= variable.

+Joonas

> 
> BR
> Himal
> > 
> > 
> > Matt
> > 
> > >
> > > v2:
> > > -Add NULL check for xe_gt_hw_engine return(Aravind) -Arrange variables
> > > in Christmas tree order(Tejas) -Check GUC_GSC_OTHER_CLASS(Tejas)
> > >
> > > v3:
> > > - Rebase
> > > - Remove notification for engine reset failure (Rodrigo)
> > >
> > > v4
> > > - Rectify the comments in header file.
> > >
> > > Cc: Aravind Iddamsetty <aravind.iddamsetty at intel.com>
> > > Cc: Tejas Upadhyay <tejas.upadhyay at intel.com>
> > > Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > > Reviewed-by: Badal Nilawar <badal.nilawar at intel.com>
> > > Signed-off-by: Himal Prasad Ghimiray<himal.prasad.ghimiray at intel.com>
> > > Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray at intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_gt.c | 18 ++++++++++++++++++
> > > include/uapi/drm/xe_drm.h  |  8 ++++++++
> > >  2 files changed, 26 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > > index a21d44bfe9e8..1db4d610f2fd 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt.c
> > > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > > @@ -8,6 +8,7 @@
> > >  #include <linux/minmax.h>
> > >
> > >  #include <drm/drm_managed.h>
> > > +#include <drm/xe_drm.h>
> > >
> > >  #include "regs/xe_gt_regs.h"
> > >  #include "xe_bb.h"
> > > @@ -500,6 +501,20 @@ static int do_gt_restart(struct xe_gt *gt)
> > >  	return 0;
> > >  }
> > >
> > > +static void xe_uevent_gt_reset_failure(struct xe_device *xe, u8 id) {
> > > +	char *reset_event[5];
> > > +
> > > +	reset_event[0] = XE_RESET_FAILED_UEVENT "=1";
> > > +	reset_event[1] = "RESET_ENABLED=1";
> > > +	reset_event[2] = "RESET_UNIT=gt";
> > > +	reset_event[3] = kasprintf(GFP_KERNEL, "RESET_ID=%d", id);
> > > +	reset_event[4] = NULL;
> > > +	kobject_uevent_env(&xe->drm.primary->kdev->kobj, KOBJ_CHANGE,
> > > +reset_event);
> > > +
> > > +	kfree(reset_event[3]);
> > > +}
> > > +
> > >  static int gt_reset(struct xe_gt *gt)  {
> > >  	int err;
> > > @@ -549,6 +564,9 @@ static int gt_reset(struct xe_gt *gt)
> > >  	xe_device_mem_access_put(gt_to_xe(gt));
> > >  	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
> > >
> > > +	/* Notify userspace about gt reset failure */
> > > +	xe_uevent_gt_reset_failure(gt_to_xe(gt), gt->info.id);
> > > +
> > >  	return err;
> > >  }
> > >
> > > diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> > > index 347351a8f618..c28bb54812e5 100644
> > > --- a/include/uapi/drm/xe_drm.h
> > > +++ b/include/uapi/drm/xe_drm.h
> > > @@ -16,6 +16,14 @@ extern "C" {
> > >   * subject to backwards-compatibility constraints.
> > >   */
> > >
> > > +/*
> > > + * Uevents generated by xe on it's device node.
> > > + *
> > > + * XE_RESET_FAILED_UEVENT - Event is generated when attempt to reset
> > > + * gt fails. The value supplied with the event is always 1.
> > > + */
> > > +#define XE_RESET_FAILED_UEVENT "RESET_FAILED"
> > > +
> > >  /**
> > >   * struct xe_user_extension - Base class for defining a chain of extensions
> > >   *
> > > --
> > > 2.25.1
> > >
> > 
> > --
> > Matt Roper
> > Graphics Software Engineer
> > Linux GPU Platform Enablement
> > Intel Corporation


More information about the Intel-xe mailing list