[Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset fails

Mon Jul 24 08:35:16 UTC 2023


> -----Original Message-----
> From: Vivi, Rodrigo <rodrigo.vivi at intel.com>
> Sent: 20 July 2023 22:03
> To: Ghimiray, Himal Prasad <himal.prasad.ghimiray at intel.com>
> Cc: Roper, Matthew D <matthew.d.roper at intel.com>; intel-
> xe at lists.freedesktop.org; Joonas Lahtinen
> <joonas.lahtinen at linux.intel.com>
> Subject: Re: [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset
> fails
> 
> On Wed, Jul 19, 2023 at 01:51:37AM +0000, Ghimiray, Himal Prasad wrote:
> > Hi Matt,
> >
> > > -----Original Message-----
> > > From: Roper, Matthew D <matthew.d.roper at intel.com>
> > > Sent: 19 July 2023 05:23
> > > To: Ghimiray, Himal Prasad <himal.prasad.ghimiray at intel.com>
> > > Cc: intel-xe at lists.freedesktop.org; Vivi, Rodrigo
> > > <rodrigo.vivi at intel.com>
> > > Subject: Re: [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when
> > > gt reset fails
> > >
> > > On Tue, Jul 18, 2023 at 07:02:15PM +0530, Himal Prasad Ghimiray wrote:
> > > > Send uevent in case of gt reset failure. This intimation can be
> > > > used by userspace monitoring tool to do the device level
> > > > reset/reboot when GT reset fails. udevadm can be used to monitor the
> uevents.
> > >
> > > This seems a bit questionable.  In theory a GT reset shouldn't fail;
> > > if it does it either means we've either got a major software bug or
> > > a really catastrophic hardware failure.  Letting the kernel driver
> > > just blindly move forward after reporting a uevent doesn't seem like
> > > a proper response to a huge problem like that.  Relying on userspace to
> pick up the pieces seems insufficient.
> > >
> > > Right now the return code from gt_reset() is just ignored and no
> > > meaningful action is taken within the driver.  It seems like we
> > > should at least be putting the device into some kind of 'unusable'
> > > state (similar to i915's 'wedged') so that other parts of the driver
> > > know that everything is completely broken and that they should stop
> trying to use the device (or some subset of the device).
> > > We could also potentially attempt recovery ourselves by escalating
> > > to a DriverFLR.
> >
> > Agreed. These feature needs to be there in the driver.
> 
> indeed. one possibility is to remove the drm card but keep the pci up waiting
> for the reset. Another is to 'wedge' like i915.
> 
> >
> > >
> > > I'm not sure why any userspace would ever care about the specific
> > > condition of "gt reset failed" --- it seems like it would pretty
> > > much only care about the device being in one of three states and
> > > wouldn't care about the driver- internal details about how/why it's
> > > in one of
> > > those:
> > >
> > >  * Everything is good; can use device as normal
> > >  * Device partially broken; it's no longer possible to use these
> > >    specific GTs, engines, etc., but other units should still function as
> > >    expected.
> > >  * Absolutely everything is broken and nothing works.  You _might_ be
> > >    able to unload the driver, issue a PCI FLR, and reload the driver,
> > >    but otherwise all hope is lost and the device cannot be used anymore.
> > >
> > > Even if we decide that some uevent is appropriate here, this is new
> > > uapi and will need a real userspace consumer too.  The fact that we
> > > can watch any uevents raised by the device with udevadm doesn't seem
> > > like it would be sufficient for that purpose; that doesn't feel much
> > > different than claiming 'cat' as the userspace consumer of a sysfs node or
> something.
> > > What we really need for new uapi is a meaningful top-to-bottom
> > > solution that gives us confidence that we have an appropriate and
> > > maintainable interface defined.
> >
> > https://spec.oneapi.io/level-
> zero/latest/sysman/api.html#_CPPv441ZES_EVENT_TYPE_FLAG_DEVICE_RES
> ET_REQUIRED is the consumer for this uapi.
> > The application can register for events
> ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED with Sysman and start
> listening to this event.
> > Then Sysman would send
> ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED event to application,
> incase sysman recieves uevent "RESET_FAILED=1" or "RESET_REQUIRED=1"
> or if Sysman detects a repair have occurred.
> >
> > Then after getting ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED event
> application could query zesDeviceGetState  to get the reason of RESET. And
> Sysman would tell the reason of reset as :
> > ZES_RESET_REASON_FLAG_WEDGED: Sysman would try to create a context
> and
> > if context creation fails due to EIO error, then device is determined
> > to be wedged and reset reason is declared as WEDGED
> >
> > And sysman triggers a SBR to reset the device.
> 
> yeap, this is already good by itself, even independent of that mode.
> 
> But we need to think about the possibility of removing the drm card above.
> Maybe it would be better to add the uevent at the pci level rather then the
> drm.
> And maybe with a general device_status= variable.

Sure. Sounds logical. Will be uploading new patch for using pcie kobj instead of drm and will be going ahead with 
DEVICE_STATUS="NEEDS_RESET" RESET_FAILED=gt_id. 
Thanks Joonas for the offline inputs.
> 
> +Joonas
> 
> >
> > BR
> > Himal
> > >
> > >
> > > Matt
> > >
> > > >
> > > > v2:
> > > > -Add NULL check for xe_gt_hw_engine return(Aravind) -Arrange
> > > > variables in Christmas tree order(Tejas) -Check
> > > > GUC_GSC_OTHER_CLASS(Tejas)
> > > >
> > > > v3:
> > > > - Rebase
> > > > - Remove notification for engine reset failure (Rodrigo)
> > > >
> > > > v4
> > > > - Rectify the comments in header file.
> > > >
> > > > Cc: Aravind Iddamsetty <aravind.iddamsetty at intel.com>
> > > > Cc: Tejas Upadhyay <tejas.upadhyay at intel.com>
> > > > Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > > > Reviewed-by: Badal Nilawar <badal.nilawar at intel.com>
> > > > Signed-off-by: Himal Prasad
> > > > Ghimiray<himal.prasad.ghimiray at intel.com>
> > > > Signed-off-by: Himal Prasad Ghimiray
> > > > <himal.prasad.ghimiray at intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/xe_gt.c | 18 ++++++++++++++++++
> > > > include/uapi/drm/xe_drm.h  |  8 ++++++++
> > > >  2 files changed, 26 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_gt.c
> > > > b/drivers/gpu/drm/xe/xe_gt.c index a21d44bfe9e8..1db4d610f2fd
> > > > 100644
> > > > --- a/drivers/gpu/drm/xe/xe_gt.c
> > > > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > > > @@ -8,6 +8,7 @@
> > > >  #include <linux/minmax.h>
> > > >
> > > >  #include <drm/drm_managed.h>
> > > > +#include <drm/xe_drm.h>
> > > >
> > > >  #include "regs/xe_gt_regs.h"
> > > >  #include "xe_bb.h"
> > > > @@ -500,6 +501,20 @@ static int do_gt_restart(struct xe_gt *gt)
> > > >  	return 0;
> > > >  }
> > > >
> > > > +static void xe_uevent_gt_reset_failure(struct xe_device *xe, u8 id) {
> > > > +	char *reset_event[5];
> > > > +
> > > > +	reset_event[0] = XE_RESET_FAILED_UEVENT "=1";
> > > > +	reset_event[1] = "RESET_ENABLED=1";
> > > > +	reset_event[2] = "RESET_UNIT=gt";
> > > > +	reset_event[3] = kasprintf(GFP_KERNEL, "RESET_ID=%d", id);
> > > > +	reset_event[4] = NULL;
> > > > +	kobject_uevent_env(&xe->drm.primary->kdev->kobj, KOBJ_CHANGE,
> > > > +reset_event);
> > > > +
> > > > +	kfree(reset_event[3]);
> > > > +}
> > > > +
> > > >  static int gt_reset(struct xe_gt *gt)  {
> > > >  	int err;
> > > > @@ -549,6 +564,9 @@ static int gt_reset(struct xe_gt *gt)
> > > >  	xe_device_mem_access_put(gt_to_xe(gt));
> > > >  	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
> > > >
> > > > +	/* Notify userspace about gt reset failure */
> > > > +	xe_uevent_gt_reset_failure(gt_to_xe(gt), gt->info.id);
> > > > +
> > > >  	return err;
> > > >  }
> > > >
> > > > diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> > > > index 347351a8f618..c28bb54812e5 100644
> > > > --- a/include/uapi/drm/xe_drm.h
> > > > +++ b/include/uapi/drm/xe_drm.h
> > > > @@ -16,6 +16,14 @@ extern "C" {
> > > >   * subject to backwards-compatibility constraints.
> > > >   */
> > > >
> > > > +/*
> > > > + * Uevents generated by xe on it's device node.
> > > > + *
> > > > + * XE_RESET_FAILED_UEVENT - Event is generated when attempt to
> > > > +reset
> > > > + * gt fails. The value supplied with the event is always 1.
> > > > + */
> > > > +#define XE_RESET_FAILED_UEVENT "RESET_FAILED"
> > > > +
> > > >  /**
> > > >   * struct xe_user_extension - Base class for defining a chain of
> extensions
> > > >   *
> > > > --
> > > > 2.25.1
> > > >
> > >
> > > --
> > > Matt Roper
> > > Graphics Software Engineer
> > > Linux GPU Platform Enablement
> > > Intel Corporation