[Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset fails

Ghimiray, Himal Prasad himal.prasad.ghimiray at intel.com
Wed Jul 19 01:51:37 UTC 2023


Hi Matt,

> -----Original Message-----
> From: Roper, Matthew D <matthew.d.roper at intel.com>
> Sent: 19 July 2023 05:23
> To: Ghimiray, Himal Prasad <himal.prasad.ghimiray at intel.com>
> Cc: intel-xe at lists.freedesktop.org; Vivi, Rodrigo <rodrigo.vivi at intel.com>
> Subject: Re: [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset
> fails
> 
> On Tue, Jul 18, 2023 at 07:02:15PM +0530, Himal Prasad Ghimiray wrote:
> > Send uevent in case of gt reset failure. This intimation can be used
> > by userspace monitoring tool to do the device level reset/reboot when
> > GT reset fails. udevadm can be used to monitor the uevents.
> 
> This seems a bit questionable.  In theory a GT reset shouldn't fail; if it does it
> either means we've either got a major software bug or a really catastrophic
> hardware failure.  Letting the kernel driver just blindly move forward after
> reporting a uevent doesn't seem like a proper response to a huge problem
> like that.  Relying on userspace to pick up the pieces seems insufficient.
> 
> Right now the return code from gt_reset() is just ignored and no meaningful
> action is taken within the driver.  It seems like we should at least be putting
> the device into some kind of 'unusable' state (similar to i915's 'wedged') so
> that other parts of the driver know that everything is completely broken and
> that they should stop trying to use the device (or some subset of the device).
> We could also potentially attempt recovery ourselves by escalating to a
> DriverFLR.

Agreed. These feature needs to be there in the driver.

> 
> I'm not sure why any userspace would ever care about the specific condition
> of "gt reset failed" --- it seems like it would pretty much only care about the
> device being in one of three states and wouldn't care about the driver-
> internal details about how/why it's in one of
> those:
> 
>  * Everything is good; can use device as normal
>  * Device partially broken; it's no longer possible to use these
>    specific GTs, engines, etc., but other units should still function as
>    expected.
>  * Absolutely everything is broken and nothing works.  You _might_ be
>    able to unload the driver, issue a PCI FLR, and reload the driver,
>    but otherwise all hope is lost and the device cannot be used anymore.
> 
> Even if we decide that some uevent is appropriate here, this is new uapi and
> will need a real userspace consumer too.  The fact that we can watch any
> uevents raised by the device with udevadm doesn't seem like it would be
> sufficient for that purpose; that doesn't feel much different than claiming
> 'cat' as the userspace consumer of a sysfs node or something.
> What we really need for new uapi is a meaningful top-to-bottom solution
> that gives us confidence that we have an appropriate and maintainable
> interface defined.

https://spec.oneapi.io/level-zero/latest/sysman/api.html#_CPPv441ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED is the consumer for this uapi.
The application can register for events ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED with Sysman and start listening to this event.
Then Sysman would send ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED event to application, incase sysman recieves uevent "RESET_FAILED=1" or "RESET_REQUIRED=1" or if Sysman detects a repair have occurred.

Then after getting ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED event application could query zesDeviceGetState  to get the reason of RESET. And Sysman would tell the reason of reset as :
ZES_RESET_REASON_FLAG_WEDGED: Sysman would try to create a context and if context creation fails due to EIO error, then device is determined to be wedged and reset reason is declared as WEDGED

And sysman triggers a SBR to reset the device.

BR
Himal
> 
> 
> Matt
> 
> >
> > v2:
> > -Add NULL check for xe_gt_hw_engine return(Aravind) -Arrange variables
> > in Christmas tree order(Tejas) -Check GUC_GSC_OTHER_CLASS(Tejas)
> >
> > v3:
> > - Rebase
> > - Remove notification for engine reset failure (Rodrigo)
> >
> > v4
> > - Rectify the comments in header file.
> >
> > Cc: Aravind Iddamsetty <aravind.iddamsetty at intel.com>
> > Cc: Tejas Upadhyay <tejas.upadhyay at intel.com>
> > Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > Reviewed-by: Badal Nilawar <badal.nilawar at intel.com>
> > Signed-off-by: Himal Prasad Ghimiray<himal.prasad.ghimiray at intel.com>
> > Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray at intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_gt.c | 18 ++++++++++++++++++
> > include/uapi/drm/xe_drm.h  |  8 ++++++++
> >  2 files changed, 26 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > index a21d44bfe9e8..1db4d610f2fd 100644
> > --- a/drivers/gpu/drm/xe/xe_gt.c
> > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > @@ -8,6 +8,7 @@
> >  #include <linux/minmax.h>
> >
> >  #include <drm/drm_managed.h>
> > +#include <drm/xe_drm.h>
> >
> >  #include "regs/xe_gt_regs.h"
> >  #include "xe_bb.h"
> > @@ -500,6 +501,20 @@ static int do_gt_restart(struct xe_gt *gt)
> >  	return 0;
> >  }
> >
> > +static void xe_uevent_gt_reset_failure(struct xe_device *xe, u8 id) {
> > +	char *reset_event[5];
> > +
> > +	reset_event[0] = XE_RESET_FAILED_UEVENT "=1";
> > +	reset_event[1] = "RESET_ENABLED=1";
> > +	reset_event[2] = "RESET_UNIT=gt";
> > +	reset_event[3] = kasprintf(GFP_KERNEL, "RESET_ID=%d", id);
> > +	reset_event[4] = NULL;
> > +	kobject_uevent_env(&xe->drm.primary->kdev->kobj, KOBJ_CHANGE,
> > +reset_event);
> > +
> > +	kfree(reset_event[3]);
> > +}
> > +
> >  static int gt_reset(struct xe_gt *gt)  {
> >  	int err;
> > @@ -549,6 +564,9 @@ static int gt_reset(struct xe_gt *gt)
> >  	xe_device_mem_access_put(gt_to_xe(gt));
> >  	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
> >
> > +	/* Notify userspace about gt reset failure */
> > +	xe_uevent_gt_reset_failure(gt_to_xe(gt), gt->info.id);
> > +
> >  	return err;
> >  }
> >
> > diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> > index 347351a8f618..c28bb54812e5 100644
> > --- a/include/uapi/drm/xe_drm.h
> > +++ b/include/uapi/drm/xe_drm.h
> > @@ -16,6 +16,14 @@ extern "C" {
> >   * subject to backwards-compatibility constraints.
> >   */
> >
> > +/*
> > + * Uevents generated by xe on it's device node.
> > + *
> > + * XE_RESET_FAILED_UEVENT - Event is generated when attempt to reset
> > + * gt fails. The value supplied with the event is always 1.
> > + */
> > +#define XE_RESET_FAILED_UEVENT "RESET_FAILED"
> > +
> >  /**
> >   * struct xe_user_extension - Base class for defining a chain of extensions
> >   *
> > --
> > 2.25.1
> >
> 
> --
> Matt Roper
> Graphics Software Engineer
> Linux GPU Platform Enablement
> Intel Corporation


More information about the Intel-xe mailing list