[Intel-xe] [PATCH v9 3/3] drm/xe: Introduce fault injection for gt reset

Tue Aug 1 08:04:52 UTC 2023


> -----Original Message-----
> From: Vivekanandan, Balasubramani
> <balasubramani.vivekanandan at intel.com>
> Sent: 01 August 2023 13:33
> To: Ghimiray, Himal Prasad <himal.prasad.ghimiray at intel.com>; intel-
> xe at lists.freedesktop.org
> Cc: De Marchi, Lucas <lucas.demarchi at intel.com>; Vivi, Rodrigo
> <rodrigo.vivi at intel.com>
> Subject: Re: [Intel-xe] [PATCH v9 3/3] drm/xe: Introduce fault injection for gt
> reset
> 
> On 01.08.2023 12:15, Ghimiray, Himal Prasad wrote:
> >
> >
> > > -----Original Message-----
> > > From: Vivekanandan, Balasubramani
> > > <balasubramani.vivekanandan at intel.com>
> > > Sent: 01 August 2023 11:32
> > > To: Ghimiray, Himal Prasad <himal.prasad.ghimiray at intel.com>; intel-
> > > xe at lists.freedesktop.org
> > > Cc: De Marchi, Lucas <lucas.demarchi at intel.com>; Vivi, Rodrigo
> > > <rodrigo.vivi at intel.com>
> > > Subject: Re: [Intel-xe] [PATCH v9 3/3] drm/xe: Introduce fault
> > > injection for gt reset
> > >
> > > On 31.07.2023 20:54, Ghimiray, Himal Prasad wrote:
> > > > Hi Bala,
> > > >
> > > > > -----Original Message-----
> > > > > From: Vivekanandan, Balasubramani
> > > > > <balasubramani.vivekanandan at intel.com>
> > > > > Sent: 31 July 2023 20:00
> > > > > To: Ghimiray, Himal Prasad <himal.prasad.ghimiray at intel.com>;
> > > > > intel- xe at lists.freedesktop.org
> > > > > Cc: De Marchi, Lucas <lucas.demarchi at intel.com>; Vivi, Rodrigo
> > > > > <rodrigo.vivi at intel.com>
> > > > > Subject: Re: [Intel-xe] [PATCH v9 3/3] drm/xe: Introduce fault
> > > > > injection for gt reset
> > > > >
> > > > > On 27.07.2023 04:56, Himal Prasad Ghimiray wrote:
> > > > > > To trigger gt reset failure:
> > > > > >  echo 100 >
> > > > > > /sys/kernel/debug/dri/<cardX>/fail_gt_reset/probability
> > > > > >  echo 2 >  /sys/kernel/debug/dri/<cardX>/fail_gt_reset/times
> > > > > >
> > > > > > Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > > > > > Cc: Lucas De Marchi <lucas.demarchi at intel.com>
> > > > > >
> > > > > > Reviewed-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > > > > > Signed-off-by: Himal Prasad Ghimiray
> > > > > > <himal.prasad.ghimiray at intel.com>
> > > > > > ---
> > > > > >  drivers/gpu/drm/xe/xe_debugfs.c | 10 ++++++++++
> > > > > >  drivers/gpu/drm/xe/xe_gt.c      |  8 +++++++-
> > > > > >  drivers/gpu/drm/xe/xe_gt.h      | 14 ++++++++++++++
> > > > > >  3 files changed, 31 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_debugfs.c
> > > > > > b/drivers/gpu/drm/xe/xe_debugfs.c index
> > > 047341d5689a..1fd016e6f7a0
> > > > > > 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_debugfs.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_debugfs.c
> > > > > > @@ -5,6 +5,7 @@
> > > > > >
> > > > > >  #include "xe_debugfs.h"
> > > > > >
> > > > > > +#include <linux/fault-inject.h>
> > > > > >  #include <linux/string_helpers.h>
> > > > > >
> > > > > >  #include <drm/drm_debugfs.h>
> > > > > > @@ -20,6 +21,10 @@
> > > > > >  #include "xe_vm.h"
> > > > > >  #endif
> > > > > >
> > > > > > +#ifdef CONFIG_FAULT_INJECTION
> > > > > > +DECLARE_FAULT_ATTR(gt_reset_failure);
> > > > > > +#endif
> > > > > > +
> > > > > >  static struct xe_device *node_to_xe(struct drm_info_node *node)
> {
> > > > > >  	return to_xe_device(node->minor->dev); @@ -135,4 +140,9
> @@
> > > > > void
> > > > > > xe_debugfs_register(struct xe_device *xe)
> > > > > >
> > > > > >  	for_each_gt(gt, xe, id)
> > > > > >  		xe_gt_debugfs_register(gt);
> > > > > > +
> > > > > > +#ifdef CONFIG_FAULT_INJECTION
> > > > > > +	fault_create_debugfs_attr("fail_gt_reset", root,
> > > > > > +&gt_reset_failure); #endif
> > > > > > +
> > > > > >  }
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_gt.c
> > > > > > b/drivers/gpu/drm/xe/xe_gt.c index 5e70e486b27c..691e3baf97c9
> > > > > > 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_gt.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > > > > > @@ -525,6 +525,11 @@ static int gt_reset(struct xe_gt *gt)
> > > > > >
> > > > > >  	xe_gt_info(gt, "reset started\n");
> > > > > >
> > > > > > +	if (xe_fault_inject_gt_reset()) {
> > > > > > +		err = -ECANCELED;
> > > > > > +		goto err_fail;
> > > > > > +	}
> > > > > > +
> > > > > >  	xe_gt_sanitize(gt);
> > > > > >
> > > > > >  	xe_device_mem_access_get(gt_to_xe(gt));
> > > > > > @@ -563,6 +568,7 @@ static int gt_reset(struct xe_gt *gt)
> > > > > >  err_msg:
> > > > > >  	XE_WARN_ON(xe_uc_start(&gt->uc));
> > > > > >  	xe_device_mem_access_put(gt_to_xe(gt));
> > > > > > +err_fail:
> > > > > >  	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
> > > > > >
> > > > > >  	/* Notify userspace about gt reset failure */ @@ -584,7
> > > > > > +590,7 @@ void xe_gt_reset_async(struct xe_gt *gt)
> > > > > >  	xe_gt_info(gt, "trying reset\n");
> > > > > >
> > > > > >  	/* Don't do a reset while one is already in flight */
> > > > > > -	if (xe_uc_reset_prepare(&gt->uc))
> > > > > > +	if (!xe_fault_inject_gt_reset() &&
> > > > > > +xe_uc_reset_prepare(&gt->uc))
> > > > >
> > > > > When `fail_gt_reset/probability` is set to a less than 100
> > > > > value, then
> > > > > xe_fault_inject_gt_reset() will not always return true. So if
> > > > > the
> > > > > xe_fault_inject_gt_reset() returns differenet values when
> > > > > invoked from
> > > > > xe_gt_reset_async() and gt_reset(), we will have unexpected
> behaviour.
> > > > >
> > > > > We should avoid calling xe_fault_inject_gt_reset() more than
> > > > > once in a single reset cycle.
> > > > > We could exit immediately in xe_gt_reset_async() if fault
> > > > > injection is
> > > enabled.
> > > > Intention here is to cause next reset fail. Unless you make the
> > > > probability
> > > 100 that is not guaranteed.
> > > > And no.of times more than 1 are perfectly fine as long as probability is
> 100.
> > > > Even in commit message it is mentioned clearly to use probability as
> 100.
> > >
> > > Through the commit message it is understood that we need to supply
> > > 100 and >1 values for probability and no of times debugfs entries.
> > > But it doesn't prevent anyone from trying other values. In such
> > > cases, it is fine for the fault injection to not work, but it should not cause
> any issues in the driver.
> > > I think if xe_fault_inject_gt_reset() returns false in
> > > xe_gt_reset_async() and true in gt_reset(), then we will end up in
> > > xe_uc_reset_prepare() invoked but no gt reset completely executed.
> > > Thought I haven't analyzed how it would impact driver, I think the
> > > driver would be left in some unexpected state. We should avoid this.
> > >
> > > Wrong values for the debugfs entries should have no impact on the
> driver.
> > Wrong values in debugfs will always impact driver here.
> > 1) If probability is not 100.
> > You don’t know which reset will fail and which will pass, which is further
> indeterministic from software perspective.
> > So passing wrong value to probability will impact irrespective of times.
> > 2) If probability is 100 and times = -1, all the resets will fail irrespective we
> want fault injection or not.
> >
> > So passing of wrong values is something we cant guarantee to work with in
> case of fault_injection.
> > Just to make code flow work with times = 1,  I can cache the state of
> > xe_fault_inject_gt_reset() and use it In both places, but it doesn't help
> with passing of wrong values to debugfs at all.
> >
> > Had discussion with architects  regarding this before, implementation:
> > Query: " On further exploration:  In fault injection mechanism the privilege
> user can make all the resets fail, whereas impact of
> https://patchwork.freedesktop.org/series/119784/ is limited to the  reset
> caused by this particular debugfs.
> > for example : if user sets attr: times = -1( meaning unlimited ) and attr
> probability as 100, all the trials to gt_reset will fall  due to fault injection.  This
> shouldn't be of concern as long as we are setting correct values."
> >
> > Response was:
> > " Yeah, there should be no concern on user writing bad values to debugfs,
> it's mostly only sysfs we need to be careful with"
> 
> Can we exit from xe_gt_reset_async() itself when fault injection is true, and
> not proceed with gt_reset()?
No. We need gt_reset to fail. That is the requirement from fault_injection.
> 
> Regards,
> Bala
> 
> >
> > >
> > > Regards,
> > > Bala
> > >
> > > >
> > > > BR
> > > > Himal
> > > > >
> > > > > Regards,
> > > > > Bala
> > > > > >  		return;
> > > > > >
> > > > > >  	xe_gt_info(gt, "reset queued\n"); diff --git
> > > > > > a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h
> > > > > > index
> > > > > > 7298653a73de..caded203a8a0 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_gt.h
> > > > > > +++ b/drivers/gpu/drm/xe/xe_gt.h
> > > > > > @@ -7,6 +7,7 @@
> > > > > >  #define _XE_GT_H_
> > > > > >
> > > > > >  #include <drm/drm_util.h>
> > > > > > +#include <linux/fault-inject.h>
> > > > > >
> > > > > >  #include "xe_device_types.h"
> > > > > >  #include "xe_hw_engine.h"
> > > > > > @@ -16,6 +17,19 @@
> > > > > >  		for_each_if(((hwe__) = (gt__)->hw_engines +
> (id__)) && \
> > > > > >  			  xe_hw_engine_is_valid((hwe__)))
> > > > > >
> > > > > > +#ifdef CONFIG_FAULT_INJECTION extern struct fault_attr
> > > > > > +gt_reset_failure; static inline bool
> > > > > > +xe_fault_inject_gt_reset(void) {
> > > > > > +	return should_fail(&gt_reset_failure, 1); } #else static
> > > > > > +inline bool
> > > > > > +xe_fault_inject_gt_reset(void) {
> > > > > > +	return false;
> > > > > > +}
> > > > > > +#endif
> > > > > > +
> > > > > >  struct xe_gt *xe_gt_alloc(struct xe_tile *tile);  int
> > > > > > xe_gt_init_early(struct xe_gt *gt);  int xe_gt_init(struct
> > > > > > xe_gt *gt);
> > > > > > --
> > > > > > 2.25.1
> > > > > >