[Intel-xe] [PATCH v9 3/3] drm/xe: Introduce fault injection for gt reset

Ghimiray, Himal Prasad himal.prasad.ghimiray at intel.com
Tue Aug 1 06:45:46 UTC 2023



> -----Original Message-----
> From: Vivekanandan, Balasubramani
> <balasubramani.vivekanandan at intel.com>
> Sent: 01 August 2023 11:32
> To: Ghimiray, Himal Prasad <himal.prasad.ghimiray at intel.com>; intel-
> xe at lists.freedesktop.org
> Cc: De Marchi, Lucas <lucas.demarchi at intel.com>; Vivi, Rodrigo
> <rodrigo.vivi at intel.com>
> Subject: Re: [Intel-xe] [PATCH v9 3/3] drm/xe: Introduce fault injection for gt
> reset
> 
> On 31.07.2023 20:54, Ghimiray, Himal Prasad wrote:
> > Hi Bala,
> >
> > > -----Original Message-----
> > > From: Vivekanandan, Balasubramani
> > > <balasubramani.vivekanandan at intel.com>
> > > Sent: 31 July 2023 20:00
> > > To: Ghimiray, Himal Prasad <himal.prasad.ghimiray at intel.com>; intel-
> > > xe at lists.freedesktop.org
> > > Cc: De Marchi, Lucas <lucas.demarchi at intel.com>; Vivi, Rodrigo
> > > <rodrigo.vivi at intel.com>
> > > Subject: Re: [Intel-xe] [PATCH v9 3/3] drm/xe: Introduce fault
> > > injection for gt reset
> > >
> > > On 27.07.2023 04:56, Himal Prasad Ghimiray wrote:
> > > > To trigger gt reset failure:
> > > >  echo 100 >
> > > > /sys/kernel/debug/dri/<cardX>/fail_gt_reset/probability
> > > >  echo 2 >  /sys/kernel/debug/dri/<cardX>/fail_gt_reset/times
> > > >
> > > > Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > > > Cc: Lucas De Marchi <lucas.demarchi at intel.com>
> > > >
> > > > Reviewed-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > > > Signed-off-by: Himal Prasad Ghimiray
> > > > <himal.prasad.ghimiray at intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/xe_debugfs.c | 10 ++++++++++
> > > >  drivers/gpu/drm/xe/xe_gt.c      |  8 +++++++-
> > > >  drivers/gpu/drm/xe/xe_gt.h      | 14 ++++++++++++++
> > > >  3 files changed, 31 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_debugfs.c
> > > > b/drivers/gpu/drm/xe/xe_debugfs.c index
> 047341d5689a..1fd016e6f7a0
> > > > 100644
> > > > --- a/drivers/gpu/drm/xe/xe_debugfs.c
> > > > +++ b/drivers/gpu/drm/xe/xe_debugfs.c
> > > > @@ -5,6 +5,7 @@
> > > >
> > > >  #include "xe_debugfs.h"
> > > >
> > > > +#include <linux/fault-inject.h>
> > > >  #include <linux/string_helpers.h>
> > > >
> > > >  #include <drm/drm_debugfs.h>
> > > > @@ -20,6 +21,10 @@
> > > >  #include "xe_vm.h"
> > > >  #endif
> > > >
> > > > +#ifdef CONFIG_FAULT_INJECTION
> > > > +DECLARE_FAULT_ATTR(gt_reset_failure);
> > > > +#endif
> > > > +
> > > >  static struct xe_device *node_to_xe(struct drm_info_node *node)  {
> > > >  	return to_xe_device(node->minor->dev); @@ -135,4 +140,9 @@
> > > void
> > > > xe_debugfs_register(struct xe_device *xe)
> > > >
> > > >  	for_each_gt(gt, xe, id)
> > > >  		xe_gt_debugfs_register(gt);
> > > > +
> > > > +#ifdef CONFIG_FAULT_INJECTION
> > > > +	fault_create_debugfs_attr("fail_gt_reset", root,
> > > > +&gt_reset_failure); #endif
> > > > +
> > > >  }
> > > > diff --git a/drivers/gpu/drm/xe/xe_gt.c
> > > > b/drivers/gpu/drm/xe/xe_gt.c index 5e70e486b27c..691e3baf97c9
> > > > 100644
> > > > --- a/drivers/gpu/drm/xe/xe_gt.c
> > > > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > > > @@ -525,6 +525,11 @@ static int gt_reset(struct xe_gt *gt)
> > > >
> > > >  	xe_gt_info(gt, "reset started\n");
> > > >
> > > > +	if (xe_fault_inject_gt_reset()) {
> > > > +		err = -ECANCELED;
> > > > +		goto err_fail;
> > > > +	}
> > > > +
> > > >  	xe_gt_sanitize(gt);
> > > >
> > > >  	xe_device_mem_access_get(gt_to_xe(gt));
> > > > @@ -563,6 +568,7 @@ static int gt_reset(struct xe_gt *gt)
> > > >  err_msg:
> > > >  	XE_WARN_ON(xe_uc_start(&gt->uc));
> > > >  	xe_device_mem_access_put(gt_to_xe(gt));
> > > > +err_fail:
> > > >  	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
> > > >
> > > >  	/* Notify userspace about gt reset failure */ @@ -584,7 +590,7
> > > > @@ void xe_gt_reset_async(struct xe_gt *gt)
> > > >  	xe_gt_info(gt, "trying reset\n");
> > > >
> > > >  	/* Don't do a reset while one is already in flight */
> > > > -	if (xe_uc_reset_prepare(&gt->uc))
> > > > +	if (!xe_fault_inject_gt_reset() && xe_uc_reset_prepare(&gt->uc))
> > >
> > > When `fail_gt_reset/probability` is set to a less than 100 value,
> > > then
> > > xe_fault_inject_gt_reset() will not always return true. So if the
> > > xe_fault_inject_gt_reset() returns differenet values when invoked
> > > from
> > > xe_gt_reset_async() and gt_reset(), we will have unexpected behaviour.
> > >
> > > We should avoid calling xe_fault_inject_gt_reset() more than once in
> > > a single reset cycle.
> > > We could exit immediately in xe_gt_reset_async() if fault injection is
> enabled.
> > Intention here is to cause next reset fail. Unless you make the probability
> 100 that is not guaranteed.
> > And no.of times more than 1 are perfectly fine as long as probability is 100.
> > Even in commit message it is mentioned clearly to use probability as 100.
> 
> Through the commit message it is understood that we need to supply 100
> and >1 values for probability and no of times debugfs entries. But it doesn't
> prevent anyone from trying other values. In such cases, it is fine for the fault
> injection to not work, but it should not cause any issues in the driver.
> I think if xe_fault_inject_gt_reset() returns false in
> xe_gt_reset_async() and true in gt_reset(), then we will end up in
> xe_uc_reset_prepare() invoked but no gt reset completely executed.
> Thought I haven't analyzed how it would impact driver, I think the driver
> would be left in some unexpected state. We should avoid this.
> 
> Wrong values for the debugfs entries should have no impact on the driver.
Wrong values in debugfs will always impact driver here.
1) If probability is not 100. 
You don’t know which reset will fail and which will pass, which is further indeterministic from software perspective.
So passing wrong value to probability will impact irrespective of times.
2) If probability is 100 and times = -1, all the resets will fail irrespective we want fault injection or not. 

So passing of wrong values is something we cant guarantee to work with in case of fault_injection.
Just to make code flow work with times = 1,  I can cache the state of xe_fault_inject_gt_reset() and use it In both places, 
but it doesn't help with passing of wrong values to debugfs at all.

Had discussion with architects  regarding this before, implementation:
Query: " On further exploration:  In fault injection mechanism the privilege user can make all the resets fail, whereas impact of https://patchwork.freedesktop.org/series/119784/ is limited to the  reset caused by this particular debugfs.
for example : if user sets attr: times = -1( meaning unlimited ) and attr probability as 100, all the trials to gt_reset will fall  due to fault injection.  This shouldn't be of concern as long as we are setting correct values."

Response was:
" Yeah, there should be no concern on user writing bad values to debugfs, it's mostly only sysfs we need to be careful with"

> 
> Regards,
> Bala
> 
> >
> > BR
> > Himal
> > >
> > > Regards,
> > > Bala
> > > >  		return;
> > > >
> > > >  	xe_gt_info(gt, "reset queued\n"); diff --git
> > > > a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h index
> > > > 7298653a73de..caded203a8a0 100644
> > > > --- a/drivers/gpu/drm/xe/xe_gt.h
> > > > +++ b/drivers/gpu/drm/xe/xe_gt.h
> > > > @@ -7,6 +7,7 @@
> > > >  #define _XE_GT_H_
> > > >
> > > >  #include <drm/drm_util.h>
> > > > +#include <linux/fault-inject.h>
> > > >
> > > >  #include "xe_device_types.h"
> > > >  #include "xe_hw_engine.h"
> > > > @@ -16,6 +17,19 @@
> > > >  		for_each_if(((hwe__) = (gt__)->hw_engines + (id__)) && \
> > > >  			  xe_hw_engine_is_valid((hwe__)))
> > > >
> > > > +#ifdef CONFIG_FAULT_INJECTION
> > > > +extern struct fault_attr gt_reset_failure; static inline bool
> > > > +xe_fault_inject_gt_reset(void) {
> > > > +	return should_fail(&gt_reset_failure, 1); } #else static inline
> > > > +bool
> > > > +xe_fault_inject_gt_reset(void) {
> > > > +	return false;
> > > > +}
> > > > +#endif
> > > > +
> > > >  struct xe_gt *xe_gt_alloc(struct xe_tile *tile);  int
> > > > xe_gt_init_early(struct xe_gt *gt);  int xe_gt_init(struct xe_gt
> > > > *gt);
> > > > --
> > > > 2.25.1
> > > >


More information about the Intel-xe mailing list