[Intel-xe] [PATCH v9 3/3] drm/xe: Introduce fault injection for gt reset

Tue Aug 1 06:02:16 UTC 2023

On 31.07.2023 20:54, Ghimiray, Himal Prasad wrote:
> Hi Bala,
> 
> > -----Original Message-----
> > From: Vivekanandan, Balasubramani
> > <balasubramani.vivekanandan at intel.com>
> > Sent: 31 July 2023 20:00
> > To: Ghimiray, Himal Prasad <himal.prasad.ghimiray at intel.com>; intel-
> > xe at lists.freedesktop.org
> > Cc: De Marchi, Lucas <lucas.demarchi at intel.com>; Vivi, Rodrigo
> > <rodrigo.vivi at intel.com>
> > Subject: Re: [Intel-xe] [PATCH v9 3/3] drm/xe: Introduce fault injection for gt
> > reset
> > 
> > On 27.07.2023 04:56, Himal Prasad Ghimiray wrote:
> > > To trigger gt reset failure:
> > >  echo 100 >  /sys/kernel/debug/dri/<cardX>/fail_gt_reset/probability
> > >  echo 2 >  /sys/kernel/debug/dri/<cardX>/fail_gt_reset/times
> > >
> > > Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > > Cc: Lucas De Marchi <lucas.demarchi at intel.com>
> > >
> > > Reviewed-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > > Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray at intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_debugfs.c | 10 ++++++++++
> > >  drivers/gpu/drm/xe/xe_gt.c      |  8 +++++++-
> > >  drivers/gpu/drm/xe/xe_gt.h      | 14 ++++++++++++++
> > >  3 files changed, 31 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_debugfs.c
> > > b/drivers/gpu/drm/xe/xe_debugfs.c index 047341d5689a..1fd016e6f7a0
> > > 100644
> > > --- a/drivers/gpu/drm/xe/xe_debugfs.c
> > > +++ b/drivers/gpu/drm/xe/xe_debugfs.c
> > > @@ -5,6 +5,7 @@
> > >
> > >  #include "xe_debugfs.h"
> > >
> > > +#include <linux/fault-inject.h>
> > >  #include <linux/string_helpers.h>
> > >
> > >  #include <drm/drm_debugfs.h>
> > > @@ -20,6 +21,10 @@
> > >  #include "xe_vm.h"
> > >  #endif
> > >
> > > +#ifdef CONFIG_FAULT_INJECTION
> > > +DECLARE_FAULT_ATTR(gt_reset_failure);
> > > +#endif
> > > +
> > >  static struct xe_device *node_to_xe(struct drm_info_node *node)  {
> > >  	return to_xe_device(node->minor->dev); @@ -135,4 +140,9 @@
> > void
> > > xe_debugfs_register(struct xe_device *xe)
> > >
> > >  	for_each_gt(gt, xe, id)
> > >  		xe_gt_debugfs_register(gt);
> > > +
> > > +#ifdef CONFIG_FAULT_INJECTION
> > > +	fault_create_debugfs_attr("fail_gt_reset", root, &gt_reset_failure);
> > > +#endif
> > > +
> > >  }
> > > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > > index 5e70e486b27c..691e3baf97c9 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt.c
> > > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > > @@ -525,6 +525,11 @@ static int gt_reset(struct xe_gt *gt)
> > >
> > >  	xe_gt_info(gt, "reset started\n");
> > >
> > > +	if (xe_fault_inject_gt_reset()) {
> > > +		err = -ECANCELED;
> > > +		goto err_fail;
> > > +	}
> > > +
> > >  	xe_gt_sanitize(gt);
> > >
> > >  	xe_device_mem_access_get(gt_to_xe(gt));
> > > @@ -563,6 +568,7 @@ static int gt_reset(struct xe_gt *gt)
> > >  err_msg:
> > >  	XE_WARN_ON(xe_uc_start(&gt->uc));
> > >  	xe_device_mem_access_put(gt_to_xe(gt));
> > > +err_fail:
> > >  	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
> > >
> > >  	/* Notify userspace about gt reset failure */ @@ -584,7 +590,7 @@
> > > void xe_gt_reset_async(struct xe_gt *gt)
> > >  	xe_gt_info(gt, "trying reset\n");
> > >
> > >  	/* Don't do a reset while one is already in flight */
> > > -	if (xe_uc_reset_prepare(&gt->uc))
> > > +	if (!xe_fault_inject_gt_reset() && xe_uc_reset_prepare(&gt->uc))
> > 
> > When `fail_gt_reset/probability` is set to a less than 100 value, then
> > xe_fault_inject_gt_reset() will not always return true. So if the
> > xe_fault_inject_gt_reset() returns differenet values when invoked from
> > xe_gt_reset_async() and gt_reset(), we will have unexpected behaviour.
> > 
> > We should avoid calling xe_fault_inject_gt_reset() more than once in a single
> > reset cycle.
> > We could exit immediately in xe_gt_reset_async() if fault injection is enabled.
> Intention here is to cause next reset fail. Unless you make the probability 100 that is not guaranteed. 
> And no.of times more than 1 are perfectly fine as long as probability is 100. 
> Even in commit message it is mentioned clearly to use probability as 100.

Through the commit message it is understood that we need to supply 100 and >1
values for probability and no of times debugfs entries. But it doesn't
prevent anyone from trying other values. In such cases, it is fine for
the fault injection to not work, but it should not cause any issues in
the driver.
I think if xe_fault_inject_gt_reset() returns false in
xe_gt_reset_async() and true in gt_reset(), then we will end up in
xe_uc_reset_prepare() invoked but no gt reset completely executed.
Thought I haven't analyzed how it would impact driver, I think the
driver would be left in some unexpected state. We should avoid this.

Wrong values for the debugfs entries should have no impact on the
driver.

Regards,
Bala

> 
> BR
> Himal 
> > 
> > Regards,
> > Bala
> > >  		return;
> > >
> > >  	xe_gt_info(gt, "reset queued\n");
> > > diff --git a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h
> > > index 7298653a73de..caded203a8a0 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt.h
> > > +++ b/drivers/gpu/drm/xe/xe_gt.h
> > > @@ -7,6 +7,7 @@
> > >  #define _XE_GT_H_
> > >
> > >  #include <drm/drm_util.h>
> > > +#include <linux/fault-inject.h>
> > >
> > >  #include "xe_device_types.h"
> > >  #include "xe_hw_engine.h"
> > > @@ -16,6 +17,19 @@
> > >  		for_each_if(((hwe__) = (gt__)->hw_engines + (id__)) && \
> > >  			  xe_hw_engine_is_valid((hwe__)))
> > >
> > > +#ifdef CONFIG_FAULT_INJECTION
> > > +extern struct fault_attr gt_reset_failure; static inline bool
> > > +xe_fault_inject_gt_reset(void) {
> > > +	return should_fail(&gt_reset_failure, 1); } #else static inline bool
> > > +xe_fault_inject_gt_reset(void) {
> > > +	return false;
> > > +}
> > > +#endif
> > > +
> > >  struct xe_gt *xe_gt_alloc(struct xe_tile *tile);  int
> > > xe_gt_init_early(struct xe_gt *gt);  int xe_gt_init(struct xe_gt *gt);
> > > --
> > > 2.25.1
> > >