[Intel-xe] [PATCH v2 1/1] drm/xe: Add a debugfs for faking gt reset failure.

Tue Jun 6 08:23:27 UTC 2023

On Tue, 6 Jun 2023 04:03:20 +0000
"Ghimiray, Himal Prasad" <himal.prasad.ghimiray at intel.com> wrote:

> > -----Original Message-----
> > From: Intel-xe <intel-xe-bounces at lists.freedesktop.org> On Behalf Of
> > Mauro Carvalho Chehab
> > Sent: 25 May 2023 11:36
> > To: Vivi, Rodrigo <rodrigo.vivi at intel.com>
> > Cc: mauro.chehab at kernel.org; intel-xe at lists.freedesktop.org
> > Subject: Re: [Intel-xe] [PATCH v2 1/1] drm/xe: Add a debugfs for faking gt
> > reset failure.
> > 
> > On Wed, 24 May 2023 16:50:04 +0000
> > "Vivi, Rodrigo" <rodrigo.vivi at intel.com> wrote:
> >   
> > > On Wed, 2023-05-24 at 12:43 -0400, Rodrigo Vivi wrote:
> > > On Wed, May 24, 2023 at 09:57:00PM +0530, Balasubramani Vivekanandan  
> > wrote:  
> > > On 24.05.2023 12:36, Himal Prasad Ghimiray wrote:
> > > In case of gt reset failure, KMD notifies userspace about failure via
> > > uevent. To validate this notification we need to ensure gt reset fails
> > > and there is no mechanism to cause failure from hardware.
> > > Hence added a debugfs which will cause fake reset failure.
> > >
> > > v1(Rodrigo)
> > > - Change the variable to fake_reset_failure_in_progress.
> > > - Drop usage of READ_ONCE and WRITE_ONCE.
> > > - Follow consistency for variable assignment. Either use
> > >   functions for all the assignments or don't use for any.
> > >
> > > Cc: Rodrigo Vivi
> > > <rodrigo.vivi at intel.com<mailto:rodrigo.vivi at intel.com>>
> > > Signed-off-by: Himal Prasad Ghimiray
> > > <himal.prasad.ghimiray at intel.com<mailto:himal.prasad.ghimiray at intel.co  
> > > m>>  
> > > ---
> > >  drivers/gpu/drm/xe/xe_gt.c         | 15 ++++++++++++++-
> > >  drivers/gpu/drm/xe/xe_gt_debugfs.c | 11 +++++++++++
> > >  drivers/gpu/drm/xe/xe_gt_types.h   |  1 +
> > >  3 files changed, 26 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > > index 80d42c7c7cfa..b3ac8b3ab455 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt.c
> > > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > > @@ -506,6 +506,9 @@ int xe_gt_init(struct xe_gt *gt)
> > >         int err;
> > >         int i;
> > >
> > > +       /*Fake reset failure should be disabled by default*/
> > > +       gt->reset.fake_reset_failure_in_progress = false;
> > > +
> > >         INIT_WORK(&gt->reset.worker, gt_reset_worker);
> > >
> > >         for (i = 0; i < XE_ENGINE_CLASS_MAX; ++i) { @@ -601,8 +604,18
> > > @@ static int gt_reset(struct xe_gt *gt)
> > >         xe_gt_info(gt, "reset started\n");
> > >
> > >         xe_gt_sanitize(gt);
> > > -
> > >         xe_device_mem_access_get(gt_to_xe(gt));
> > > +
> > > +       err = gt->reset.fake_reset_failure_in_progress;
> > > +       if (err) {
> > >
> > > This doesn't look the right way. It should be something like
> > >
> > > ```
> > > if (gt->reset.fake_reset_failure_in_progress) {
> > >
> > >         err = -ECANCELED; or whatever appropriate
> > >         ...
> > > ```
> > >
> > > +               xe_gt_info(gt, "Fake GT reset failure is in
> > > + progress\n");
> > > +
> > > +               /* Disable fake reset failure for next call */
> > > +               gt->reset.fake_reset_failure_in_progress = false;
> > > +
> > > +               goto err_msg;
> > > +       }
> > > +
> > >         err = xe_force_wake_get(gt_to_fw(gt), XE_FORCEWAKE_ALL);
> > >         if (err)
> > >                 goto err_msg;
> > > diff --git a/drivers/gpu/drm/xe/xe_gt_debugfs.c
> > > b/drivers/gpu/drm/xe/xe_gt_debugfs.c
> > > index 8bf441e850a0..13d2a974bc00 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt_debugfs.c
> > > +++ b/drivers/gpu/drm/xe/xe_gt_debugfs.c
> > > @@ -127,6 +127,16 @@ static int register_save_restore(struct seq_file *m,  
> > void *data)  
> > >         return 0;
> > >  }
> > >
> > > +static int fake_reset_failure(struct seq_file *m, void *data) {
> > > +       struct xe_gt *gt = node_to_gt(m->private);
> > > +
> > > +       gt->reset.fake_reset_failure_in_progress = true;
> > > +       xe_gt_reset_async(gt);
> > > +
> > > +       return 0;
> > > +}
> > > +
> > >  static const struct drm_info_list debugfs_list[] = {
> > >         {"hw_engines", hw_engines, 0},
> > >         {"force_reset", force_reset, 0}, @@ -135,6 +145,7 @@ static
> > > const struct drm_info_list debugfs_list[] = {
> > >         {"steering", steering, 0},
> > >         {"ggtt", ggtt, 0},
> > >         {"register-save-restore", register_save_restore, 0},
> > > +       {"fake_reset_failure", fake_reset_failure, 0},
> > >  };
> > >
> > >  void xe_gt_debugfs_register(struct xe_gt *gt) diff --git
> > > a/drivers/gpu/drm/xe/xe_gt_types.h  
> > b/drivers/gpu/drm/xe/xe_gt_types.h  
> > > index 7c47d67aa8be..746d65c4aacc 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_gt_types.h
> > > @@ -168,6 +168,7 @@ struct xe_gt {
> > >
> > >         /** @reset: state for GT resets */
> > >         struct {
> > > +               bool fake_reset_failure_in_progress;
> > >                 /**
> > >                  * @worker: work so GT resets can done async allowing to reset
> > >                  * code to safely flush all code paths
> > >
> > > @Rodrigo
> > > This feature is purely to support test the driver and not something
> > > which helps debugging. I am concerned whether such code should be part
> > > of production build. I am thinking if this feature or any such patches
> > > in future could be made part of a kernel config, which can be enabled
> > > in CI when the driver is subjected to testing.
> > >
> > > I like this idea of having things like this behind a kernel config,
> > > but we need to be careful to not simply break IGT or force IGT to
> > > always depend on this flag.  
> > 
> > Well, debugfs is actually something that shouldn't be used in production ;-)
> > 
> > Granted, read-only debugfs nodes that would be just printing the engine
> > states without affecting the driver functionality can be present on production
> > environments, as those could be used by sysadmins to report issues to
> > distribution vendors.
> > 
> > On other words, we may have two separate groups of debugfs entries:
> > 
> > - entries that won't be changing the driver or GPU state machine;
> > - entries that will be affecting the driver, like forcing resets
> >   forcing reset to fail, etc. are more dangerous and may deserve an
> >   special Kconfig entry.
> > 
> > On such case, I would be placing a scary message at the Kconfig.debug file,
> > like:
> > 
> > 	config DRM_XE_DANGEROUS_DEBUGFS
> > 		bool "Enable driver development debugfs entries
> > (DANGEROUS)"
> > 		depends on DRM_XE_DEBUG && DEBUGFS
> > 		default n
> > 
> > 		help
> > 
> > 		  Add special debugfs entries meant to test Xe driver.
> > 
> > 		  Never enable those in production as those may cause
> > 		  malfunctioning at the Xe driver.
> > 
> > 		  Set it to "N", except if you're debugging Xe driver.
> > 
> > I would also place those options on a separate debugfs list, e. g.:  
> Hi Mauro,
> Thanks for proposing on this. For the time being can we go ahead and introduce this debugfs in existing list ?
> Once the changes are in for separating debugfs list  it into danger_debugfs_list  and debugfs_list 
> we will move fake_reset_failure debugfs  to that list along with other applicable entries. 

Works for me.

Regards,
Mauro

> 
> BR
> Himal 
> > 
> > 	static const struct drm_info_list danger_debugfs_list[] = {
> > 		{"force_reset", force_reset, 0},
> > 		/*
> > 		 * Whatever other node that writing to it would cause
> > 		 * side effects at the driver
> > 		 */
> > 		{"fake_reset_failure", fake_reset_failure, 0},
> > 	};
> > 
> > In time, DRM_XE_USERPTR_INVAL_INJECT is also a dangerous option. Btw,
> > shouldn't it be enabled only via a debugfs entry too? If not, then perhaps we
> > should use a menuconfig where we'll be placing all those error-inject
> > features altogether.
> > 
> > Regards,
> > Mauro