[Intel-xe] [PATCH v2 1/1] drm/xe: Add a debugfs for faking gt reset failure.

Tue Jun 6 04:03:20 UTC 2023


> -----Original Message-----
> From: Intel-xe <intel-xe-bounces at lists.freedesktop.org> On Behalf Of
> Mauro Carvalho Chehab
> Sent: 25 May 2023 11:36
> To: Vivi, Rodrigo <rodrigo.vivi at intel.com>
> Cc: mauro.chehab at kernel.org; intel-xe at lists.freedesktop.org
> Subject: Re: [Intel-xe] [PATCH v2 1/1] drm/xe: Add a debugfs for faking gt
> reset failure.
> 
> On Wed, 24 May 2023 16:50:04 +0000
> "Vivi, Rodrigo" <rodrigo.vivi at intel.com> wrote:
> 
> > On Wed, 2023-05-24 at 12:43 -0400, Rodrigo Vivi wrote:
> > On Wed, May 24, 2023 at 09:57:00PM +0530, Balasubramani Vivekanandan
> wrote:
> > On 24.05.2023 12:36, Himal Prasad Ghimiray wrote:
> > In case of gt reset failure, KMD notifies userspace about failure via
> > uevent. To validate this notification we need to ensure gt reset fails
> > and there is no mechanism to cause failure from hardware.
> > Hence added a debugfs which will cause fake reset failure.
> >
> > v1(Rodrigo)
> > - Change the variable to fake_reset_failure_in_progress.
> > - Drop usage of READ_ONCE and WRITE_ONCE.
> > - Follow consistency for variable assignment. Either use
> >   functions for all the assignments or don't use for any.
> >
> > Cc: Rodrigo Vivi
> > <rodrigo.vivi at intel.com<mailto:rodrigo.vivi at intel.com>>
> > Signed-off-by: Himal Prasad Ghimiray
> > <himal.prasad.ghimiray at intel.com<mailto:himal.prasad.ghimiray at intel.co
> > m>>
> > ---
> >  drivers/gpu/drm/xe/xe_gt.c         | 15 ++++++++++++++-
> >  drivers/gpu/drm/xe/xe_gt_debugfs.c | 11 +++++++++++
> >  drivers/gpu/drm/xe/xe_gt_types.h   |  1 +
> >  3 files changed, 26 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > index 80d42c7c7cfa..b3ac8b3ab455 100644
> > --- a/drivers/gpu/drm/xe/xe_gt.c
> > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > @@ -506,6 +506,9 @@ int xe_gt_init(struct xe_gt *gt)
> >         int err;
> >         int i;
> >
> > +       /*Fake reset failure should be disabled by default*/
> > +       gt->reset.fake_reset_failure_in_progress = false;
> > +
> >         INIT_WORK(&gt->reset.worker, gt_reset_worker);
> >
> >         for (i = 0; i < XE_ENGINE_CLASS_MAX; ++i) { @@ -601,8 +604,18
> > @@ static int gt_reset(struct xe_gt *gt)
> >         xe_gt_info(gt, "reset started\n");
> >
> >         xe_gt_sanitize(gt);
> > -
> >         xe_device_mem_access_get(gt_to_xe(gt));
> > +
> > +       err = gt->reset.fake_reset_failure_in_progress;
> > +       if (err) {
> >
> > This doesn't look the right way. It should be something like
> >
> > ```
> > if (gt->reset.fake_reset_failure_in_progress) {
> >
> >         err = -ECANCELED; or whatever appropriate
> >         ...
> > ```
> >
> > +               xe_gt_info(gt, "Fake GT reset failure is in
> > + progress\n");
> > +
> > +               /* Disable fake reset failure for next call */
> > +               gt->reset.fake_reset_failure_in_progress = false;
> > +
> > +               goto err_msg;
> > +       }
> > +
> >         err = xe_force_wake_get(gt_to_fw(gt), XE_FORCEWAKE_ALL);
> >         if (err)
> >                 goto err_msg;
> > diff --git a/drivers/gpu/drm/xe/xe_gt_debugfs.c
> > b/drivers/gpu/drm/xe/xe_gt_debugfs.c
> > index 8bf441e850a0..13d2a974bc00 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_debugfs.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_debugfs.c
> > @@ -127,6 +127,16 @@ static int register_save_restore(struct seq_file *m,
> void *data)
> >         return 0;
> >  }
> >
> > +static int fake_reset_failure(struct seq_file *m, void *data) {
> > +       struct xe_gt *gt = node_to_gt(m->private);
> > +
> > +       gt->reset.fake_reset_failure_in_progress = true;
> > +       xe_gt_reset_async(gt);
> > +
> > +       return 0;
> > +}
> > +
> >  static const struct drm_info_list debugfs_list[] = {
> >         {"hw_engines", hw_engines, 0},
> >         {"force_reset", force_reset, 0}, @@ -135,6 +145,7 @@ static
> > const struct drm_info_list debugfs_list[] = {
> >         {"steering", steering, 0},
> >         {"ggtt", ggtt, 0},
> >         {"register-save-restore", register_save_restore, 0},
> > +       {"fake_reset_failure", fake_reset_failure, 0},
> >  };
> >
> >  void xe_gt_debugfs_register(struct xe_gt *gt) diff --git
> > a/drivers/gpu/drm/xe/xe_gt_types.h
> b/drivers/gpu/drm/xe/xe_gt_types.h
> > index 7c47d67aa8be..746d65c4aacc 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_types.h
> > +++ b/drivers/gpu/drm/xe/xe_gt_types.h
> > @@ -168,6 +168,7 @@ struct xe_gt {
> >
> >         /** @reset: state for GT resets */
> >         struct {
> > +               bool fake_reset_failure_in_progress;
> >                 /**
> >                  * @worker: work so GT resets can done async allowing to reset
> >                  * code to safely flush all code paths
> >
> > @Rodrigo
> > This feature is purely to support test the driver and not something
> > which helps debugging. I am concerned whether such code should be part
> > of production build. I am thinking if this feature or any such patches
> > in future could be made part of a kernel config, which can be enabled
> > in CI when the driver is subjected to testing.
> >
> > I like this idea of having things like this behind a kernel config,
> > but we need to be careful to not simply break IGT or force IGT to
> > always depend on this flag.
> 
> Well, debugfs is actually something that shouldn't be used in production ;-)
> 
> Granted, read-only debugfs nodes that would be just printing the engine
> states without affecting the driver functionality can be present on production
> environments, as those could be used by sysadmins to report issues to
> distribution vendors.
> 
> On other words, we may have two separate groups of debugfs entries:
> 
> - entries that won't be changing the driver or GPU state machine;
> - entries that will be affecting the driver, like forcing resets
>   forcing reset to fail, etc. are more dangerous and may deserve an
>   special Kconfig entry.
> 
> On such case, I would be placing a scary message at the Kconfig.debug file,
> like:
> 
> 	config DRM_XE_DANGEROUS_DEBUGFS
> 		bool "Enable driver development debugfs entries
> (DANGEROUS)"
> 		depends on DRM_XE_DEBUG && DEBUGFS
> 		default n
> 
> 		help
> 
> 		  Add special debugfs entries meant to test Xe driver.
> 
> 		  Never enable those in production as those may cause
> 		  malfunctioning at the Xe driver.
> 
> 		  Set it to "N", except if you're debugging Xe driver.
> 
> I would also place those options on a separate debugfs list, e. g.:
Hi Mauro,
Thanks for proposing on this. For the time being can we go ahead and introduce this debugfs in existing list ?
Once the changes are in for separating debugfs list  it into danger_debugfs_list  and debugfs_list 
we will move fake_reset_failure debugfs  to that list along with other applicable entries. 

BR
Himal 
> 
> 	static const struct drm_info_list danger_debugfs_list[] = {
> 		{"force_reset", force_reset, 0},
> 		/*
> 		 * Whatever other node that writing to it would cause
> 		 * side effects at the driver
> 		 */
> 		{"fake_reset_failure", fake_reset_failure, 0},
> 	};
> 
> In time, DRM_XE_USERPTR_INVAL_INJECT is also a dangerous option. Btw,
> shouldn't it be enabled only via a debugfs entry too? If not, then perhaps we
> should use a menuconfig where we'll be placing all those error-inject
> features altogether.
> 
> Regards,
> Mauro