[Intel-xe] [PATCH v2 1/1] drm/xe: Add a debugfs for faking gt reset failure.
Mauro Carvalho Chehab
mauro.chehab at linux.intel.com
Tue Jun 6 08:23:27 UTC 2023
On Tue, 6 Jun 2023 04:03:20 +0000
"Ghimiray, Himal Prasad" <himal.prasad.ghimiray at intel.com> wrote:
> > -----Original Message-----
> > From: Intel-xe <intel-xe-bounces at lists.freedesktop.org> On Behalf Of
> > Mauro Carvalho Chehab
> > Sent: 25 May 2023 11:36
> > To: Vivi, Rodrigo <rodrigo.vivi at intel.com>
> > Cc: mauro.chehab at kernel.org; intel-xe at lists.freedesktop.org
> > Subject: Re: [Intel-xe] [PATCH v2 1/1] drm/xe: Add a debugfs for faking gt
> > reset failure.
> >
> > On Wed, 24 May 2023 16:50:04 +0000
> > "Vivi, Rodrigo" <rodrigo.vivi at intel.com> wrote:
> >
> > > On Wed, 2023-05-24 at 12:43 -0400, Rodrigo Vivi wrote:
> > > On Wed, May 24, 2023 at 09:57:00PM +0530, Balasubramani Vivekanandan
> > wrote:
> > > On 24.05.2023 12:36, Himal Prasad Ghimiray wrote:
> > > In case of gt reset failure, KMD notifies userspace about failure via
> > > uevent. To validate this notification we need to ensure gt reset fails
> > > and there is no mechanism to cause failure from hardware.
> > > Hence added a debugfs which will cause fake reset failure.
> > >
> > > v1(Rodrigo)
> > > - Change the variable to fake_reset_failure_in_progress.
> > > - Drop usage of READ_ONCE and WRITE_ONCE.
> > > - Follow consistency for variable assignment. Either use
> > > functions for all the assignments or don't use for any.
> > >
> > > Cc: Rodrigo Vivi
> > > <rodrigo.vivi at intel.com<mailto:rodrigo.vivi at intel.com>>
> > > Signed-off-by: Himal Prasad Ghimiray
> > > <himal.prasad.ghimiray at intel.com<mailto:himal.prasad.ghimiray at intel.co
> > > m>>
> > > ---
> > > drivers/gpu/drm/xe/xe_gt.c | 15 ++++++++++++++-
> > > drivers/gpu/drm/xe/xe_gt_debugfs.c | 11 +++++++++++
> > > drivers/gpu/drm/xe/xe_gt_types.h | 1 +
> > > 3 files changed, 26 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > > index 80d42c7c7cfa..b3ac8b3ab455 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt.c
> > > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > > @@ -506,6 +506,9 @@ int xe_gt_init(struct xe_gt *gt)
> > > int err;
> > > int i;
> > >
> > > + /*Fake reset failure should be disabled by default*/
> > > + gt->reset.fake_reset_failure_in_progress = false;
> > > +
> > > INIT_WORK(>->reset.worker, gt_reset_worker);
> > >
> > > for (i = 0; i < XE_ENGINE_CLASS_MAX; ++i) { @@ -601,8 +604,18
> > > @@ static int gt_reset(struct xe_gt *gt)
> > > xe_gt_info(gt, "reset started\n");
> > >
> > > xe_gt_sanitize(gt);
> > > -
> > > xe_device_mem_access_get(gt_to_xe(gt));
> > > +
> > > + err = gt->reset.fake_reset_failure_in_progress;
> > > + if (err) {
> > >
> > > This doesn't look the right way. It should be something like
> > >
> > > ```
> > > if (gt->reset.fake_reset_failure_in_progress) {
> > >
> > > err = -ECANCELED; or whatever appropriate
> > > ...
> > > ```
> > >
> > > + xe_gt_info(gt, "Fake GT reset failure is in
> > > + progress\n");
> > > +
> > > + /* Disable fake reset failure for next call */
> > > + gt->reset.fake_reset_failure_in_progress = false;
> > > +
> > > + goto err_msg;
> > > + }
> > > +
> > > err = xe_force_wake_get(gt_to_fw(gt), XE_FORCEWAKE_ALL);
> > > if (err)
> > > goto err_msg;
> > > diff --git a/drivers/gpu/drm/xe/xe_gt_debugfs.c
> > > b/drivers/gpu/drm/xe/xe_gt_debugfs.c
> > > index 8bf441e850a0..13d2a974bc00 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt_debugfs.c
> > > +++ b/drivers/gpu/drm/xe/xe_gt_debugfs.c
> > > @@ -127,6 +127,16 @@ static int register_save_restore(struct seq_file *m,
> > void *data)
> > > return 0;
> > > }
> > >
> > > +static int fake_reset_failure(struct seq_file *m, void *data) {
> > > + struct xe_gt *gt = node_to_gt(m->private);
> > > +
> > > + gt->reset.fake_reset_failure_in_progress = true;
> > > + xe_gt_reset_async(gt);
> > > +
> > > + return 0;
> > > +}
> > > +
> > > static const struct drm_info_list debugfs_list[] = {
> > > {"hw_engines", hw_engines, 0},
> > > {"force_reset", force_reset, 0}, @@ -135,6 +145,7 @@ static
> > > const struct drm_info_list debugfs_list[] = {
> > > {"steering", steering, 0},
> > > {"ggtt", ggtt, 0},
> > > {"register-save-restore", register_save_restore, 0},
> > > + {"fake_reset_failure", fake_reset_failure, 0},
> > > };
> > >
> > > void xe_gt_debugfs_register(struct xe_gt *gt) diff --git
> > > a/drivers/gpu/drm/xe/xe_gt_types.h
> > b/drivers/gpu/drm/xe/xe_gt_types.h
> > > index 7c47d67aa8be..746d65c4aacc 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_gt_types.h
> > > @@ -168,6 +168,7 @@ struct xe_gt {
> > >
> > > /** @reset: state for GT resets */
> > > struct {
> > > + bool fake_reset_failure_in_progress;
> > > /**
> > > * @worker: work so GT resets can done async allowing to reset
> > > * code to safely flush all code paths
> > >
> > > @Rodrigo
> > > This feature is purely to support test the driver and not something
> > > which helps debugging. I am concerned whether such code should be part
> > > of production build. I am thinking if this feature or any such patches
> > > in future could be made part of a kernel config, which can be enabled
> > > in CI when the driver is subjected to testing.
> > >
> > > I like this idea of having things like this behind a kernel config,
> > > but we need to be careful to not simply break IGT or force IGT to
> > > always depend on this flag.
> >
> > Well, debugfs is actually something that shouldn't be used in production ;-)
> >
> > Granted, read-only debugfs nodes that would be just printing the engine
> > states without affecting the driver functionality can be present on production
> > environments, as those could be used by sysadmins to report issues to
> > distribution vendors.
> >
> > On other words, we may have two separate groups of debugfs entries:
> >
> > - entries that won't be changing the driver or GPU state machine;
> > - entries that will be affecting the driver, like forcing resets
> > forcing reset to fail, etc. are more dangerous and may deserve an
> > special Kconfig entry.
> >
> > On such case, I would be placing a scary message at the Kconfig.debug file,
> > like:
> >
> > config DRM_XE_DANGEROUS_DEBUGFS
> > bool "Enable driver development debugfs entries
> > (DANGEROUS)"
> > depends on DRM_XE_DEBUG && DEBUGFS
> > default n
> >
> > help
> >
> > Add special debugfs entries meant to test Xe driver.
> >
> > Never enable those in production as those may cause
> > malfunctioning at the Xe driver.
> >
> > Set it to "N", except if you're debugging Xe driver.
> >
> > I would also place those options on a separate debugfs list, e. g.:
> Hi Mauro,
> Thanks for proposing on this. For the time being can we go ahead and introduce this debugfs in existing list ?
> Once the changes are in for separating debugfs list it into danger_debugfs_list and debugfs_list
> we will move fake_reset_failure debugfs to that list along with other applicable entries.
Works for me.
Regards,
Mauro
>
> BR
> Himal
> >
> > static const struct drm_info_list danger_debugfs_list[] = {
> > {"force_reset", force_reset, 0},
> > /*
> > * Whatever other node that writing to it would cause
> > * side effects at the driver
> > */
> > {"fake_reset_failure", fake_reset_failure, 0},
> > };
> >
> > In time, DRM_XE_USERPTR_INVAL_INJECT is also a dangerous option. Btw,
> > shouldn't it be enabled only via a debugfs entry too? If not, then perhaps we
> > should use a menuconfig where we'll be placing all those error-inject
> > features altogether.
> >
> > Regards,
> > Mauro
More information about the Intel-xe
mailing list