[Intel-xe] [PATCH v2 1/1] drm/xe: Add a debugfs for faking gt reset failure.
Ghimiray, Himal Prasad
himal.prasad.ghimiray at intel.com
Tue Jun 6 04:03:20 UTC 2023
> -----Original Message-----
> From: Intel-xe <intel-xe-bounces at lists.freedesktop.org> On Behalf Of
> Mauro Carvalho Chehab
> Sent: 25 May 2023 11:36
> To: Vivi, Rodrigo <rodrigo.vivi at intel.com>
> Cc: mauro.chehab at kernel.org; intel-xe at lists.freedesktop.org
> Subject: Re: [Intel-xe] [PATCH v2 1/1] drm/xe: Add a debugfs for faking gt
> reset failure.
>
> On Wed, 24 May 2023 16:50:04 +0000
> "Vivi, Rodrigo" <rodrigo.vivi at intel.com> wrote:
>
> > On Wed, 2023-05-24 at 12:43 -0400, Rodrigo Vivi wrote:
> > On Wed, May 24, 2023 at 09:57:00PM +0530, Balasubramani Vivekanandan
> wrote:
> > On 24.05.2023 12:36, Himal Prasad Ghimiray wrote:
> > In case of gt reset failure, KMD notifies userspace about failure via
> > uevent. To validate this notification we need to ensure gt reset fails
> > and there is no mechanism to cause failure from hardware.
> > Hence added a debugfs which will cause fake reset failure.
> >
> > v1(Rodrigo)
> > - Change the variable to fake_reset_failure_in_progress.
> > - Drop usage of READ_ONCE and WRITE_ONCE.
> > - Follow consistency for variable assignment. Either use
> > functions for all the assignments or don't use for any.
> >
> > Cc: Rodrigo Vivi
> > <rodrigo.vivi at intel.com<mailto:rodrigo.vivi at intel.com>>
> > Signed-off-by: Himal Prasad Ghimiray
> > <himal.prasad.ghimiray at intel.com<mailto:himal.prasad.ghimiray at intel.co
> > m>>
> > ---
> > drivers/gpu/drm/xe/xe_gt.c | 15 ++++++++++++++-
> > drivers/gpu/drm/xe/xe_gt_debugfs.c | 11 +++++++++++
> > drivers/gpu/drm/xe/xe_gt_types.h | 1 +
> > 3 files changed, 26 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > index 80d42c7c7cfa..b3ac8b3ab455 100644
> > --- a/drivers/gpu/drm/xe/xe_gt.c
> > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > @@ -506,6 +506,9 @@ int xe_gt_init(struct xe_gt *gt)
> > int err;
> > int i;
> >
> > + /*Fake reset failure should be disabled by default*/
> > + gt->reset.fake_reset_failure_in_progress = false;
> > +
> > INIT_WORK(>->reset.worker, gt_reset_worker);
> >
> > for (i = 0; i < XE_ENGINE_CLASS_MAX; ++i) { @@ -601,8 +604,18
> > @@ static int gt_reset(struct xe_gt *gt)
> > xe_gt_info(gt, "reset started\n");
> >
> > xe_gt_sanitize(gt);
> > -
> > xe_device_mem_access_get(gt_to_xe(gt));
> > +
> > + err = gt->reset.fake_reset_failure_in_progress;
> > + if (err) {
> >
> > This doesn't look the right way. It should be something like
> >
> > ```
> > if (gt->reset.fake_reset_failure_in_progress) {
> >
> > err = -ECANCELED; or whatever appropriate
> > ...
> > ```
> >
> > + xe_gt_info(gt, "Fake GT reset failure is in
> > + progress\n");
> > +
> > + /* Disable fake reset failure for next call */
> > + gt->reset.fake_reset_failure_in_progress = false;
> > +
> > + goto err_msg;
> > + }
> > +
> > err = xe_force_wake_get(gt_to_fw(gt), XE_FORCEWAKE_ALL);
> > if (err)
> > goto err_msg;
> > diff --git a/drivers/gpu/drm/xe/xe_gt_debugfs.c
> > b/drivers/gpu/drm/xe/xe_gt_debugfs.c
> > index 8bf441e850a0..13d2a974bc00 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_debugfs.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_debugfs.c
> > @@ -127,6 +127,16 @@ static int register_save_restore(struct seq_file *m,
> void *data)
> > return 0;
> > }
> >
> > +static int fake_reset_failure(struct seq_file *m, void *data) {
> > + struct xe_gt *gt = node_to_gt(m->private);
> > +
> > + gt->reset.fake_reset_failure_in_progress = true;
> > + xe_gt_reset_async(gt);
> > +
> > + return 0;
> > +}
> > +
> > static const struct drm_info_list debugfs_list[] = {
> > {"hw_engines", hw_engines, 0},
> > {"force_reset", force_reset, 0}, @@ -135,6 +145,7 @@ static
> > const struct drm_info_list debugfs_list[] = {
> > {"steering", steering, 0},
> > {"ggtt", ggtt, 0},
> > {"register-save-restore", register_save_restore, 0},
> > + {"fake_reset_failure", fake_reset_failure, 0},
> > };
> >
> > void xe_gt_debugfs_register(struct xe_gt *gt) diff --git
> > a/drivers/gpu/drm/xe/xe_gt_types.h
> b/drivers/gpu/drm/xe/xe_gt_types.h
> > index 7c47d67aa8be..746d65c4aacc 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_types.h
> > +++ b/drivers/gpu/drm/xe/xe_gt_types.h
> > @@ -168,6 +168,7 @@ struct xe_gt {
> >
> > /** @reset: state for GT resets */
> > struct {
> > + bool fake_reset_failure_in_progress;
> > /**
> > * @worker: work so GT resets can done async allowing to reset
> > * code to safely flush all code paths
> >
> > @Rodrigo
> > This feature is purely to support test the driver and not something
> > which helps debugging. I am concerned whether such code should be part
> > of production build. I am thinking if this feature or any such patches
> > in future could be made part of a kernel config, which can be enabled
> > in CI when the driver is subjected to testing.
> >
> > I like this idea of having things like this behind a kernel config,
> > but we need to be careful to not simply break IGT or force IGT to
> > always depend on this flag.
>
> Well, debugfs is actually something that shouldn't be used in production ;-)
>
> Granted, read-only debugfs nodes that would be just printing the engine
> states without affecting the driver functionality can be present on production
> environments, as those could be used by sysadmins to report issues to
> distribution vendors.
>
> On other words, we may have two separate groups of debugfs entries:
>
> - entries that won't be changing the driver or GPU state machine;
> - entries that will be affecting the driver, like forcing resets
> forcing reset to fail, etc. are more dangerous and may deserve an
> special Kconfig entry.
>
> On such case, I would be placing a scary message at the Kconfig.debug file,
> like:
>
> config DRM_XE_DANGEROUS_DEBUGFS
> bool "Enable driver development debugfs entries
> (DANGEROUS)"
> depends on DRM_XE_DEBUG && DEBUGFS
> default n
>
> help
>
> Add special debugfs entries meant to test Xe driver.
>
> Never enable those in production as those may cause
> malfunctioning at the Xe driver.
>
> Set it to "N", except if you're debugging Xe driver.
>
> I would also place those options on a separate debugfs list, e. g.:
Hi Mauro,
Thanks for proposing on this. For the time being can we go ahead and introduce this debugfs in existing list ?
Once the changes are in for separating debugfs list it into danger_debugfs_list and debugfs_list
we will move fake_reset_failure debugfs to that list along with other applicable entries.
BR
Himal
>
> static const struct drm_info_list danger_debugfs_list[] = {
> {"force_reset", force_reset, 0},
> /*
> * Whatever other node that writing to it would cause
> * side effects at the driver
> */
> {"fake_reset_failure", fake_reset_failure, 0},
> };
>
> In time, DRM_XE_USERPTR_INVAL_INJECT is also a dangerous option. Btw,
> shouldn't it be enabled only via a debugfs entry too? If not, then perhaps we
> should use a menuconfig where we'll be placing all those error-inject
> features altogether.
>
> Regards,
> Mauro
More information about the Intel-xe
mailing list