[PATCH 3/3] drm/xe: Force wedged state and block GT reset upon any GPU hang

Fri Mar 15 01:06:50 UTC 2024

On Wed, Mar 13, 2024 at 11:00:06PM -0500, Lucas De Marchi wrote:
> On Wed, Mar 13, 2024 at 06:06:14PM -0400, Rodrigo Vivi wrote:
> > On Wed, Mar 13, 2024 at 04:54:38PM -0500, Lucas De Marchi wrote:
> > > On Wed, Mar 13, 2024 at 05:44:00PM -0400, Rodrigo Vivi wrote:
> > > > On Wed, Mar 13, 2024 at 03:49:56PM -0500, Lucas De Marchi wrote:
> > > > > On Wed, Mar 13, 2024 at 03:54:59PM -0400, Rodrigo Vivi wrote:
> > > > > > In many validation situations when debugging GPU Hangs,
> > > > > > it is useful to preserve the GT situation from the moment
> > > > > > that the timeout occurred.
> > > > > >
> > > > > > This patch introduces a module parameter that could be used
> > > > > > on situations like this.
> > > > > >
> > > > > > If xe.wedged module parameter is set to 2, Xe will be declared
> > > > > > wedged on every single execution timeout (a.k.a. GPU hang) right
> > > > > > after devcoredump snapshot capture and without attempting any
> > > > > > kind of GT reset and blocking entirely any kind of execution.
> > > > > >
> > > > > > Cc: Alan Previn <alan.previn.teres.alexis at intel.com>
> > > > > > Cc: Himanshu Somaiya <himanshu.somaiya at intel.com>
> > > > > > Signed-off-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > > > > > ---
> > > > > > drivers/gpu/drm/xe/xe_device.c     | 22 ++++++++++++++++++++++
> > > > > > drivers/gpu/drm/xe/xe_device.h     |  6 +-----
> > > > > > drivers/gpu/drm/xe/xe_guc_submit.c |  4 ++++
> > > > > > drivers/gpu/drm/xe/xe_module.c     |  5 +++++
> > > > > > drivers/gpu/drm/xe/xe_module.h     |  1 +
> > > > > > 5 files changed, 33 insertions(+), 5 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > > > > > index 5f0a2bdb7c24..296bc75a55f7 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > > > > @@ -774,3 +774,25 @@ u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address)
> > > > > > {
> > > > > > 	return address & GENMASK_ULL(xe->info.va_bits - 1, 0);
> > > > > > }
> > > > > > +
> > > > > > +/**
> > > > > > + * xe_device_declare_wedged - Declare device wedged
> > > > > > + * @xe: xe device instance
> > > > > > + *
> > > > > > + * This is a final state that can only be cleared with a module reload.
> > > > >
> > > > > same thing I mentioned about module reload elsewhere.
> > > > >
> > > > > > + * In this state every IOCTL will be blocked so the GT cannot be used.
> > > > > > + * In general it will be called upon any critical error such as gt reset
> > > > > > + * failure or guc loading failure.
> > > > > > + * If xe.wedged module parameter is set to 2, this function will be called
> > > > > > + * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
> > > > > > + * snapshot capture. In this mode, GT reset won't be attempted so the state of
> > > > > > + * the issue is preserved for further debugging.
> > > > >
> > > > > Maybe make it a little bit more drastic so people don't use it when they
> > > > > shouldn't. I really hate the amount of reset=0 I see on machines being
> > > > > used with i915 with the additional patches to achieve a similar goal.
> > > > > IMO it should be a very targeted tool in a toolbox, not the hammer it
> > > > > became.
> > > > >
> > > > > "In this mode, GT reset won't be attempted so there will be no recovery.
> > > > > Any GPU hang will completely kill the GPU so an autopsy may be
> > > > > attempted.  Use with care, if you know what you're doing."
> > > > >
> > > > > > + */
> > > > > > +void xe_device_declare_wedged(struct xe_device *xe)
> > > > > > +{
> > > > > > +	if (xe_modparam.wedged == 0)
> > > > > > +		return;
> > > > > > +
> > > > > > +	atomic_set(&xe->wedged, 1);
> > > > > > +	drm_err(&xe->drm, "CRITICAL: Xe has been declared wedged. A module reload is required.\n");
> > > > > > +}
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> > > > > > index d10664d32f7f..de3149baf82f 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_device.h
> > > > > > +++ b/drivers/gpu/drm/xe/xe_device.h
> > > > > > @@ -181,10 +181,6 @@ static inline bool xe_device_wedged(struct xe_device *xe)
> > > > > > 	return atomic_read(&xe->wedged);
> > > > > > }
> > > > > >
> > > > > > -static inline void xe_device_declare_wedged(struct xe_device *xe)
> > > > > > -{
> > > > > > -	atomic_set(&xe->wedged, 1);
> > > > > > -	drm_err(&xe->drm, "CRITICAL: Xe has been declared wedged. A module reload is required.\n");
> > > > > > -}
> > > > > > +void xe_device_declare_wedged(struct xe_device *xe);
> > > > > >
> > > > > > #endif
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > > > index 19efdb2f881f..987a57205fc4 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > > > @@ -34,6 +34,7 @@
> > > > > > #include "xe_macros.h"
> > > > > > #include "xe_map.h"
> > > > > > #include "xe_mocs.h"
> > > > > > +#include "xe_module.h"
> > > > > > #include "xe_ring_ops_types.h"
> > > > > > #include "xe_sched_job.h"
> > > > > > #include "xe_trace.h"
> > > > > > @@ -949,6 +950,9 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> > > > > > 	simple_error_capture(q);
> > > > > > 	xe_devcoredump(job);
> > > > > >
> > > > > > +	if (xe_modparam.wedged == 2)
> > > > > > +		xe_device_declare_wedged(xe);
> > > > > > +
> > > > > > 	trace_xe_sched_job_timedout(job);
> > > > > >
> > > > > > 	/* Kill the run_job entry point */
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_module.c b/drivers/gpu/drm/xe/xe_module.c
> > > > > > index 110b69864656..61272553f40f 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_module.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_module.c
> > > > > > @@ -17,6 +17,7 @@ struct xe_modparam xe_modparam = {
> > > > > > 	.enable_display = true,
> > > > > > 	.guc_log_level = 5,
> > > > > > 	.force_probe = CONFIG_DRM_XE_FORCE_PROBE,
> > > > > > +	.wedged = 1,
> > > > > > 	/* the rest are 0 by default */
> > > > > > };
> > > > > >
> > > > > > @@ -48,6 +49,10 @@ module_param_named_unsafe(force_probe, xe_modparam.force_probe, charp, 0400);
> > > > > > MODULE_PARM_DESC(force_probe,
> > > > > > 		 "Force probe options for specified devices. See CONFIG_DRM_XE_FORCE_PROBE for details.");
> > > > > >
> > > > > > +module_param_named_unsafe(wedged, xe_modparam.wedged, int, 0600);
> > > > > > +MODULE_PARM_DESC(wedged,
> > > > > > +		 "Wedged mode - 0=never, 1=upon-critical-errors[default], 2=upon-any-hang");
> > > > >
> > > > > as we chatted earlier today, I think there's one thing missing. If an
> > > > > engine hangs, we won't notice as GuC will reset the engine by itself.
> > > > > We still need to pass GLOBAL_POLICY_DISABLE_ENGINE_RESET as a policy to
> > > > > GuC in the ADS structure. This is what I was attempting earlier today.
> > > > >
> > > > > I was wondering if we could even update that in runtime with
> > > > > INTEL_GUC_ACTION_GLOBAL_SCHED_POLICY_CHANGE so we can target the debug
> > > > > only to when we are ready to reproduce the issues and only on the
> > > > > specific device.
> > > >
> > > > I like that... Perhaps instead of the module parameter we have a debugfs
> > > > entry where we switch the state?
> > > >
> > > > So, on a debugfs write to switch to this mode that gets 'busted'/'wedged'/'dead'
> > > > on every timeout/hang, then at that time we send this sched policy change?
> > > 
> > > the shortcome that we always have is that it doesn't cover the probing
> > > part. One common problem on early device support (either because of hw
> > > issues or driver/guc bugs) is on default context submission. If that
> > > receives a timeout, shouldn't we declare it busted?
> > 
> > hmm... indeed.
> > 
> > so, perhaps continue with the mod param xe.busted=2 but then right before any
> > submission we check for xe.modparam.busted == 2 to send the
> > INTEL_GUC_ACTION_GLOBAL_SCHED_POLICY_CHANGE before sending the command.
> > 
> > so on the next upcoming execution it would avoid the gt reset entirely.
> > 
> > then we could use it either at boot with xe.busted=2
> > or at runtime by
> > echo 2 | sudo tee /sys/module/xe/parameters/busted
> 
> I think we can use the modparam on probe and already put it in ads.
> That dictates the default behavior for the _module_ regardless of the device.

agreed. I already sent the 3 patches that accomplished that.

> Then we allow either setting the param to change the default behavior
> like above or we create a debugfs so we can set it per-device after the
> probe.

I have the 4th patch in here: https://github.com/rodrigovivi/linux/commits/xe-busted
that is targeting this goal. However I'm still dealing with trying to change
the guc sched policy on the fly.

I'm not convinced that i915 code around that 0x506 command is the right code,
so I'm still investigating the spec and doing some experiments.

But I'd like to move forward with this default behavior with module
parameter so we unblock our sv teams.

thoughts?

Thanks,
Rodrigo.

> 
> Advantage of having the debugfs is that we that we could even go on
> other devices we don't want to debug and set them back to 0 too.
> 
> Lucas De Marchi