[PATCH 3/3] drm/xe: Force wedged state and block GT reset upon any GPU hang

Wed Mar 13 20:49:56 UTC 2024

On Wed, Mar 13, 2024 at 03:54:59PM -0400, Rodrigo Vivi wrote:
>In many validation situations when debugging GPU Hangs,
>it is useful to preserve the GT situation from the moment
>that the timeout occurred.
>
>This patch introduces a module parameter that could be used
>on situations like this.
>
>If xe.wedged module parameter is set to 2, Xe will be declared
>wedged on every single execution timeout (a.k.a. GPU hang) right
>after devcoredump snapshot capture and without attempting any
>kind of GT reset and blocking entirely any kind of execution.
>
>Cc: Alan Previn <alan.previn.teres.alexis at intel.com>
>Cc: Himanshu Somaiya <himanshu.somaiya at intel.com>
>Signed-off-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
>---
> drivers/gpu/drm/xe/xe_device.c     | 22 ++++++++++++++++++++++
> drivers/gpu/drm/xe/xe_device.h     |  6 +-----
> drivers/gpu/drm/xe/xe_guc_submit.c |  4 ++++
> drivers/gpu/drm/xe/xe_module.c     |  5 +++++
> drivers/gpu/drm/xe/xe_module.h     |  1 +
> 5 files changed, 33 insertions(+), 5 deletions(-)
>
>diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>index 5f0a2bdb7c24..296bc75a55f7 100644
>--- a/drivers/gpu/drm/xe/xe_device.c
>+++ b/drivers/gpu/drm/xe/xe_device.c
>@@ -774,3 +774,25 @@ u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address)
> {
> 	return address & GENMASK_ULL(xe->info.va_bits - 1, 0);
> }
>+
>+/**
>+ * xe_device_declare_wedged - Declare device wedged
>+ * @xe: xe device instance
>+ *
>+ * This is a final state that can only be cleared with a module reload.

same thing I mentioned about module reload elsewhere.

>+ * In this state every IOCTL will be blocked so the GT cannot be used.
>+ * In general it will be called upon any critical error such as gt reset
>+ * failure or guc loading failure.
>+ * If xe.wedged module parameter is set to 2, this function will be called
>+ * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
>+ * snapshot capture. In this mode, GT reset won't be attempted so the state of
>+ * the issue is preserved for further debugging.

Maybe make it a little bit more drastic so people don't use it when they
shouldn't. I really hate the amount of reset=0 I see on machines being
used with i915 with the additional patches to achieve a similar goal.
IMO it should be a very targeted tool in a toolbox, not the hammer it
became.

"In this mode, GT reset won't be attempted so there will be no recovery.
Any GPU hang will completely kill the GPU so an autopsy may be
attempted.  Use with care, if you know what you're doing."

>+ */
>+void xe_device_declare_wedged(struct xe_device *xe)
>+{
>+	if (xe_modparam.wedged == 0)
>+		return;
>+
>+	atomic_set(&xe->wedged, 1);
>+	drm_err(&xe->drm, "CRITICAL: Xe has been declared wedged. A module reload is required.\n");
>+}
>diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
>index d10664d32f7f..de3149baf82f 100644
>--- a/drivers/gpu/drm/xe/xe_device.h
>+++ b/drivers/gpu/drm/xe/xe_device.h
>@@ -181,10 +181,6 @@ static inline bool xe_device_wedged(struct xe_device *xe)
> 	return atomic_read(&xe->wedged);
> }
>
>-static inline void xe_device_declare_wedged(struct xe_device *xe)
>-{
>-	atomic_set(&xe->wedged, 1);
>-	drm_err(&xe->drm, "CRITICAL: Xe has been declared wedged. A module reload is required.\n");
>-}
>+void xe_device_declare_wedged(struct xe_device *xe);
>
> #endif
>diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>index 19efdb2f881f..987a57205fc4 100644
>--- a/drivers/gpu/drm/xe/xe_guc_submit.c
>+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>@@ -34,6 +34,7 @@
> #include "xe_macros.h"
> #include "xe_map.h"
> #include "xe_mocs.h"
>+#include "xe_module.h"
> #include "xe_ring_ops_types.h"
> #include "xe_sched_job.h"
> #include "xe_trace.h"
>@@ -949,6 +950,9 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> 	simple_error_capture(q);
> 	xe_devcoredump(job);
>
>+	if (xe_modparam.wedged == 2)
>+		xe_device_declare_wedged(xe);
>+
> 	trace_xe_sched_job_timedout(job);
>
> 	/* Kill the run_job entry point */
>diff --git a/drivers/gpu/drm/xe/xe_module.c b/drivers/gpu/drm/xe/xe_module.c
>index 110b69864656..61272553f40f 100644
>--- a/drivers/gpu/drm/xe/xe_module.c
>+++ b/drivers/gpu/drm/xe/xe_module.c
>@@ -17,6 +17,7 @@ struct xe_modparam xe_modparam = {
> 	.enable_display = true,
> 	.guc_log_level = 5,
> 	.force_probe = CONFIG_DRM_XE_FORCE_PROBE,
>+	.wedged = 1,
> 	/* the rest are 0 by default */
> };
>
>@@ -48,6 +49,10 @@ module_param_named_unsafe(force_probe, xe_modparam.force_probe, charp, 0400);
> MODULE_PARM_DESC(force_probe,
> 		 "Force probe options for specified devices. See CONFIG_DRM_XE_FORCE_PROBE for details.");
>
>+module_param_named_unsafe(wedged, xe_modparam.wedged, int, 0600);
>+MODULE_PARM_DESC(wedged,
>+		 "Wedged mode - 0=never, 1=upon-critical-errors[default], 2=upon-any-hang");

as we chatted earlier today, I think there's one thing missing. If an
engine hangs, we won't notice as GuC will reset the engine by itself.
We still need to pass GLOBAL_POLICY_DISABLE_ENGINE_RESET as a policy to
GuC in the ADS structure. This is what I was attempting earlier today.

I was wondering if we could even update that in runtime with
INTEL_GUC_ACTION_GLOBAL_SCHED_POLICY_CHANGE so we can target the debug
only to when we are ready to reproduce the issues and only on the
specific device.

Lucas De Marchi

>+
> struct init_funcs {
> 	int (*init)(void);
> 	void (*exit)(void);
>diff --git a/drivers/gpu/drm/xe/xe_module.h b/drivers/gpu/drm/xe/xe_module.h
>index 88ef0e8b2bfd..0f95dcd8942f 100644
>--- a/drivers/gpu/drm/xe/xe_module.h
>+++ b/drivers/gpu/drm/xe/xe_module.h
>@@ -18,6 +18,7 @@ struct xe_modparam {
> 	char *huc_firmware_path;
> 	char *gsc_firmware_path;
> 	char *force_probe;
>+	int wedged;
> };
>
> extern struct xe_modparam xe_modparam;
>-- 
>2.44.0
>