[PATCH 3/3] drm/xe: Force wedged state and block GT reset upon any GPU hang

Wed Mar 13 20:56:55 UTC 2024

++

-----Original Message-----
From: De Marchi, Lucas <lucas.demarchi at intel.com> 
Sent: Wednesday, March 13, 2024 1:50 PM
To: Vivi, Rodrigo <rodrigo.vivi at intel.com>
Cc: intel-xe at lists.freedesktop.org; Teres Alexis, Alan Previn <alan.previn.teres.alexis at intel.com>; Somaiya, Himanshu <himanshu.somaiya at intel.com>
Subject: Re: [PATCH 3/3] drm/xe: Force wedged state and block GT reset upon any GPU hang

On Wed, Mar 13, 2024 at 03:54:59PM -0400, Rodrigo Vivi wrote:
>In many validation situations when debugging GPU Hangs, it is useful to 
>preserve the GT situation from the moment that the timeout occurred.
>
>This patch introduces a module parameter that could be used on 
>situations like this.
>
>If xe.wedged module parameter is set to 2, Xe will be declared wedged 
>on every single execution timeout (a.k.a. GPU hang) right after 
>devcoredump snapshot capture and without attempting any kind of GT 
>reset and blocking entirely any kind of execution.
>
>Cc: Alan Previn <alan.previn.teres.alexis at intel.com>
>Cc: Himanshu Somaiya <himanshu.somaiya at intel.com>
>Signed-off-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
>---
> drivers/gpu/drm/xe/xe_device.c     | 22 ++++++++++++++++++++++
> drivers/gpu/drm/xe/xe_device.h     |  6 +-----
> drivers/gpu/drm/xe/xe_guc_submit.c |  4 ++++
> drivers/gpu/drm/xe/xe_module.c     |  5 +++++
> drivers/gpu/drm/xe/xe_module.h     |  1 +
> 5 files changed, 33 insertions(+), 5 deletions(-)
>
>diff --git a/drivers/gpu/drm/xe/xe_device.c 
>b/drivers/gpu/drm/xe/xe_device.c index 5f0a2bdb7c24..296bc75a55f7 
>100644
>--- a/drivers/gpu/drm/xe/xe_device.c
>+++ b/drivers/gpu/drm/xe/xe_device.c
>@@ -774,3 +774,25 @@ u64 xe_device_uncanonicalize_addr(struct xe_device 
>*xe, u64 address)  {
> 	return address & GENMASK_ULL(xe->info.va_bits - 1, 0);  }
>+
>+/**
>+ * xe_device_declare_wedged - Declare device wedged
>+ * @xe: xe device instance
>+ *
>+ * This is a final state that can only be cleared with a module reload.

same thing I mentioned about module reload elsewhere.

>+ * In this state every IOCTL will be blocked so the GT cannot be used.
>+ * In general it will be called upon any critical error such as gt 
>+ reset
>+ * failure or guc loading failure.
>+ * If xe.wedged module parameter is set to 2, this function will be 
>+ called
>+ * on every single execution timeout (a.k.a. GPU hang) right after 
>+ devcoredump
>+ * snapshot capture. In this mode, GT reset won't be attempted so the 
>+ state of
>+ * the issue is preserved for further debugging.

Maybe make it a little bit more drastic so people don't use it when they shouldn't. I really hate the amount of reset=0 I see on machines being used with i915 with the additional patches to achieve a similar goal.
IMO it should be a very targeted tool in a toolbox, not the hammer it became.

"In this mode, GT reset won't be attempted so there will be no recovery.
Any GPU hang will completely kill the GPU so an autopsy may be attempted.  Use with care, if you know what you're doing."

>+ */
>+void xe_device_declare_wedged(struct xe_device *xe) {
>+	if (xe_modparam.wedged == 0)
>+		return;
>+
>+	atomic_set(&xe->wedged, 1);
>+	drm_err(&xe->drm, "CRITICAL: Xe has been declared wedged. A module 
>+reload is required.\n"); }
>diff --git a/drivers/gpu/drm/xe/xe_device.h 
>b/drivers/gpu/drm/xe/xe_device.h index d10664d32f7f..de3149baf82f 
>100644
>--- a/drivers/gpu/drm/xe/xe_device.h
>+++ b/drivers/gpu/drm/xe/xe_device.h
>@@ -181,10 +181,6 @@ static inline bool xe_device_wedged(struct xe_device *xe)
> 	return atomic_read(&xe->wedged);
> }
>
>-static inline void xe_device_declare_wedged(struct xe_device *xe) -{
>-	atomic_set(&xe->wedged, 1);
>-	drm_err(&xe->drm, "CRITICAL: Xe has been declared wedged. A module reload is required.\n");
>-}
>+void xe_device_declare_wedged(struct xe_device *xe);
>
> #endif
>diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c 
>b/drivers/gpu/drm/xe/xe_guc_submit.c
>index 19efdb2f881f..987a57205fc4 100644
>--- a/drivers/gpu/drm/xe/xe_guc_submit.c
>+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>@@ -34,6 +34,7 @@
> #include "xe_macros.h"
> #include "xe_map.h"
> #include "xe_mocs.h"
>+#include "xe_module.h"
> #include "xe_ring_ops_types.h"
> #include "xe_sched_job.h"
> #include "xe_trace.h"
>@@ -949,6 +950,9 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> 	simple_error_capture(q);
> 	xe_devcoredump(job);
>
>+	if (xe_modparam.wedged == 2)
>+		xe_device_declare_wedged(xe);
>+
> 	trace_xe_sched_job_timedout(job);
>
> 	/* Kill the run_job entry point */
>diff --git a/drivers/gpu/drm/xe/xe_module.c 
>b/drivers/gpu/drm/xe/xe_module.c index 110b69864656..61272553f40f 
>100644
>--- a/drivers/gpu/drm/xe/xe_module.c
>+++ b/drivers/gpu/drm/xe/xe_module.c
>@@ -17,6 +17,7 @@ struct xe_modparam xe_modparam = {
> 	.enable_display = true,
> 	.guc_log_level = 5,
> 	.force_probe = CONFIG_DRM_XE_FORCE_PROBE,
>+	.wedged = 1,
> 	/* the rest are 0 by default */
> };
>
>@@ -48,6 +49,10 @@ module_param_named_unsafe(force_probe, 
>xe_modparam.force_probe, charp, 0400);  MODULE_PARM_DESC(force_probe,
> 		 "Force probe options for specified devices. See 
>CONFIG_DRM_XE_FORCE_PROBE for details.");
>
>+module_param_named_unsafe(wedged, xe_modparam.wedged, int, 0600); 
>+MODULE_PARM_DESC(wedged,
>+		 "Wedged mode - 0=never, 1=upon-critical-errors[default], 
>+2=upon-any-hang");

as we chatted earlier today, I think there's one thing missing. If an engine hangs, we won't notice as GuC will reset the engine by itself.
We still need to pass GLOBAL_POLICY_DISABLE_ENGINE_RESET as a policy to GuC in the ADS structure. This is what I was attempting earlier today.

I was wondering if we could even update that in runtime with INTEL_GUC_ACTION_GLOBAL_SCHED_POLICY_CHANGE so we can target the debug only to when we are ready to reproduce the issues and only on the specific device.

Lucas De Marchi

>+
> struct init_funcs {
> 	int (*init)(void);
> 	void (*exit)(void);
>diff --git a/drivers/gpu/drm/xe/xe_module.h 
>b/drivers/gpu/drm/xe/xe_module.h index 88ef0e8b2bfd..0f95dcd8942f 
>100644
>--- a/drivers/gpu/drm/xe/xe_module.h
>+++ b/drivers/gpu/drm/xe/xe_module.h
>@@ -18,6 +18,7 @@ struct xe_modparam {
> 	char *huc_firmware_path;
> 	char *gsc_firmware_path;
> 	char *force_probe;
>+	int wedged;
> };
>
> extern struct xe_modparam xe_modparam;
>--
>2.44.0
>