[Intel-gfx] [PATCH v2] drm/i915: Taint (TAINT_DIE) the kernel if the GPU reset fails

Chris Wilson chris at chris-wilson.co.uk
Wed Nov 29 14:05:33 UTC 2017


History tells us that if we cannot reset the GPU now, we never will. This
then impacts everything that is run subsequently. On failing the reset,
we mark the driver as wedged, trying to prevent further execution on the
GPU, forcing userspace to fallback to using the CPU to update its
framebuffers and let the user know what happened.

We also want to go one step further and add a taint to the kernel so that
any subsequent faults can be traced back to this failure. This is
important for igt, where if the GPU/driver fails we want to reboot and
restart testing rather than continue on into oblivion.

TAINT_DIE is colloquially known as "system on fire", which seems
appropriate for unresponsive hardware.

v2: Also taint if the recovery fails (again history shows us that is
typically fatal).

References: https://bugs.freedesktop.org/show_bug.cgi?id=103514
Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
Cc: MichaƂ Winiarski <michal.winiarski at intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c | 28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 696d5cdf2779..eb90ddac7f8b 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1897,18 +1897,21 @@ void i915_reset(struct drm_i915_private *i915, unsigned int flags)
 	disable_irq(i915->drm.irq);
 	ret = i915_gem_reset_prepare(i915);
 	if (ret) {
-		DRM_ERROR("GPU recovery failed\n");
+		dev_err(i915->drm.dev, "GPU recovery failed\n");
 		intel_gpu_reset(i915, ALL_ENGINES);
-		goto error;
+		goto taint;
 	}
 
 	ret = intel_gpu_reset(i915, ALL_ENGINES);
 	if (ret) {
-		if (ret != -ENODEV)
-			DRM_ERROR("Failed to reset chip: %i\n", ret);
-		else
+		if (ret != -ENODEV) {
+			dev_err(i915->drm.dev,
+				"Failed to reset chip: %i\n", ret);
+			goto taint;
+		} else {
 			DRM_DEBUG_DRIVER("GPU reset disabled\n");
-		goto error;
+			goto error;
+		}
 	}
 
 	i915_gem_reset(i915);
@@ -1951,6 +1954,19 @@ void i915_reset(struct drm_i915_private *i915, unsigned int flags)
 	wake_up_bit(&error->flags, I915_RESET_HANDOFF);
 	return;
 
+taint:
+	/*
+	 * History tells us that if we cannot reset the GPU now, we
+	 * never will. This then impacts everything that is run
+	 * subsequently. On failing the reset, we mark the driver
+	 * as wedged, preventing further execution on the GPU.
+	 * We also want to go one step further and add a taint to the
+	 * kernel so that any subsequent faults can be traced back to
+	 * this failure. This is important for igt, where if the
+	 * GPU/driver fails we want to reboot and restart testing
+	 * rather than continue on into oblivion.
+	 */
+	add_taint(TAINT_DIE, LOCKDEP_STILL_OK);
 error:
 	i915_gem_set_wedged(i915);
 	i915_gem_retire_requests(i915);
-- 
2.15.0



More information about the Intel-gfx mailing list