[Intel-gfx] [PATCH v2] drm/i915: Taint (TAINT_DIE) the kernel if the GPU reset fails
Lofstedt, Marta
marta.lofstedt at intel.com
Thu Nov 30 12:24:49 UTC 2017
> -----Original Message-----
> From: Intel-gfx [mailto:intel-gfx-bounces at lists.freedesktop.org] On Behalf
> Of Chris Wilson
> Sent: Wednesday, November 29, 2017 4:06 PM
> To: intel-gfx at lists.freedesktop.org
> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> Subject: [Intel-gfx] [PATCH v2] drm/i915: Taint (TAINT_DIE) the kernel if the
> GPU reset fails
>
> History tells us that if we cannot reset the GPU now, we never will. This then
> impacts everything that is run subsequently. On failing the reset, we mark
> the driver as wedged, trying to prevent further execution on the GPU,
> forcing userspace to fallback to using the CPU to update its framebuffers and
> let the user know what happened.
>
> We also want to go one step further and add a taint to the kernel so that any
> subsequent faults can be traced back to this failure. This is important for igt,
> where if the GPU/driver fails we want to reboot and restart testing rather
> than continue on into oblivion.
I am OK with this if the comment deciding what IGT wants to do when this state is collected is removed.
I just want to make it clear that the tainting feature is separated from what that should be done when the system is tainted.
>
> TAINT_DIE is colloquially known as "system on fire", which seems
> appropriate for unresponsive hardware.
>
> v2: Also taint if the recovery fails (again history shows us that is typically
> fatal).
>
> References: https://bugs.freedesktop.org/show_bug.cgi?id=103514
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> Cc: Michał Winiarski <michal.winiarski at intel.com>
> ---
> drivers/gpu/drm/i915/i915_drv.c | 28 ++++++++++++++++++++++------
> 1 file changed, 22 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.c
> b/drivers/gpu/drm/i915/i915_drv.c index 696d5cdf2779..eb90ddac7f8b
> 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1897,18 +1897,21 @@ void i915_reset(struct drm_i915_private *i915,
> unsigned int flags)
> disable_irq(i915->drm.irq);
> ret = i915_gem_reset_prepare(i915);
> if (ret) {
> - DRM_ERROR("GPU recovery failed\n");
> + dev_err(i915->drm.dev, "GPU recovery
> failed\n");
> intel_gpu_reset(i915, ALL_ENGINES);
> - goto error;
> + goto taint;
> }
>
> ret = intel_gpu_reset(i915, ALL_ENGINES);
> if (ret) {
> - if (ret != -ENODEV)
> - DRM_ERROR("Failed to reset chip:
> %i\n", ret);
> - else
> + if (ret != -ENODEV) {
> + dev_err(i915->drm.dev,
> + "Failed to reset
> chip: %i\n", ret);
> + goto taint;
> + } else {
> DRM_DEBUG_DRIVER("GPU reset
> disabled\n");
> - goto error;
> + goto error;
> + }
> }
>
> i915_gem_reset(i915);
> @@ -1951,6 +1954,19 @@ void i915_reset(struct drm_i915_private *i915,
> unsigned int flags)
> wake_up_bit(&error->flags, I915_RESET_HANDOFF);
> return;
>
> +taint:
> + /*
> + * History tells us that if we cannot reset the GPU now, we
> + * never will. This then impacts everything that is run
> + * subsequently. On failing the reset, we mark the driver
> + * as wedged, preventing further execution on the GPU.
> + * We also want to go one step further and add a taint to the
> + * kernel so that any subsequent faults can be traced back to
> + * this failure. This is important for igt, where if the
> + * GPU/driver fails we want to reboot and restart testing
> + * rather than continue on into oblivion.
> + */
> + add_taint(TAINT_DIE, LOCKDEP_STILL_OK);
> error:
> i915_gem_set_wedged(i915);
> i915_gem_retire_requests(i915);
> --
> 2.15.0
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
More information about the Intel-gfx
mailing list