[Intel-gfx] [PATCH] drm/i915: Save hangcheck score across resets

Mika Kuoppala mika.kuoppala at linux.intel.com
Thu Oct 6 07:00:09 UTC 2016


The hangcheck score has been zeroed on engine init, which happens
after reset recovery. This worked well as long as we always reset
all engines on a hang and discarded all work submitted to them.

With commit 821ed7df6e2a ("drm/i915: Update reset path to fix
incomplete requests") the driver gained the capability to discard
only the request or requests directly involved in the hang; requests
deemed innocent are replayed intact.

Our hangcheck works by periodically sampling the engine state and
then doing checks in multiple stages to see if the engine is making
progress. The engine capabilities differ. With the render engine, we
have more ways to measure progress and thus more checks and stages.
With other engines, we only sample the seqno and head movement.

Now consider that the blitter engine is waiting on render and the
render engine has a batch which is stuck. Due to its simpler checks,
the blitter engine's hangcheck score accumulates faster and reaches
the reset threshold sooner. We then blame the blitter for the hang, as
it had the highest score when recovery started.
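
As a rough illustration of the scenario above, the following toy model
lets two stalled engines accumulate score at different rates and
reports which one holds the highest score when the threshold is first
reached. All names and constants here (toy_engine, TOY_SCORE_THRESHOLD,
the per-sample increments) are hypothetical and chosen for
illustration; they are not taken from the i915 source.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical values for illustration only, not the driver's constants. */
#define TOY_SCORE_THRESHOLD	31
#define TOY_SIMPLE_STALL	20	/* engine with only seqno/head checks */
#define TOY_STAGED_STALL	5	/* engine that walks through more stages */

struct toy_engine {
	const char *name;
	int score;
	int stall_increment;	/* score added per stalled sample, must be > 0 */
};

/*
 * Sample all (stalled) engines until one reaches the threshold, then
 * return the index of the engine with the highest score. Returns -1 if
 * no engine reaches the threshold within TOY_SCORE_THRESHOLD samples.
 */
static int toy_run_until_hang(struct toy_engine *engines, int count)
{
	for (int tick = 0; tick < TOY_SCORE_THRESHOLD; tick++) {
		bool hung = false;
		int worst = 0;

		for (int i = 0; i < count; i++) {
			engines[i].score += engines[i].stall_increment;
			if (engines[i].score > engines[worst].score)
				worst = i;
			if (engines[i].score >= TOY_SCORE_THRESHOLD)
				hung = true;
		}

		if (hung)
			return worst;
	}

	return -1;
}
```

With these made-up increments, the waiting blitter crosses the
threshold on the second sample with a score of 40 while the stuck
render engine sits at 10, so the blitter is the one that gets blamed.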

Blaming the wrong engine, we don't find the actual guilty request
and, most critically, won't make any progress after the reset. That
will lead to a second hang with the same pattern, ad infinitum.

Previously, falsely blaming an engine was not critical, as the score
was only used as a trigger for a full reset and as a debug aid in
error states. But now the score is essential for finding the culprit
request.

To fix this, keep the hangcheck scores across resets. We already
have a decay mechanism in place if progress is being made. This
ensures that even if we blame the wrong engine once, we don't
do it twice or consistently. The real culprit request will be
cleared and real progress will be made, which untangles the rest of
the engines and leads to a successful recovery.
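
A minimal sketch of the intended behaviour, assuming a simplified
score that grows while an engine is stalled and decays while it makes
progress. The names (toy_hangcheck, toy_init_hangcheck, toy_sample)
and the constants are hypothetical and do not mirror the driver's
actual structures or values; the zero_on_init flag only contrasts the
old behaviour with the new one.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical values for illustration only. */
#define TOY_STALL_SCORE	5	/* added per stalled sample */
#define TOY_DECAY	3	/* removed per sample that shows progress */

struct toy_hangcheck {
	int score;
};

/*
 * Reset-time init. zero_on_init == true models the old behaviour
 * (score wiped on every reset); false models keeping the score.
 */
static void toy_init_hangcheck(struct toy_hangcheck *hc, bool zero_on_init)
{
	if (zero_on_init)
		hc->score = 0;
}

/* One periodic sample: decay on progress, accumulate on stall. */
static void toy_sample(struct toy_hangcheck *hc, bool made_progress)
{
	if (made_progress) {
		hc->score -= TOY_DECAY;
		if (hc->score < 0)
			hc->score = 0;
	} else {
		hc->score += TOY_STALL_SCORE;
	}
}
```

In this model, an engine that was stalled before the reset carries its
accumulated score into the next round of sampling, and only actual
progress brings the score back down.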

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98104
Cc: Chris Wilson <chris at chris-wilson.co.uk>
Signed-off-by: Mika Kuoppala <mika.kuoppala at intel.com>
---
 drivers/gpu/drm/i915/intel_engine_cs.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index d00ec805f93d..4bb869eb11bc 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -209,7 +209,6 @@ void intel_engine_init_seqno(struct intel_engine_cs *engine, u32 seqno)
 
 void intel_engine_init_hangcheck(struct intel_engine_cs *engine)
 {
-	memset(&engine->hangcheck, 0, sizeof(engine->hangcheck));
 	clear_bit(engine->id, &engine->i915->gpu_error.missed_irq_rings);
 	if (intel_engine_has_waiter(engine))
 		i915_queue_hangcheck(engine->i915);
-- 
2.7.4


