[Intel-gfx] [PATCH] drm/i915: Inspect subunit states on hangcheck

Fri Jan 8 07:10:51 PST 2016

On Fri, Jan 08, 2016 at 04:54:19PM +0200, Mika Kuoppala wrote:
> Chris Wilson <chris at chris-wilson.co.uk> writes:
> 
> > On Tue, Dec 01, 2015 at 05:56:12PM +0200, Mika Kuoppala wrote:
> >> If the head seems stuck and the engine in question is rcs,
> >> inspect subunit state transitions from undone to done
> >> before deciding that this really is a hang instead of limited
> >> progress. Only account for the transitions of subunits from
> >> undone to done once, to prevent unstable subunit states
> >> from keeping us falsely active.
> >> 
> >> As this adds one extra step to the hangcheck heuristics
> >> before a hang is declared, it adds 1500ms to the time taken to
> >> detect a hang on the render ring, for a total of 7500ms. We could
> >> sample the subunit states on the first head stuck condition, but
> >> choose not to, purely in order to mimic the old behaviour. This
> >> way the checks are consistently promoted in the order
> >> seqno > acthd > instdone.
> >> 
> >> v2: Deal with unstable done states (Arun)
> >>     Clear instdone progress on head and seqno movement (Chris)
> >>     Report raw and accumulated instdones in debugfs (Chris)
> >>     Return HANGCHECK_ACTIVE on undone->done
> >> 
> >> References: https://bugs.freedesktop.org/show_bug.cgi?id=93029
> >> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> >> Cc: Dave Gordon <david.s.gordon at intel.com>
> >> Cc: Daniel Vetter <daniel at ffwll.ch>
> >> Cc: Arun Siluvery <arun.siluvery at linux.intel.com>
> >> Signed-off-by: Mika Kuoppala <mika.kuoppala at intel.com>
> >
> > I feel slightly dubious about discarding the 1->0 transitions (as it just
> > means that a shared function that was previously idle is now in use
> > again), but if they truly do fluctuate randomly(?), then accumulating
> > should mean we eventually escape.
> >
> > Reviewed-by: Chris Wilson <chris at chris-wilson.co.uk>
> 
> Queued for -next, thanks for the review. 
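
For reference, the "only count undone->done once" accumulation described in
the commit message can be modelled in standalone C. This is a rough sketch
under assumptions, not the driver code: struct subunit_tracker,
subunits_stuck_model() and subunit_tracker_reset() are hypothetical names,
and a single 32-bit word stands in for the set of instdone registers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one 32-bit word of per-subunit "done" bits. */
struct subunit_tracker {
	uint32_t accumulated;	/* bits seen as done at least once */
};

/*
 * Fold a new sample into the accumulated state. Returns true when the
 * sample contributes no bit that was not already accumulated, i.e. no
 * subunit made a first-time undone->done transition. A 1->0 flip is
 * ignored, and a later 0->1 flip of the same bit is not counted again.
 */
static bool subunits_stuck_model(struct subunit_tracker *t, uint32_t sample)
{
	uint32_t merged = t->accumulated | sample;
	bool stuck = (merged == t->accumulated);

	t->accumulated = merged;
	return stuck;
}

/* Forget accumulated progress, as on head or seqno movement. */
static void subunit_tracker_reset(struct subunit_tracker *t)
{
	t->accumulated = 0;
}
```

Because each bit can only be newly set once between resets, a subunit that
flips back and forth can report progress at most once per bit, so the check
has to report stuck eventually.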

Hmm, you just reminded me that we have a problem with HEAD running wild
now as we only detect a loop when it goes past 1<<48 (and we only
increment the score when we loop).

Something like:

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index b2ef2d0c211b..4fe28a0301f2 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2949,21 +2949,15 @@ static enum intel_engine_hangcheck_action
 head_stuck(struct intel_engine_cs *ring, u64 acthd)
 {
        if (acthd != ring->hangcheck.acthd) {
-
                /* Clear subunit states on head movement */
                memset(ring->hangcheck.instdone, 0,
                       sizeof(ring->hangcheck.instdone));
 
-               if (acthd > ring->hangcheck.max_acthd) {
-                       ring->hangcheck.max_acthd = acthd;
-                       return HANGCHECK_ACTIVE;
-               }
-
                return HANGCHECK_ACTIVE_LOOP;
        }
 
        if (!subunits_stuck(ring))
-               return HANGCHECK_ACTIVE;
+               return HANGCHECK_ACTIVE_LOOP;
 
        return HANGCHECK_HUNG;
 }
@@ -3117,7 +3111,9 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
                         * attempts across multiple batches.
                         */
                        if (ring->hangcheck.score > 0)
-                               ring->hangcheck.score--;
+                               ring->hangcheck.score -= HUNG;
+                       if (ring->hangcheck.score < 0)
+                               ring->hangcheck.score = 0;
 
                        /* Clear head and subunit states on seqno movement */
                        ring->hangcheck.acthd = ring->hangcheck.max_acthd = 0;
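
The score decay in the second hunk can be sketched in isolation as below.
The assumptions here are mine: decay_score() is a hypothetical helper,
HUNG_MODEL stands in for the driver's HUNG constant with an illustrative
value of 20, and the score is taken to be a signed int so that the clamp
at zero is meaningful.

```c
/* Illustrative stand-in for the driver's HUNG constant (assumed value). */
#define HUNG_MODEL 20

/*
 * On seqno progress, drop the score by a full "hung" step instead of
 * by one, and never let it go negative.
 */
static int decay_score(int score)
{
	if (score > 0)
		score -= HUNG_MODEL;
	if (score < 0)
		score = 0;

	return score;
}
```

With these values a score of 45 decays to 25 and a score of 5 is clamped
to 0, so one interval of real progress wipes out a small accumulated
penalty instead of leaving most of it in place.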

-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre