[Intel-gfx] [PATCH] drm/i915: Inspect subunit states on hangcheck
Chris Wilson
chris at chris-wilson.co.uk
Fri Jan 8 07:10:51 PST 2016
On Fri, Jan 08, 2016 at 04:54:19PM +0200, Mika Kuoppala wrote:
> Chris Wilson <chris at chris-wilson.co.uk> writes:
>
> > On Tue, Dec 01, 2015 at 05:56:12PM +0200, Mika Kuoppala wrote:
> >> If head seems stuck and engine in question is rcs,
> >> inspect subunit state transitions from undone to done,
> >> before deciding that this really is a hang instead of limited
> >> progress. Only count each subunit's transition from
> >> undone to done once, to prevent unstable subunit states
> >> from keeping us falsely active.
> >>
> >> As this adds one extra step to the hangcheck heuristics
> >> before a hang is declared, it adds 1500ms to hang detection
> >> for the render ring, for a total of 7500ms. We could sample
> >> the subunit states on the first head stuck condition but
> >> decide not to do so, only in order to mimic the old behaviour.
> >> This way the check order of promotion from
> >> seqno > acthd > instdone is consistently done.
> >>
> >> v2: Deal with unstable done states (Arun)
> >> Clear instdone progress on head and seqno movement (Chris)
> >> Report raw and accumulated instdone's in debugfs (Chris)
> >> Return HANGCHECK_ACTIVE on undone->done
> >>
> >> References: https://bugs.freedesktop.org/show_bug.cgi?id=93029
> >> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> >> Cc: Dave Gordon <david.s.gordon at intel.com>
> >> Cc: Daniel Vetter <daniel at ffwll.ch>
> >> Cc: Arun Siluvery <arun.siluvery at linux.intel.com>
> >> Signed-off-by: Mika Kuoppala <mika.kuoppala at intel.com>
> >
> > I feel slightly dubious in discarding the 1->0 transitions (as it just
> > means that a shared function that was previously idle is now in use
> > again), but if they truly do fluctuate randomly? then accumulating
> > should mean we eventually escape.
> >
> > Reviewed-by: Chris Wilson <chris at chris-wilson.co.uk>
>
> Queued for -next, thanks for the review.
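
For anyone following along, the accumulate-only behaviour discussed in the quoted review can be sketched roughly as below. This is a standalone illustration, not the code from the patch: the struct, the function name and the register count of 4 are all made up for the example. The idea is that a sampled done bit only counts as progress the first time it is seen, so a subunit flapping between done and undone cannot keep the engine looking active.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register count, chosen only for this sketch. */
#define SKETCH_NUM_INSTDONE_REG 4

/* Hypothetical state: done bits accumulated since the last reset. */
struct sketch_hangcheck {
	uint32_t instdone[SKETCH_NUM_INSTDONE_REG];
};

/*
 * Returns true if the sample shows no done bit that was not already
 * accumulated. Bits are only ever OR-ed in, so a 1->0 transition is
 * discarded and a later 0->1 on the same bit is not counted again.
 */
static bool sketch_subunits_stuck(struct sketch_hangcheck *hc,
				  const uint32_t sample[SKETCH_NUM_INSTDONE_REG])
{
	bool stuck = true;

	assert(hc != NULL);

	for (size_t i = 0; i < SKETCH_NUM_INSTDONE_REG; i++) {
		const uint32_t accumulated = hc->instdone[i] | sample[i];

		if (accumulated != hc->instdone[i])
			stuck = false; /* a new undone->done transition */

		hc->instdone[i] = accumulated;
	}

	return stuck;
}
```

In this sketch, clearing `instdone` with memset() on head or seqno movement, as in the v2 notes above, is what restarts the accumulation.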

Hmm, you just reminded me that we have a problem with HEAD running wild
now as we only detect a loop when it goes past 1<<48 (and we only
increment the score when we loop).

Something like:

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index b2ef2d0c211b..4fe28a0301f2 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2949,21 +2949,15 @@ static enum intel_engine_hangcheck_action
 head_stuck(struct intel_engine_cs *ring, u64 acthd)
 {
 	if (acthd != ring->hangcheck.acthd) {
-
 		/* Clear subunit states on head movement */
 		memset(ring->hangcheck.instdone, 0,
 		       sizeof(ring->hangcheck.instdone));
-		if (acthd > ring->hangcheck.max_acthd) {
-			ring->hangcheck.max_acthd = acthd;
-			return HANGCHECK_ACTIVE;
-		}
-
 		return HANGCHECK_ACTIVE_LOOP;
 	}
 	if (!subunits_stuck(ring))
-		return HANGCHECK_ACTIVE;
+		return HANGCHECK_ACTIVE_LOOP;
 	return HANGCHECK_HUNG;
 }
@@ -3117,7 +3111,9 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 			 * attempts across multiple batches.
 			 */
 			if (ring->hangcheck.score > 0)
-				ring->hangcheck.score--;
+				ring->hangcheck.score -= HUNG;
+			if (ring->hangcheck.score < 0)
+				ring->hangcheck.score = 0;
 			/* Clear head and subunit states on seqno movement */
 			ring->hangcheck.acthd = ring->hangcheck.max_acthd = 0;
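
To make the intent of the hunks above a bit more concrete, here is a rough standalone model of the scoring. It is only a sketch: the type and function names are hypothetical, and the weights (1 per looping check, 20 per hung check) and the threshold of 31 are assumptions picked for illustration, not values taken from this diff. The point it tries to show is that any check without seqno progress now adds to the score, so a runaway HEAD eventually crosses the threshold, while seqno progress takes off a whole hung-sized chunk and clamps at zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed weights and threshold, for illustration only. */
#define SKETCH_SCORE_BUSY      1   /* added per check while head still moves */
#define SKETCH_SCORE_HUNG      20  /* added when stuck, removed on seqno progress */
#define SKETCH_SCORE_RING_HUNG 31  /* score at which the engine counts as hung */

enum sketch_action { SKETCH_ACTION_LOOP, SKETCH_ACTION_HUNG };

/* Hypothetical per-engine state. */
struct sketch_engine {
	int score;
};

/*
 * Models the proposed head_stuck(): head movement or subunit progress
 * is treated as "active but looping", never as plain "active".
 */
static enum sketch_action sketch_head_stuck(bool head_moved, bool subunits_stuck)
{
	if (head_moved || !subunits_stuck)
		return SKETCH_ACTION_LOOP;

	return SKETCH_ACTION_HUNG;
}

/* One hangcheck tick; returns true once the score reaches the threshold. */
static bool sketch_hangcheck(struct sketch_engine *e, bool seqno_advanced,
			     bool head_moved, bool subunits_stuck)
{
	assert(e != NULL);

	if (seqno_advanced) {
		/* Real progress: reduce the score by a large step, clamp at 0. */
		e->score -= SKETCH_SCORE_HUNG;
		if (e->score < 0)
			e->score = 0;
	} else if (sketch_head_stuck(head_moved, subunits_stuck) ==
		   SKETCH_ACTION_HUNG) {
		e->score += SKETCH_SCORE_HUNG;
	} else {
		e->score += SKETCH_SCORE_BUSY;
	}

	return e->score >= SKETCH_SCORE_RING_HUNG;
}
```

With these assumed numbers, a batch whose HEAD keeps moving but never advances the seqno is flagged on the 31st consecutive check, instead of only once HEAD wraps.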
-Chris
--
Chris Wilson, Intel Open Source Technology Centre