[Intel-gfx] [PATCH 2/6] drm/i915: Decouple hang detection from hangcheck period

Tue Nov 15 21:32:45 UTC 2016

On Tue, Nov 15, 2016 at 07:23:26PM +0200, Mika Kuoppala wrote:
> Chris Wilson <chris at chris-wilson.co.uk> writes:
> 
> > On Tue, Nov 15, 2016 at 04:36:32PM +0200, Mika Kuoppala wrote:
> >> +static bool
> >> +hangcheck_engine_stall(struct intel_engine_cs *engine,
> >> +		       struct intel_engine_hangcheck *hc)
> >> +{
> >> +	const unsigned long last_action = engine->hangcheck.action_timestamp;
> >>  
> >> -		memset(&engine->hangcheck.instdone, 0,
> >> -		       sizeof(engine->hangcheck.instdone));
> >> -		break;
> >> +	if (hc->action == HANGCHECK_ACTIVE_SEQNO ||
> >> +	    hc->action == HANGCHECK_IDLE)
> >> +		return false;
> >> +
> >> +	if (time_before(jiffies, last_action + HANGCHECK_HUNG_JIFFIES))
> >> +		return false;
> >> +
> >> +	if (time_before(jiffies, last_action + HANGCHECK_STUCK_JIFFIES))
> >> +		if (hc->action != HANGCHECK_HUNG)
> >> +			return false;
> >
> > This deserves a lot more explanation. Why two values? What are their
> > significance?
> >
> 
> #define DRM_I915_STUCK_PERIOD_SEC 24 /* No observed seqno progress */
> #define DRM_I915_HUNG_PERIOD_SEC 4 /* No observed seqno nor head progress */

They are very far away and so not helping to explain this code at all.

How about calling them NO_SEQNO_PROGRESS, NO_HEAD_PROGRESS ? And 24s
without a seqno progress is far too long. We set off alarms elsewhere
for much shorter unresponsive code.

> We allow bigger timeout if head is moving inside a request. THus two
> values. I tried to mimic the behaviour from old code.
> 
> > Do we want to be using jiffies? (Values are in seconds, so jiffie
> > resoluation makes sense, but add that as a comment somewhere.)
> 
> The defines as seconds and the subsequent jiffie counterparts in
> i915_drv.h not enough?

I was just thinking how more flexible ktime was in terms of convenience
api.

> 
> >
> >> +	return true;
> >> +}
> >> +
> >> +static struct intel_engine_cs *find_lra_engine(struct drm_i915_private *i915,
> >> +					       const unsigned int mask)
> >
> > What is lra?
> >
> 
> Least recently active. I will opencode the name.
> 
> >> +{
> >> +	struct intel_engine_cs *engine = NULL, *c;
> >> +	enum intel_engine_id id;
> >> +
> >> +	for_each_engine_masked(c, i915, mask, id) {
> >> +		if (engine == NULL) {
> >> +			engine = c;
> >> +			continue;
> >> +		}
> >> +
> >> +		if (time_before(c->hangcheck.action_timestamp,
> >> +				engine->hangcheck.action_timestamp))
> >> +			engine = c;
> >> +		else if (c->hangcheck.action_timestamp ==
> >> +			 engine->hangcheck.action_timestamp &&
> >> +			 c->hangcheck.seqno < engine->hangcheck.seqno)
> >> +			engine = c;
> >
> > Why is the last engine significant?
> 
> We blame the engine which had work and was inactive for the
> longest amount of time. 

This feels like a step backwards.

> > Why is just engine guilty?
> 
> Just one? As the stamps are valid cross reset, the next
> advacement after this hang will make progress show
> on this engine, and the next engine will get caught.
> 
> The hangcheck action is a filter to weed out those engines
> that can't possibly be culprit for hang.

If it hadn't advanced for umpteen seconds, it is not exactly innocenet
either (consider we take care not to blame those waiting on the guilty).

I thought moving forwards we wanted to treat the engines/reset with
greater indepdence whereas this adds global state.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre