[Intel-gfx] [PATCH 2/6] drm/i915: Decouple hang detection from hangcheck period

Mika Kuoppala mika.kuoppala at linux.intel.com
Tue Nov 15 17:23:26 UTC 2016


Chris Wilson <chris at chris-wilson.co.uk> writes:

> On Tue, Nov 15, 2016 at 04:36:32PM +0200, Mika Kuoppala wrote:
>> +static bool
>> +hangcheck_engine_stall(struct intel_engine_cs *engine,
>> +		       struct intel_engine_hangcheck *hc)
>> +{
>> +	const unsigned long last_action = engine->hangcheck.action_timestamp;
>>  
>> -		memset(&engine->hangcheck.instdone, 0,
>> -		       sizeof(engine->hangcheck.instdone));
>> -		break;
>> +	if (hc->action == HANGCHECK_ACTIVE_SEQNO ||
>> +	    hc->action == HANGCHECK_IDLE)
>> +		return false;
>> +
>> +	if (time_before(jiffies, last_action + HANGCHECK_HUNG_JIFFIES))
>> +		return false;
>> +
>> +	if (time_before(jiffies, last_action + HANGCHECK_STUCK_JIFFIES))
>> +		if (hc->action != HANGCHECK_HUNG)
>> +			return false;
>
> This deserves a lot more explanation. Why two values? What are their
> significance?
>

#define DRM_I915_STUCK_PERIOD_SEC 24 /* No observed seqno progress */
#define DRM_I915_HUNG_PERIOD_SEC 4 /* No observed seqno nor head progress */

We allow bigger timeout if head is moving inside a request. THus two
values. I tried to mimic the behaviour from old code.

> Do we want to be using jiffies? (Values are in seconds, so jiffie
> resoluation makes sense, but add that as a comment somewhere.)

The defines as seconds and the subsequent jiffie counterparts in
i915_drv.h not enough?

>
>> +	return true;
>> +}
>> +
>> +static struct intel_engine_cs *find_lra_engine(struct drm_i915_private *i915,
>> +					       const unsigned int mask)
>
> What is lra?
>

Least recently active. I will opencode the name.

>> +{
>> +	struct intel_engine_cs *engine = NULL, *c;
>> +	enum intel_engine_id id;
>> +
>> +	for_each_engine_masked(c, i915, mask, id) {
>> +		if (engine == NULL) {
>> +			engine = c;
>> +			continue;
>> +		}
>> +
>> +		if (time_before(c->hangcheck.action_timestamp,
>> +				engine->hangcheck.action_timestamp))
>> +			engine = c;
>> +		else if (c->hangcheck.action_timestamp ==
>> +			 engine->hangcheck.action_timestamp &&
>> +			 c->hangcheck.seqno < engine->hangcheck.seqno)
>> +			engine = c;
>
> Why is the last engine significant?

We blame the engine which had work and was inactive for the
longest amount of time. 

> Why is just engine guilty?

Just one? As the stamps are valid cross reset, the next
advacement after this hang will make progress show
on this engine, and the next engine will get caught.

The hangcheck action is a filter to weed out those engines
that can't possibly be culprit for hang.

So we aim to pinpoint only one engine with accuracy and
reset that to lean towards per engine resets.

-Mika

>
>
> Too many whys in this one.
> -Chris
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre


More information about the Intel-gfx mailing list