[Intel-gfx] [PATCH] drm/i915: Ignore stuck requests when considering hangs

Mika Kuoppala mika.kuoppala at linux.intel.com
Mon Aug 22 11:39:30 UTC 2016


Chris Wilson <chris at chris-wilson.co.uk> writes:

> If the engine isn't being retired (worker starvation?) then it is
> possible for us to repeatedly observe, between consecutive hangchecks,
> that the seqno on the ring is the same and that unretired requests
> remain. Ignore these completely and only regard the engine
> as busy for the purpose of hang detection (not stall detection) if there
> are outstanding breadcrumbs.
>
> In recent history we have looked at using both the request and seqno as
> indication of activity on the engine, but that was reduced to just
> inspecting seqno in commit cffa781e5907 ("drm/i915: Simplify check for
> idleness in hangcheck"). However, in commit dcff85c8443e ("drm/i915:
> Enable i915_gem_wait_for_idle() without holding struct_mutex"), I made
> the decision to use the new common lockless function, under the
> assumption that request retirement was more frequent than hangcheck and
> so we would not have a stuck busy check. The flaw there was in
> forgetting that we accumulate the hang score, and so successive checks
> seeing a stuck request, albeit with the GPU advancing elsewhere and so
> not necessarily the same stuck request, would eventually trigger the hang.
>
> Fixes: dcff85c8443e ("drm/i915: Enable i915_gem_wait_for_idle()...")
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala at intel.com>
> ---
>  drivers/gpu/drm/i915/i915_irq.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index ebb83d5a448b..7610eca4f687 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -3079,6 +3079,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>  		bool busy = intel_engine_has_waiter(engine);
>  		u64 acthd;
>  		u32 seqno;
> +		u32 submit;
>  
>  		semaphore_clear_deadlocks(dev_priv);
>  
> @@ -3094,9 +3095,10 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>  
>  		acthd = intel_engine_get_active_head(engine);
>  		seqno = intel_engine_get_seqno(engine);
> +		submit = READ_ONCE(engine->last_submitted_seqno);
>  
>  		if (engine->hangcheck.seqno == seqno) {
> -			if (!intel_engine_is_active(engine)) {
> +			if (i915_seqno_passed(seqno, submit)) {

The setting of busy could be moved into the inner scope.

Also, the check could be seqno == submit, with a warning if we see
the seqno on the engine past submit.

But the patch fixes what it says it does,

Reviewed-by: Mika Kuoppala <mika.kuoppala at intel.com>

>  				engine->hangcheck.action = HANGCHECK_IDLE;
>  				if (busy) {
>  					/* Safeguard against driver failure */
> -- 
> 2.9.3
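
For reference, the comparison under discussion can be sketched outside the
driver. This is a minimal standalone illustration, not the driver code:
seqno_passed() mirrors the wraparound-safe signed-difference idiom behind
i915_seqno_passed(), while classify() and its enum are hypothetical names
that show the stricter variant suggested in the review (equality means idle,
a seqno beyond the last submission is flagged as inconsistent).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Wraparound-safe test for "seq1 is at or after seq2".
 * The unsigned subtraction wraps modulo 2^32; reading the result as
 * signed gives the right ordering while the two values are less than
 * 2^31 apart.
 */
static inline bool seqno_passed(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) >= 0;
}

/* Hypothetical classification, for illustration only. */
enum engine_state {
	ENGINE_IDLE,         /* everything submitted has completed */
	ENGINE_BUSY,         /* submitted work is still outstanding */
	ENGINE_INCONSISTENT, /* engine seqno is ahead of the last submission */
};

static inline enum engine_state classify(uint32_t seqno, uint32_t submit)
{
	if (seqno == submit)
		return ENGINE_IDLE;
	if (seqno_passed(seqno, submit))
		return ENGINE_INCONSISTENT; /* the case that would merit a warning */
	return ENGINE_BUSY;
}
```

With the check as posted, both the equal case and the "past submit" case are
treated as idle; the sketch only separates them so the second could be
reported.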

