[Intel-gfx] [PATCH] drm/i915: Ignore stuck requests when considering hangs
Mika Kuoppala
mika.kuoppala at linux.intel.com
Mon Aug 22 11:39:30 UTC 2016
Chris Wilson <chris at chris-wilson.co.uk> writes:
> If the engine isn't being retired (worker starvation?) then it is
> possible for us to repeatedly observe, between consecutive
> hangchecks, that the seqno on the ring is unchanged while unretired
> requests remain. Ignore these completely and only regard the engine
> as busy for the purpose of hang detection (not stall detection) if there
> are outstanding breadcrumbs.
>
> In recent history we have looked at using both the request and seqno as
> indication of activity on the engine, but that was reduced to just
> inspecting seqno in commit cffa781e5907 ("drm/i915: Simplify check for
> idleness in hangcheck"). However, in commit dcff85c8443e ("drm/i915:
> Enable i915_gem_wait_for_idle() without holding struct_mutex"), I made
> the decision to use the new common lockless function, under the
> assumption that request retirement was more frequent than hangcheck and
> so we would not have a stuck busy check. The flaw there was in
> forgetting that we accumulate the hang score, and so successive checks
> seeing a stuck request, albeit with the GPU advancing elsewhere and so
> not necessarily the same stuck request, would eventually trigger the hang.
>
> Fixes: dcff85c8443e ("drm/i915: Enable i915_gem_wait_for_idle()...")
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala at intel.com>
> ---
> drivers/gpu/drm/i915/i915_irq.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index ebb83d5a448b..7610eca4f687 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -3079,6 +3079,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
> bool busy = intel_engine_has_waiter(engine);
> u64 acthd;
> u32 seqno;
> + u32 submit;
>
> semaphore_clear_deadlocks(dev_priv);
>
> @@ -3094,9 +3095,10 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>
> acthd = intel_engine_get_active_head(engine);
> seqno = intel_engine_get_seqno(engine);
> + submit = READ_ONCE(engine->last_submitted_seqno);
>
> if (engine->hangcheck.seqno == seqno) {
> - if (!intel_engine_is_active(engine)) {
> + if (i915_seqno_passed(seqno, submit)) {
The setting of busy could be moved into the inner scope.
Also, the check could be seqno == submit, with a warning if we see
the seqno on the engine past submit.
But the patch fixes what it says it fixes,
Reviewed-by: Mika Kuoppala <mika.kuoppala at intel.com>
> engine->hangcheck.action = HANGCHECK_IDLE;
> if (busy) {
> /* Safeguard against driver failure */
> --
> 2.9.3
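
For illustration, the idle test discussed above can be sketched in plain C outside the driver. This is only a sketch under assumptions: the sketch_* names and the struct are hypothetical stand-ins, not the driver's types, and the comparison uses the usual signed-difference idiom for 32-bit sequence numbers that wrap. It shows both the "passed" check from the patch and the stricter seqno == submit variant suggested in the review, which flags a hardware seqno ahead of the last submission as an anomaly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is at or after b" for 32-bit sequence numbers.
 * Hypothetical helper using the signed-difference idiom. */
static bool sketch_seqno_passed(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) >= 0;
}

/* Hypothetical stand-in for the per-engine state sampled by hangcheck. */
struct sketch_engine {
	uint32_t hw_seqno;        /* last seqno reported by the hardware */
	uint32_t last_submitted;  /* last seqno handed to the hardware */
};

/* Idle for hang detection: everything submitted has completed,
 * whether or not the requests have been retired yet. */
static bool sketch_engine_idle(const struct sketch_engine *e)
{
	return sketch_seqno_passed(e->hw_seqno, e->last_submitted);
}

/* Stricter variant from the review: idle only on equality, and report
 * an anomaly if the hardware seqno is ahead of the last submission. */
static bool sketch_engine_idle_strict(const struct sketch_engine *e,
				      bool *anomaly)
{
	*anomaly = e->hw_seqno != e->last_submitted &&
		   sketch_seqno_passed(e->hw_seqno, e->last_submitted);
	return e->hw_seqno == e->last_submitted;
}

/* Small walk-through of the cases, including seqno wraparound. */
static void sketch_demo(void)
{
	struct sketch_engine done = { 100, 100 };
	struct sketch_engine busy = { 99, 100 };
	struct sketch_engine wrapped_busy = { 0xfffffffeu, 0x00000002u };
	struct sketch_engine ahead = { 0x00000002u, 0xfffffffeu };
	bool anomaly;

	assert(sketch_engine_idle(&done));
	assert(!sketch_engine_idle(&busy));
	assert(!sketch_engine_idle(&wrapped_busy)); /* submit wrapped, hw has not */
	assert(sketch_engine_idle(&ahead));         /* "passed" accepts hw ahead */

	assert(sketch_engine_idle_strict(&done, &anomaly) && !anomaly);
	assert(!sketch_engine_idle_strict(&ahead, &anomaly) && anomaly);
}
```

Calling sketch_demo() runs through the cases. The difference between the two variants only shows up when the hardware seqno is ahead of the last submitted one, which is the situation the review suggests warning about.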