[Intel-gfx] [PATCH] drm/i915: Prevent runaway head from denying hangcheck

Chris Wilson chris at chris-wilson.co.uk
Fri Feb 19 14:24:10 UTC 2016


On Fri, Feb 19, 2016 at 04:09:03PM +0200, Mika Kuoppala wrote:
> If we have a runaway head moving out of the allocated address
> space, that space is mapped to point into the scratch page. The
> content of the scratch page is zero (MI_NOOP). This leads to the
> actual head proceeding unhindered towards the end of the address
> space, and with 64 bit vmas it is a long walk.
> 
> We could inspect ppgtts to see if acthd is in valid space. But
> that would need a lock over the active vma list, and we have tried
> very hard to keep hangcheck lockfree.
> 
> Take note of our current global highest vma address when objects
> are added to the active list, and check against this address when
> hangcheck is run. This is more coarse than per-ppgtt inspection,
> but it should work for finding a runaway head.
> 
> Testcase: igt/drv_hangman/ppgtt_walk
> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
> Signed-off-by: Mika Kuoppala <mika.kuoppala at intel.com>
> ---
>  drivers/gpu/drm/i915/i915_debugfs.c     | 2 ++
>  drivers/gpu/drm/i915/i915_gem.c         | 4 ++++
>  drivers/gpu/drm/i915/i915_irq.c         | 7 +++++++
>  drivers/gpu/drm/i915/intel_ringbuffer.h | 1 +
>  4 files changed, 14 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> index 9e19cf0e7075..b6735ae32997 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -1371,6 +1371,8 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
>  			   (long long)acthd[i]);
>  		seq_printf(m, "\tmax ACTHD = 0x%08llx\n",
>  			   (long long)ring->hangcheck.max_acthd);
> +		seq_printf(m, "\tmax vma = 0x%08llx\n",
> +			   (long long)ring->hangcheck.max_active_vma);
>  		seq_printf(m, "\tscore = %d\n", ring->hangcheck.score);
>  		seq_printf(m, "\taction = %d\n", ring->hangcheck.action);
>  
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index f68f34606f2f..331b7a3d4206 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2416,6 +2416,10 @@ void i915_vma_move_to_active(struct i915_vma *vma,
>  	list_move_tail(&obj->ring_list[ring->id], &ring->active_list);
>  	i915_gem_request_assign(&obj->last_read_req[ring->id], req);
>  
> +	if (vma->node.start + vma->node.size > ring->hangcheck.max_active_vma)
> +		ring->hangcheck.max_active_vma =
> +			vma->node.start + vma->node.size;

No.

> +
>  	list_move_tail(&vma->mm_list, &vma->vm->active_list);
>  }
>  
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 25a89373df63..e59817328971 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2996,7 +2996,13 @@ head_stuck(struct intel_engine_cs *ring, u64 acthd)
>  		       sizeof(ring->hangcheck.instdone));
>  
>  		if (acthd > ring->hangcheck.max_acthd) {
> +			u64 max_vma = READ_ONCE(ring->hangcheck.max_active_vma);
> +
>  			ring->hangcheck.max_acthd = acthd;
> +
> +			if (max_vma && acthd > max_vma)
> +				return HANGCHECK_HUNG;
> +
>  			return HANGCHECK_ACTIVE;

The issue is that the active-loop detection completely nullifies the DoS
detection. What I thought I sent in Dec/Jan was a tweak to always
increment for loops:

commit 7ea8bea0738d6a2449f1819d8ee6efd402ae7a6c
Author: Chris Wilson <chris at chris-wilson.co.uk>
Date:   Fri Jan 8 18:07:39 2016 +0000

    drm/i915/hangcheck: Prevent long walks across full-ppgtt
    
    With full-ppgtt, it takes the GPU an eon to traverse the entire 256TiB
    address space before a loop is detected. Under the current scheme, if
    ACTHD walks off the end of a batch buffer and into an empty address
    space, we "never" detect the hang. If we always increment the score
    while ACTHD is progressing, then we will eventually time out (after
    ~60s without advancing onto a new batch). To counteract this, increase
    the amount we reduce the score for good batches, so that only a series
    of almost-bad batches triggers a full reset.
    
    Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 8e449772b1bb..4fc4fbe4b798 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2961,21 +2961,15 @@ static enum intel_engine_hangcheck_action
 head_stuck(struct intel_engine_cs *ring, u64 acthd)
 {
        if (acthd != ring->hangcheck.acthd) {
-
                /* Clear subunit states on head movement */
                memset(ring->hangcheck.instdone, 0,
                       sizeof(ring->hangcheck.instdone));
 
-               if (acthd > ring->hangcheck.max_acthd) {
-                       ring->hangcheck.max_acthd = acthd;
-                       return HANGCHECK_ACTIVE;
-               }
-
                return HANGCHECK_ACTIVE_LOOP;
        }
 
        if (!subunits_stuck(ring))
-               return HANGCHECK_ACTIVE;
+               return HANGCHECK_ACTIVE_LOOP;
 
        return HANGCHECK_HUNG;
 }
@@ -3142,7 +3136,9 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
                         * attempts across multiple batches.
                         */
                        if (ring->hangcheck.score > 0)
-                               ring->hangcheck.score--;
+                               ring->hangcheck.score -= HUNG;
+                       if (ring->hangcheck.score < 0)
+                               ring->hangcheck.score = 0;
 
                        /* Clear head and subunit states on seqno movement */
                        ring->hangcheck.acthd = ring->hangcheck.max_acthd = 0;

-- 
Chris Wilson, Intel Open Source Technology Centre

