[Intel-gfx] [PATCH] drm/i915: Prevent runaway head from denying hangcheck
Chris Wilson
chris at chris-wilson.co.uk
Fri Feb 19 14:24:10 UTC 2016
On Fri, Feb 19, 2016 at 04:09:03PM +0200, Mika Kuoppala wrote:
> If we have runaway head moving out of allocated address space,
> that space is mapped to point into scratch page. The content of scratch
> page is is zero (MI_NOOP). This leads to actual head proceeding
> unhindered towards the end of the address space and with with 64 bit
> vmas it is a long walk.
>
> We could inspect ppgtts to see if acthd is on valid space. But
> that would need a lock over active vma list and we have tried very
> hard to keep hangcheck lockfree.
>
> Take note of our current global highest vma address, when objects
> are added to active list. And check against this address when
> hangcheck is run. This is more coarse than per ppgtt inspection,
> but it should work of finding runaway head.
>
> Testcase: igt/drv_hangman/ppgtt_walk
> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
> Signed-off-by: Mika Kuoppala <mika.kuoppala at intel.com>
> ---
> drivers/gpu/drm/i915/i915_debugfs.c | 2 ++
> drivers/gpu/drm/i915/i915_gem.c | 4 ++++
> drivers/gpu/drm/i915/i915_irq.c | 7 +++++++
> drivers/gpu/drm/i915/intel_ringbuffer.h | 1 +
> 4 files changed, 14 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> index 9e19cf0e7075..b6735ae32997 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -1371,6 +1371,8 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
> (long long)acthd[i]);
> seq_printf(m, "\tmax ACTHD = 0x%08llx\n",
> (long long)ring->hangcheck.max_acthd);
> + seq_printf(m, "\tmax vma = 0x%08llx\n",
> + (long long)ring->hangcheck.max_active_vma);
> seq_printf(m, "\tscore = %d\n", ring->hangcheck.score);
> seq_printf(m, "\taction = %d\n", ring->hangcheck.action);
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index f68f34606f2f..331b7a3d4206 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2416,6 +2416,10 @@ void i915_vma_move_to_active(struct i915_vma *vma,
> list_move_tail(&obj->ring_list[ring->id], &ring->active_list);
> i915_gem_request_assign(&obj->last_read_req[ring->id], req);
>
> + if (vma->node.start + vma->node.size > ring->hangcheck.max_active_vma)
> + ring->hangcheck.max_active_vma =
> + vma->node.start + vma->node.size;
No.
> +
> list_move_tail(&vma->mm_list, &vma->vm->active_list);
> }
>
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 25a89373df63..e59817328971 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2996,7 +2996,13 @@ head_stuck(struct intel_engine_cs *ring, u64 acthd)
> sizeof(ring->hangcheck.instdone));
>
> if (acthd > ring->hangcheck.max_acthd) {
> + u64 max_vma = READ_ONCE(ring->hangcheck.max_active_vma);
> +
> ring->hangcheck.max_acthd = acthd;
> +
> + if (max_vma && acthd > max_vma)
> + return HANGCHECK_HUNG;
> +
> return HANGCHECK_ACTIVE;
The issue is that the active-loop detection completely nullifies the DoS
detection, what I thought I sent in Dec/Jan was a tweak to always
increment for loops:
commit 7ea8bea0738d6a2449f1819d8ee6efd402ae7a6c
Author: Chris Wilson <chris at chris-wilson.co.uk>
Date: Fri Jan 8 18:07:39 2016 +0000
drm/i915/hangcheck: Prevent long walks across full-ppgtt
With full-ppgtt, it takes the GPU an eon to traverse the entire 256PiB
address space, so causing a loop to be detected. Under the current
scheme, if ACTHD walks off the end of a batch buffer and into an empty
address space, we "never" detect the hang. If we always increment the
score as the ACTHD is progressing then we will eventually timeout (after
~60s without advancing onto a new batch). To counter act this, increase
the amount we reduce the score for good batches, so that only a series
of almost-bad batches trigger a full reset.
Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 8e449772b1bb..4fc4fbe4b798 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2961,21 +2961,15 @@ static enum intel_engine_hangcheck_action
head_stuck(struct intel_engine_cs *ring, u64 acthd)
{
if (acthd != ring->hangcheck.acthd) {
-
/* Clear subunit states on head movement */
memset(ring->hangcheck.instdone, 0,
sizeof(ring->hangcheck.instdone));
- if (acthd > ring->hangcheck.max_acthd) {
- ring->hangcheck.max_acthd = acthd;
- return HANGCHECK_ACTIVE;
- }
-
return HANGCHECK_ACTIVE_LOOP;
}
if (!subunits_stuck(ring))
- return HANGCHECK_ACTIVE;
+ return HANGCHECK_ACTIVE_LOOP;
return HANGCHECK_HUNG;
}
@@ -3142,7 +3136,9 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
* attempts across multiple batches.
*/
if (ring->hangcheck.score > 0)
- ring->hangcheck.score--;
+ ring->hangcheck.score -= HUNG;
+ if (ring->hangcheck.score < 0)
+ ring->hangcheck.score = 0;
/* Clear head and subunit states on seqno movement */
ring->hangcheck.acthd = ring->hangcheck.max_acthd = 0;
--
Chris Wilson, Intel Open Source Technology Centre
More information about the Intel-gfx
mailing list