[Intel-gfx] [PATCH] drm/i915: Prevent runaway head from denying hangcheck

Mika Kuoppala mika.kuoppala at linux.intel.com
Fri Feb 19 14:09:03 UTC 2016


If a runaway head moves out of the allocated address space, it ends up
in space that is mapped to the scratch page. The content of the scratch
page is zero (MI_NOOP), so the actual head proceeds unhindered
towards the end of the address space, and with 64-bit vmas that is
a long walk.

We could inspect the ppgtts to see if acthd is in valid space, but
that would need a lock over the active vma list, and we have tried very
hard to keep hangcheck lock-free.

Take note of the current global highest vma address when objects
are added to the active list, and check against this address when
hangcheck is run. This is coarser than per-ppgtt inspection,
but it should be enough to find a runaway head.

Testcase: igt/drv_hangman/ppgtt_walk
Cc: Chris Wilson <chris at chris-wilson.co.uk>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
Signed-off-by: Mika Kuoppala <mika.kuoppala at intel.com>
---
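For review purposes, the bookkeeping described above can be sketched as a
standalone userspace program. This is only an illustration under stated
assumptions: struct hc_state, enum hc_action and the hc_*() helpers are
hypothetical stand-ins, not the i915 definitions, and HC_UNDECIDED merely
marks the point where the real head_stuck() would carry on with its
remaining checks.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver types; not the i915 definitions. */
enum hc_action { HC_UNDECIDED, HC_ACTIVE, HC_HUNG };

struct hc_state {
	uint64_t max_acthd;      /* highest head position seen so far */
	uint64_t max_active_vma; /* highest end address of any active vma, 0 if unknown */
};

/* Called when a vma becomes active: remember the highest end address. */
static void hc_note_active_vma(struct hc_state *hc, uint64_t start, uint64_t size)
{
	uint64_t end = start + size;

	if (end > hc->max_active_vma)
		hc->max_active_vma = end;
}

/* Called when the engine is found idle: forget the watermark. */
static void hc_on_idle(struct hc_state *hc)
{
	hc->max_active_vma = 0;
}

/*
 * Called from the periodic check. A head that advances is normally a sign
 * of progress, unless it has advanced past everything that was ever active.
 */
static enum hc_action hc_check_head(struct hc_state *hc, uint64_t acthd)
{
	if (acthd > hc->max_acthd) {
		hc->max_acthd = acthd;

		/* A zero watermark means "nothing recorded", so skip the test. */
		if (hc->max_active_vma && acthd > hc->max_active_vma)
			return HC_HUNG;

		return HC_ACTIVE;
	}

	/* Head did not advance; the real code goes on to other checks here. */
	return HC_UNDECIDED;
}
```

The watermark only ever grows while work is in flight and is cleared when
the engine goes idle, so the check needs no list walk and no lock, at the
cost of not noticing a head that strays into a hole below the watermark.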
 drivers/gpu/drm/i915/i915_debugfs.c     | 2 ++
 drivers/gpu/drm/i915/i915_gem.c         | 4 ++++
 drivers/gpu/drm/i915/i915_irq.c         | 7 +++++++
 drivers/gpu/drm/i915/intel_ringbuffer.h | 1 +
 4 files changed, 14 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 9e19cf0e7075..b6735ae32997 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1371,6 +1371,8 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
 			   (long long)acthd[i]);
 		seq_printf(m, "\tmax ACTHD = 0x%08llx\n",
 			   (long long)ring->hangcheck.max_acthd);
+		seq_printf(m, "\tmax vma = 0x%08llx\n",
+			   (long long)ring->hangcheck.max_active_vma);
 		seq_printf(m, "\tscore = %d\n", ring->hangcheck.score);
 		seq_printf(m, "\taction = %d\n", ring->hangcheck.action);
 
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index f68f34606f2f..331b7a3d4206 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2416,6 +2416,10 @@ void i915_vma_move_to_active(struct i915_vma *vma,
 	list_move_tail(&obj->ring_list[ring->id], &ring->active_list);
 	i915_gem_request_assign(&obj->last_read_req[ring->id], req);
 
+	if (vma->node.start + vma->node.size > ring->hangcheck.max_active_vma)
+		ring->hangcheck.max_active_vma =
+			vma->node.start + vma->node.size;
+
 	list_move_tail(&vma->mm_list, &vma->vm->active_list);
 }
 
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 25a89373df63..e59817328971 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2996,7 +2996,13 @@ head_stuck(struct intel_engine_cs *ring, u64 acthd)
 		       sizeof(ring->hangcheck.instdone));
 
 		if (acthd > ring->hangcheck.max_acthd) {
+			u64 max_vma = READ_ONCE(ring->hangcheck.max_active_vma);
+
 			ring->hangcheck.max_acthd = acthd;
+
+			if (max_vma && acthd > max_vma)
+				return HANGCHECK_HUNG;
+
 			return HANGCHECK_ACTIVE;
 		}
 
@@ -3107,6 +3113,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 		if (ring->hangcheck.seqno == seqno) {
 			if (ring_idle(ring, seqno)) {
 				ring->hangcheck.action = HANGCHECK_IDLE;
+				ring->hangcheck.max_active_vma = 0;
 
 				if (waitqueue_active(&ring->irq_queue)) {
 					/* Issue a wake-up to catch stuck h/w. */
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 566b0ae10ce0..9b61d9c6e6d1 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -89,6 +89,7 @@ enum intel_ring_hangcheck_action {
 struct intel_ring_hangcheck {
 	u64 acthd;
 	u64 max_acthd;
+	u64 max_active_vma;
 	u32 seqno;
 	int score;
 	enum intel_ring_hangcheck_action action;
-- 
2.5.0