[PATCH] radeon: Fix a false positive lockup after 10s of inactivity

Tue Jun 11 16:23:14 PDT 2013

If the device is idle for over ten seconds, then the next attempt to do
anything can race with the lockup detector and cause a bogus lockup
to be detected.

Oddly, the situation is well-described in the lockup detector's comments
and a fix is even described.  This patch implements that fix (and corrects
some typos in the description).

My system has been stable for about a week running this code.  Without this,
my screen would go blank every now and then and, when it came back, everything
would be remarkably slow (the latter is a separate bug).

Signed-off-by: Andy Lutomirski <luto at amacapital.net>
---

This may be -stable material.

 drivers/gpu/drm/radeon/radeon.h      |  1 +
 drivers/gpu/drm/radeon/radeon_ring.c | 23 ++++++++++++++++-------
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index 8263af3..9de5778 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -652,6 +652,7 @@ struct radeon_ring {
 	unsigned		ring_free_dw;
 	int			count_dw;
 	unsigned long		last_activity;
+	unsigned long		last_test_lockup;
 	unsigned		last_rptr;
 	uint64_t		gpu_addr;
 	uint32_t		align_mask;
diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c
index 1ef5eaa..fb7b3ea 100644
--- a/drivers/gpu/drm/radeon/radeon_ring.c
+++ b/drivers/gpu/drm/radeon/radeon_ring.c
@@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring)
  * have CP rptr to a different value of jiffies wrap around which will force
  * initialization of the lockup tracking informations.
  *
- * A possible false positivie is if we get call after while and last_cp_rptr ==
- * the current CP rptr, even if it's unlikely it might happen. To avoid this
- * if the elapsed time since last call is bigger than 2 second than we return
- * false and update the tracking information. Due to this the caller must call
- * radeon_ring_test_lockup several time in less than 2sec for lockup to be reported
- * the fencing code should be cautious about that.
+ * A possible false positive is if we get called after a while and
+ * last_cp_rptr == the current CP rptr, even if it's unlikely it might
+ * happen. To avoid this if the elapsed time since the last call is bigger
+ * than 2 second then we return false and update the tracking
+ * information. Due to this the caller must call radeon_ring_test_lockup
+ * more frequently than once every 2s when waiting.
  *
  * Caller should write to the ring to force CP to do something so we don't get
  * false positive when CP is just gived nothing to do.
@@ -560,10 +560,14 @@ void radeon_ring_lockup_update(struct radeon_ring *ring)
  **/
 bool radeon_ring_test_lockup(struct radeon_device *rdev, struct radeon_ring *ring)
 {
-	unsigned long cjiffies, elapsed;
+	unsigned long cjiffies, elapsed, last_test;
 	uint32_t rptr;
 
 	cjiffies = jiffies;
+
+	last_test = ring->last_test_lockup;
+	ring->last_test_lockup = cjiffies;
+
 	if (!time_after(cjiffies, ring->last_activity)) {
 		/* likely a wrap around */
 		radeon_ring_lockup_update(ring);
@@ -576,6 +580,11 @@ bool radeon_ring_test_lockup(struct radeon_device *rdev, struct radeon_ring *rin
 		radeon_ring_lockup_update(ring);
 		return false;
 	}
+	if (cjiffies - last_test > 2 * HZ) {
+		/* Possible race -- see comment above */
+		radeon_ring_lockup_update(ring);
+		return false;
+	}
 	elapsed = jiffies_to_msecs(cjiffies - ring->last_activity);
 	if (radeon_lockup_timeout && elapsed >= radeon_lockup_timeout) {
 		dev_err(rdev->dev, "GPU lockup CP stall for more than %lumsec\n", elapsed);
-- 
1.8.1.4