[PATCH] radeon: Fix a false positive lockup after 10s of inactivity
Andy Lutomirski
luto at amacapital.net
Tue Jun 11 16:23:14 PDT 2013
If the device is idle for over ten seconds, then the next attempt to do
anything can race with the lockup detector and cause a bogus lockup
to be detected.
Oddly, the situation is well-described in the lockup detector's comments
and a fix is even described. This patch implements that fix (and corrects
some typos in the description).
My system has been stable for about a week running this code. Without this,
my screen would go blank every now and then and, when it came back, everything
would be remarkably slow (the latter is a separate bug).
Signed-off-by: Andy Lutomirski <luto at amacapital.net>
---
This may be -stable material.
drivers/gpu/drm/radeon/radeon.h | 1 +
drivers/gpu/drm/radeon/radeon_ring.c | 23 ++++++++++++++++-------
2 files changed, 17 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index 8263af3..9de5778 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -652,6 +652,7 @@ struct radeon_ring {
unsigned ring_free_dw;
int count_dw;
unsigned long last_activity;
+ unsigned long last_test_lockup;
unsigned last_rptr;
uint64_t gpu_addr;
uint32_t align_mask;
diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c
index 1ef5eaa..fb7b3ea 100644
--- a/drivers/gpu/drm/radeon/radeon_ring.c
+++ b/drivers/gpu/drm/radeon/radeon_ring.c
@@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring)
* have CP rptr to a different value of jiffies wrap around which will force
* initialization of the lockup tracking informations.
*
- * A possible false positivie is if we get call after while and last_cp_rptr ==
- * the current CP rptr, even if it's unlikely it might happen. To avoid this
- * if the elapsed time since last call is bigger than 2 second than we return
- * false and update the tracking information. Due to this the caller must call
- * radeon_ring_test_lockup several time in less than 2sec for lockup to be reported
- * the fencing code should be cautious about that.
+ * A possible false positive is if we get called after a while and
+ * last_cp_rptr == the current CP rptr, even if it's unlikely it might
+ * happen. To avoid this if the elapsed time since the last call is bigger
+ * than 2 second then we return false and update the tracking
+ * information. Due to this the caller must call radeon_ring_test_lockup
+ * more frequently than once every 2s when waiting.
*
* Caller should write to the ring to force CP to do something so we don't get
* false positive when CP is just gived nothing to do.
@@ -560,10 +560,14 @@ void radeon_ring_lockup_update(struct radeon_ring *ring)
**/
bool radeon_ring_test_lockup(struct radeon_device *rdev, struct radeon_ring *ring)
{
- unsigned long cjiffies, elapsed;
+ unsigned long cjiffies, elapsed, last_test;
uint32_t rptr;
cjiffies = jiffies;
+
+ last_test = ring->last_test_lockup;
+ ring->last_test_lockup = cjiffies;
+
if (!time_after(cjiffies, ring->last_activity)) {
/* likely a wrap around */
radeon_ring_lockup_update(ring);
@@ -576,6 +580,11 @@ bool radeon_ring_test_lockup(struct radeon_device *rdev, struct radeon_ring *rin
radeon_ring_lockup_update(ring);
return false;
}
+ if (cjiffies - last_test > 2 * HZ) {
+ /* Possible race -- see comment above */
+ radeon_ring_lockup_update(ring);
+ return false;
+ }
elapsed = jiffies_to_msecs(cjiffies - ring->last_activity);
if (radeon_lockup_timeout && elapsed >= radeon_lockup_timeout) {
dev_err(rdev->dev, "GPU lockup CP stall for more than %lumsec\n", elapsed);
--
1.8.1.4
More information about the dri-devel
mailing list