[PATCH 1/1] drm/i915/active: Fix potential reuse of non-idle barrier as fence tracker

Janusz Krzysztofik janusz.krzysztofik at linux.intel.com
Thu Feb 9 15:31:43 UTC 2023


Barriers are now deleted from a barrier tasks list by temporarily removing
the list content, traversing that content with skip over the node to be
deleted, then adding the modified content back to the list.  Since that
complex operation is not serialized with other concurrent uses of the
list, including similar barrier deletions, functions that depend on the
list being either empty or not empty can take wrong decisions.

On the other hand, when adding a request to a composite tracker, we try to
use an existing tracker already registered with that composite tracker.
The tracker we obtain can already track another fence, can be an idle
barrier, or an active barrier.  The first two cases seem easy to handle
and we seem to do that correctly.  In the last case, we attempt to replace
the active barrier with our request.  However, when trying to delete that
barrier from a list of barrier tasks it belongs to, we ignore return value
from that operation, which informs us whether the deletion succeeded or
not, and we proceed even if it fails.

A failed deletion can be a side effect of the way we have it implemented
-- another thread can temporarily empty the list before we do that.  If we
ignore such failure, that other thread can then add our fence tracker back
to the barrier tasks list.  Since the same structure field of the tracker
is used as a list node with both barrier tasks lists and fence callback
lists, list corruption occurs.  However, list corruptions were still
observed when running the user worload on top of an experimental patch
that serialized all operations on barrier tasks lists with a spinlock.
Then, other race scenarios leading to list corruptions must exist.

A failed deletion can also mean that another thread managed to acquire our
selected tracker for its fence tracking, after we verified that it was an
active barrier but before we attempted to delete it from the barriers list
ourselves -- we don't serialize such concurrent complex operations.  Since
we ignore the failure, we can get a tracker that either already is, or is
not yet tracking a foreign fence.  In the former case we risk overwriting
the traker's pointer to that fence with NULL while marking the barier as
idle, then not processing that foreign fence as required when replacing a
tracked fence, including broken handling of the tracker's membership in
fence callback lists.  In the latter case, the other thread can
potentially corrupt the list membership in a similar way.

If we start respecting the return code from barrier deletion, which seems
required for effectively fixing the issue, we need to handle carefully
those few above described cases.

Respect results of barrier deletion attempts -- mark the barrier as idle
only after successfully deleted from the list.  Then, before proceeding
with setting our fence as the one currently tracked, make sure that the
tracker we've got is not a non-idle barrier, and try to acquire another
one if that check fails.

Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik at linux.intel.com>
---
 drivers/gpu/drm/i915/i915_active.c | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
index 7412abf166a8c..f9282b8c87c1c 100644
--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -422,12 +422,12 @@ replace_barrier(struct i915_active *ref, struct i915_active_fence *active)
 	 * we can use it to substitute for the pending idle-barrer
 	 * request that we want to emit on the kernel_context.
 	 */
-	__active_del_barrier(ref, node_from_active(active));
-	return true;
+	return __active_del_barrier(ref, node_from_active(active));
 }
 
 int i915_active_add_request(struct i915_active *ref, struct i915_request *rq)
 {
+	u64 idx = i915_request_timeline(rq)->fence_context;
 	struct dma_fence *fence = &rq->fence;
 	struct i915_active_fence *active;
 	int err;
@@ -437,16 +437,19 @@ int i915_active_add_request(struct i915_active *ref, struct i915_request *rq)
 	if (err)
 		return err;
 
-	active = active_instance(ref, i915_request_timeline(rq)->fence_context);
-	if (!active) {
-		err = -ENOMEM;
-		goto out;
-	}
+	do {
+		active = active_instance(ref, idx);
+		if (!active) {
+			err = -ENOMEM;
+			goto out;
+		}
+
+		if (replace_barrier(ref, active)) {
+			RCU_INIT_POINTER(active->fence, NULL);
+			atomic_dec(&ref->count);
+		}
+	} while (is_barrier(active));
 
-	if (replace_barrier(ref, active)) {
-		RCU_INIT_POINTER(active->fence, NULL);
-		atomic_dec(&ref->count);
-	}
 	if (!__i915_active_fence_set(active, fence))
 		__i915_active_acquire(ref);
 
-- 
2.25.1



More information about the Intel-gfx-trybot mailing list