[PATCH] drm/ttm: remove fence_lock

Maarten Lankhorst maarten.lankhorst at canonical.com
Tue Mar 25 07:23:11 PDT 2014


Hey,

On 21-03-14 14:04, Thomas Hellstrom wrote:
> On 03/21/2014 01:12 PM, Maarten Lankhorst wrote:
>> Hey,
>>
>>> On 21-03-14 09:27, Thomas Hellstrom wrote:
>>> On 03/21/2014 12:55 AM, Dave Airlie wrote:
>>>> On Fri, Oct 19, 2012 at 3:04 AM, Jerome Glisse <j.glisse at gmail.com>
>>>> wrote:
>>>>> On Thu, Oct 18, 2012 at 06:43:40PM +0200, Thomas Hellstrom wrote:
>>>>>> On 10/18/2012 04:45 PM, Maarten Lankhorst wrote:
>>>>>>> On 18-10-12 13:55, Thomas Hellstrom wrote:
>>>>>>>> On 10/18/2012 01:38 PM, Maarten Lankhorst wrote:
>>>>>>>>> On 18-10-12 13:02, Thomas Hellstrom wrote:
>>>>>>>>>> On 10/18/2012 10:37 AM, Maarten Lankhorst wrote:
>>>>>>>>>>> Hey,
>>>>>>>>>>>
>>>>>>>>>>> On 18-10-12 09:59, Thomas Hellstrom wrote:
>>>>>>>>>>>> On 10/18/2012 09:28 AM, Thomas Hellstrom wrote:
>>>>>>>>>>>>> Hi, Maarten,
>>>>>>>>>>>>>
>>>>>>>>>>>>> As you know I have been having my doubts about this change.
>>>>>>>>>>>>> To me it seems insane to be forced to read the fence
>>>>>>>>>>>>> pointer under a
>>>>>>>>>>>>> reserved lock, simply because when you take the reserve
>>>>>>>>>>>>> lock, another
>>>>>>>>>>>>> process may have it and there is a substantial chance that
>>>>>>>>>>>>> that process
>>>>>>>>>>>>> will also be waiting for idle while holding the reserve lock.
>>>>>>>>>>> I think it makes perfect sense; the only time you want to
>>>>>>>>>>> read the fence is when you want to change the members
>>>>>>>>>>> protected by the reservation.
>>>>>>>>>> No, that's not true. A typical case (or the only case)
>>>>>>>>>> is where you want to do a map with no_wait semantics. You will
>>>>>>>>>> want
>>>>>>>>>> to be able to access a buffer's results even if the eviction code
>>>>>>>>>> is about to schedule an unbind from the GPU, and have the buffer
>>>>>>>>>> reserved?
>>>>>>>>> Well either block on reserve or return -EBUSY if reserved,
>>>>>>>>> presumably the
>>>>>>>>> former..
>>>>>>>>>
>>>>>>>>> ttm_bo_vm_fault does the latter already, anyway
>>>>>>>> ttm_bo_vm_fault only trylocks to avoid a deadlock with mmap_sem,
>>>>>>>> it's really
>>>>>>>> a waiting reserve but for different reasons. Typically a
>>>>>>>> user-space app will keep
>>>>>>>> asynchronous maps to TTM during a buffer lifetime, and the
>>>>>>>> buffer lifetime may
>>>>>>>> be long as user-space apps keep caches.
>>>>>>>> That's not the same as accessing a buffer after the GPU is done
>>>>>>>> with it.
>>>>>>> Ah indeed.
>>>>>>>>> You don't need to hold the reservation while performing the
>>>>>>>>> wait itself though,
>>>>>>>>> you could check if ttm_bo_wait(no_wait_gpu = true) returns
>>>>>>>>> -EBUSY, and if so
>>>>>>>>> take a ref to the sync_obj member and then wait after
>>>>>>>>> unreserving. You won't
>>>>>>>>> reset sync_obj member to NULL in that case, but that should be
>>>>>>>>> harmless.
>>>>>>>>> This will allow you to keep the reservations fast and short.
>>>>>>>>> Maybe a helper
>>>>>>>>> would be appropriate for this since radeon and nouveau both
>>>>>>>>> seem to do this.
>>>>>>>>>
>>>>>>>> The problem is that as long as other users are waiting for idle
>>>>>>>> with reservation
>>>>>>>> held, for example to switch GPU engine or to delete a GPU bind,
>>>>>>>> waiting
>>>>>>>> for reserve will in many case mean wait for GPU.
>>>>>>> This example sounds inefficient; I know nouveau can do this, but
>>>>>>> this essentially just stalls the GPU entirely. I guess you mean
>>>>>>> functions like nouveau_gem_object_close?
>>>>>>> It wouldn't surprise me if performance in nouveau could be improved
>>>>>>> simply by fixing those cases up instead, since it also stalls the
>>>>>>> application completely for other uses.
>>>>>>>
>>>>>> Please see the Nouveau cpu_prep patch as well.
>>>>>>
>>>>>> While there are a number of cases that can be fixed up, also in
>>>>>> Radeon, there's no way around the fact that reservation is a
>>>>>> heavyweight lock that, particularly on simpler hardware without
>>>>>> support for fence ordering with barriers and / or "semaphores" and
>>>>>> without accelerated eviction, will be held while waiting for idle.
>>>>>>
>>>>>> As such, it is unsuitable to protect read access to the fence
>>>>>> pointer. If the intention is to keep a single fence pointer
>>>>>> I think the best solution is to allow reading the fence pointer
>>>>>> outside reservation, but make sure this can be done
>>>>>> atomically. If the intention is to protect an array or list of fence
>>>>>> pointers, I think a spinlock is the better solution.
>>>>>>
>>>>>> /Thomas
>>>>> Just wanted to point out that, like Thomas, I am concerned about
>>>>> having to have the object reserved when waiting on its associated
>>>>> fence. I fear it will hurt us somehow.
>>>>>
>>>>> I will try to spend a couple of days stress testing your branch on
>>>>> radeon, trying to see if it hurts performance in any way with current
>>>>> use cases.
>>>>>
>>>> I've been trying to figure out what to do with Maarten's patches going
>>>> forward, since they are essential for all kinds of SoC people.
>>>>
>>>> However, I'm still not 100% sure I believe that the problem is anything
>>>> more than a micro-optimisation, and possibly a premature one at that.
>>>> This needs someone to start suggesting benchmarks we can run and a
>>>> driver set to run them on; otherwise I'm starting to tend towards the
>>>> view that we are talking about an optimisation we can fix later.
>>>>
>>>> The other option is to somehow merge this stuff behind a flag that
>>>> allows us to test it using a command line option, but I don't think
>>>> that is sane either.
>>>>
>>>> So Maarten, Jerome, Thomas, please start discussing actual real-world
>>>> tests you think merging this code will affect, and then I can make a
>>>> better-informed decision.
>>>>
>>>> Dave.
>>> Dave,
>>>
>>> This is IMHO a fundamental design discussion, not a micro-optimization.
>>>
>>> I'm pretty sure all sorts of horrendous things could be done to the DRM
>>> design without affecting real world application performance.
>>>
>>> In TTM, data protection is primarily done with spinlocks. This serves
>>> two purposes.
>>>
>>> a) You can't unnecessarily hold a data protection lock while waiting for
>>> the GPU, which is typically a very stupid thing to do (perhaps not so in
>>> this particular case), but when the sleep-while-atomic or locking
>>> inversion kernel warning turns up, that should at least ring a bell.
>>> Trying to implement dma-buf fencing using the TTM fencing callbacks will
>>> AFAICT cause a locking inversion.
>>>
>>> b) Spinlocks essentially go away on UP systems. The whole reservation
>>> path was essentially lock-free on UP systems before the dma-buf
>>> integration, with very few atomic locking operations even on SMP
>>> systems. This was all prompted by a test some years ago (by Eric Anholt
>>> IIRC) showing that locking operations turned up quite high in Intel
>>> driver profiling.
>>>
>>> If we start protecting data with reservations while we also export the
>>> reservation locks, so that people find them a convenient "serialize work
>>> on this buffer" lock, all kinds of interesting things might happen, and
>>> we might end up in a situation similar to protecting everything with
>>> struct_mutex.
>>>
>>> This is why I dislike this change. In particular when there is a very
>>> simple remedy.
>>>
>>> But if I can get an ACK to convert the reservation object fence pointers
>>> and the associated operations on them to be RCU-safe when I have some
>>> time left, I'd be prepared to accept this patch series in its current
>>> state.
>> RCU could be a useful way to deal with this. But in my case I've shown
>> there are very few places where it's needed; core TTM does not need it
>> at all. This is why I wanted to leave it to individual drivers to
>> implement it.
>>
>> I think using kfree_rcu for freeing in the driver itself should be
>> enough, and obtaining a reference in the driver would require something
>> like this, similar to what's in rcuref.txt:
>>
>> rcu_read_lock();
>> f = rcu_dereference(bo->fence);
>> if (f && !kref_get_unless_zero(&f->kref))
>>         f = NULL;
>> rcu_read_unlock();
>>
>> if (f) {
>>         // do stuff here
>>         ...
>>
>>         // free fence
>>         kref_put(&f->kref, fence_put_with_kfree_rcu);
>> }
>>
>> Am I wrong, or is something like this enough to make sure the fence is
>> still alive?
> No, you're correct.
>
> And a bo check for idle would amount to this (since we don't care if the
> fence refcount is zero):
>
> rcu_read_lock();
> f = rcu_dereference(bo->fence);
> signaled = !f || f->signaled;
> rcu_read_unlock();
>
> /Thomas
>
This is very easy to implement when there is only one fence slot, with bo->fence being equal to reservation_object->fence_excl.
It starts to break down slightly when I implement it on top of shared fences.
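For reference, the kind of lockless check I have in mind on top of the shared array would be roughly the sketch below. It is untested, the helper name is made up, it mirrors the dma_buf_poll() changes in the diff that follows, and it assumes the fence_is_signaled() helper from the fence series:

static bool reservation_object_test_signaled(struct reservation_object *obj)
{
	struct reservation_object_shared *shared;
	struct fence *f;
	unsigned i, shared_count;
	bool signaled = true;

	rcu_read_lock();

	f = rcu_dereference(obj->fence_excl);
	if (f && kref_get_unless_zero(&f->refcount)) {
		signaled = fence_is_signaled(f);
		fence_put(f);
	}

	shared = rcu_dereference(obj->shared);
	shared_count = shared ? ACCESS_ONCE(shared->count) : 0;

	for (i = 0; signaled && i < shared_count; ++i) {
		f = ACCESS_ONCE(shared->fence[i]);

		/*
		 * A refcount that already hit zero means the fence is on
		 * its way out; the sketch skips it, a real user might want
		 * to force a recheck instead, like dma_buf_poll() does.
		 */
		if (f && kref_get_unless_zero(&f->refcount)) {
			signaled = fence_is_signaled(f);
			fence_put(f);
		}
	}

	rcu_read_unlock();
	return signaled;
}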

My diff is against git://git.linaro.org/people/sumit.semwal/linux-3.x.git for-next-fences
shared_max_fence is kept in reservation_object to avoid reallocating in reservation_object_reserve_shared_fence
every time the same reservation_object grows beyond 4 shared fences; if that happens once, it's likely to happen again on the same object.
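The write side stays under the ww_mutex. A caller would do roughly the following (hypothetical helper, not part of the diff); reservation_object_reserve_shared_fence() is the only step that can fail, so the allocation happens before committing to anything:

static int bo_add_shared_fence(struct reservation_object *obj,
			       struct fence *fence)
{
	int ret;

	ww_mutex_lock(&obj->lock, NULL);

	/*
	 * May grow the shared array; shared_max_fence starts at 4 and
	 * doubles whenever the array fills up.
	 */
	ret = reservation_object_reserve_shared_fence(obj);
	if (!ret)
		reservation_object_add_shared_fence(obj, fence);

	ww_mutex_unlock(&obj->lock);
	return ret;
}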

Anyhow, does the below look sane to you? This has only been tested by my compiler; I haven't checked whether it boots.
---
commit c73b87d33fd08f7c1b0a381b08b3626128f8f3b8
Author: Maarten Lankhorst <maarten.lankhorst at canonical.com>
Date:   Tue Mar 25 15:10:21 2014 +0100

     add rcu to fence HACK WIP DO NOT USE KILLS YOUR POT PLANTS PETS AND FAMILY

diff --git a/drivers/base/dma-buf.c b/drivers/base/dma-buf.c
index 96338bf7f457..0a88c10b3db9 100644
--- a/drivers/base/dma-buf.c
+++ b/drivers/base/dma-buf.c
@@ -134,7 +134,10 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
  {
  	struct dma_buf *dmabuf;
  	struct reservation_object *resv;
+	struct reservation_object_shared *shared;
+	struct fence *fence_excl;
  	unsigned long events;
+	unsigned shared_count;
  
  	dmabuf = file->private_data;
  	if (!dmabuf || !dmabuf->resv)
@@ -148,14 +151,18 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
  	if (!events)
  		return 0;
  
-	ww_mutex_lock(&resv->lock, NULL);
+	rcu_read_lock();
  
-	if (resv->fence_excl && (!(events & POLLOUT) ||
-				 resv->fence_shared_count == 0)) {
+	shared = rcu_dereference(resv->shared);
+	fence_excl = rcu_dereference(resv->fence_excl);
+	shared_count = shared ? ACCESS_ONCE(shared->count) : 0;
+
+	if (fence_excl && (!(events & POLLOUT) ||
+				 (!shared || shared_count == 0))) {
  		struct dma_buf_poll_cb_t *dcb = &dmabuf->cb_excl;
  		unsigned long pevents = POLLIN;
  
-		if (resv->fence_shared_count == 0)
+		if (!shared || shared_count == 0)
  			pevents |= POLLOUT;
  
  		spin_lock_irq(&dmabuf->poll.lock);
@@ -167,19 +174,26 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
  		spin_unlock_irq(&dmabuf->poll.lock);
  
  		if (events & pevents) {
-			if (!fence_add_callback(resv->fence_excl,
-						&dcb->cb, dma_buf_poll_cb))
+			if (!kref_get_unless_zero(&fence_excl->refcount)) {
+				/* force a recheck */
+				events &= ~pevents;
+				dma_buf_poll_cb(NULL, &dcb->cb);
+			} else if (!fence_add_callback(fence_excl, &dcb->cb,
+						       dma_buf_poll_cb)) {
  				events &= ~pevents;
-			else
+				fence_put(fence_excl);
+			} else {
  				/*
  				 * No callback queued, wake up any additional
  				 * waiters.
  				 */
+				fence_put(fence_excl);
  				dma_buf_poll_cb(NULL, &dcb->cb);
+			}
  		}
  	}
  
-	if ((events & POLLOUT) && resv->fence_shared_count > 0) {
+	if ((events & POLLOUT) && shared && shared_count > 0) {
  		struct dma_buf_poll_cb_t *dcb = &dmabuf->cb_shared;
  		int i;
  
@@ -194,20 +208,34 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
  		if (!(events & POLLOUT))
  			goto out;
  
-		for (i = 0; i < resv->fence_shared_count; ++i)
-			if (!fence_add_callback(resv->fence_shared[i],
-						&dcb->cb, dma_buf_poll_cb)) {
+		for (i = 0; i < shared_count; ++i) {
+			struct fence *fence = ACCESS_ONCE(shared->fence[i]);
+			if (!kref_get_unless_zero(&fence->refcount)) {
+				/*
+				 * The fence refcount dropped to zero, so the
+				 * fence is being freed from under us.
+				 * Call dma_buf_poll_cb and force a recheck!
+				 */
  				events &= ~POLLOUT;
+				dma_buf_poll_cb(NULL, &dcb->cb);
  				break;
  			}
+			if (!fence_add_callback(fence, &dcb->cb,
+						dma_buf_poll_cb)) {
+				fence_put(fence);
+				events &= ~POLLOUT;
+				break;
+			}
+			fence_put(fence);
+		}
  
  		/* No callback queued, wake up any additional waiters. */
-		if (i == resv->fence_shared_count)
+		if (i == shared_count)
  			dma_buf_poll_cb(NULL, &dcb->cb);
  	}
  
  out:
-	ww_mutex_unlock(&resv->lock);
+	rcu_read_unlock();
  	return events;
  }
  
diff --git a/drivers/base/fence.c b/drivers/base/fence.c
index 8fff13fb86cf..be03a9e8ad8b 100644
--- a/drivers/base/fence.c
+++ b/drivers/base/fence.c
@@ -170,7 +170,7 @@ void release_fence(struct kref *kref)
  	if (fence->ops->release)
  		fence->ops->release(fence);
  	else
-		kfree(fence);
+		kfree_rcu(fence, rcu);
  }
  EXPORT_SYMBOL(release_fence);
  
diff --git a/include/linux/fence.h b/include/linux/fence.h
index 65f2a01ee7e4..d19620405c34 100644
--- a/include/linux/fence.h
+++ b/include/linux/fence.h
@@ -40,6 +40,7 @@ struct fence_cb;
   * struct fence - software synchronization primitive
   * @refcount: refcount for this fence
   * @ops: fence_ops associated with this fence
+ * @rcu: used for releasing fence with kfree_rcu
   * @cb_list: list of all callbacks to call
   * @lock: spin_lock_irqsave used for locking
   * @context: execution context this fence belongs to, returned by
@@ -73,6 +74,7 @@ struct fence_cb;
  struct fence {
  	struct kref refcount;
  	const struct fence_ops *ops;
+	struct rcu_head rcu;
  	struct list_head cb_list;
  	spinlock_t *lock;
  	unsigned context, seqno;
diff --git a/include/linux/reservation.h b/include/linux/reservation.h
index f3f57460a205..91576fabafdb 100644
--- a/include/linux/reservation.h
+++ b/include/linux/reservation.h
@@ -42,6 +42,7 @@
  #include <linux/ww_mutex.h>
  #include <linux/fence.h>
  #include <linux/slab.h>
+#include <linux/rcupdate.h>
  
  extern struct ww_class reservation_ww_class;
  
@@ -49,8 +50,12 @@ struct reservation_object {
  	struct ww_mutex lock;
  
  	struct fence *fence_excl;
-	struct fence **fence_shared;
-	u32 fence_shared_count, fence_shared_max;
+	u32 shared_max_fence;
+	struct reservation_object_shared {
+		struct rcu_head rcu;
+		u32 count;
+		struct fence *fence[];
+	} *shared;
  };
  
  static inline void
@@ -58,8 +63,8 @@ reservation_object_init(struct reservation_object *obj)
  {
  	ww_mutex_init(&obj->lock, &reservation_ww_class);
  
-	obj->fence_shared_count = obj->fence_shared_max = 0;
-	obj->fence_shared = NULL;
+	obj->shared = NULL;
+	obj->shared_max_fence = 4;
  	obj->fence_excl = NULL;
  }
  
@@ -70,11 +75,97 @@ reservation_object_fini(struct reservation_object *obj)
  
  	if (obj->fence_excl)
  		fence_put(obj->fence_excl);
-	for (i = 0; i < obj->fence_shared_count; ++i)
-		fence_put(obj->fence_shared[i]);
-	kfree(obj->fence_shared);
+	if (obj->shared) {
+		for (i = 0; i < obj->shared->count; ++i)
+			fence_put(obj->shared->fence[i]);
+
+		/*
+		 * This object should be dead and all references must have
+		 * been released to it, so no need to free with rcu.
+		 */
+		kfree(obj->shared);
+	}
  
  	ww_mutex_destroy(&obj->lock);
  }
  
+/*
+ * Reserve space to add a shared fence to a reservation_object,
+ * must be called with obj->lock held.
+ */
+static inline int
+reservation_object_reserve_shared_fence(struct reservation_object *obj)
+{
+	struct reservation_object_shared *shared, *old;
+	u32 max = obj->shared_max_fence;
+
+	if (obj->shared) {
+		if (obj->shared->count < max)
+			return 0;
+		max *= 2;
+	}
+
+	shared = kmalloc(offsetof(typeof(*shared), fence[max]), GFP_KERNEL);
+	if (!shared)
+		return -ENOMEM;
+	old = obj->shared;
+
+	if (old) {
+		shared->count = old->count;
+		memcpy(shared->fence, old->fence, old->count * sizeof(*old->fence));
+	} else {
+		shared->count = 0;
+	}
+	rcu_assign_pointer(obj->shared, shared);
+	obj->shared_max_fence = max;
+	if (old) kfree_rcu(old, rcu);
+	return 0;
+}
+
+/*
+ * Add a fence to a shared slot, obj->lock must be held, and
+ * reservation_object_reserve_shared_fence has been called.
+ */
+
+static inline void
+reservation_object_add_shared_fence(struct reservation_object *obj,
+				    struct fence *fence)
+{
+	unsigned i;
+
+	BUG_ON(obj->shared->count == obj->shared_max_fence);
+	fence_get(fence);
+
+	for (i = 0; i < obj->shared->count; ++i)
+		if (obj->shared->fence[i]->context == fence->context) {
+			struct fence *old = obj->shared->fence[i];
+			rcu_assign_pointer(obj->shared->fence[i], fence);
+			fence_put(old);
+			return;
+		}
+
+	obj->shared->fence[obj->shared->count] = fence;
+	smp_wmb();
+	obj->shared->count++;
+}
+
+/*
+ * May be called after adding an exclusive to wipe all shared fences.
+ */
+
+static inline void
+reservation_object_clear_shared(struct reservation_object *obj)
+{
+	struct reservation_object_shared *old = obj->shared;
+	unsigned i;
+
+	if (!old)
+		return;
+
+	rcu_assign_pointer(obj->shared, NULL);
+	for (i = 0; i < old->count; ++i)
+		fence_put(old->fence[i]);
+	kfree_rcu(old, rcu);
+}
+
  #endif /* _LINUX_RESERVATION_H */


