[PATCH] drm/ttm: remove fence_lock
Maarten Lankhorst
maarten.lankhorst at canonical.com
Tue Mar 25 07:23:11 PDT 2014
Hey,
On 21-03-14 14:04, Thomas Hellstrom wrote:
> On 03/21/2014 01:12 PM, Maarten Lankhorst wrote:
>> Hey,
>>
>> On 21-03-14 09:27, Thomas Hellstrom wrote:
>>> On 03/21/2014 12:55 AM, Dave Airlie wrote:
>>>> On Fri, Oct 19, 2012 at 3:04 AM, Jerome Glisse <j.glisse at gmail.com>
>>>> wrote:
>>>>> On Thu, Oct 18, 2012 at 06:43:40PM +0200, Thomas Hellstrom wrote:
>>>>>> On 10/18/2012 04:45 PM, Maarten Lankhorst wrote:
>>>>>>> On 18-10-12 13:55, Thomas Hellstrom wrote:
>>>>>>>> On 10/18/2012 01:38 PM, Maarten Lankhorst wrote:
>>>>>>>>> On 18-10-12 13:02, Thomas Hellstrom wrote:
>>>>>>>>>> On 10/18/2012 10:37 AM, Maarten Lankhorst wrote:
>>>>>>>>>>> Hey,
>>>>>>>>>>>
>>>>>>>>>>> On 18-10-12 09:59, Thomas Hellstrom wrote:
>>>>>>>>>>>> On 10/18/2012 09:28 AM, Thomas Hellstrom wrote:
>>>>>>>>>>>>> Hi, Maarten,
>>>>>>>>>>>>>
>>>>>>>>>>>>> As you know I have been having my doubts about this change.
>>>>>>>>>>>>> To me it seems insane to be forced to read the fence
>>>>>>>>>>>>> pointer under a
>>>>>>>>>>>>> reserved lock, simply because when you take the reserve
>>>>>>>>>>>>> lock, another
>>>>>>>>>>>>> process may have it and there is a substantial chance that
>>>>>>>>>>>>> that process
>>>>>>>>>>>>> will also be waiting for idle while holding the reserve lock.
>>>>>>>>>>> I think it makes perfect sense: the only time you want to read
>>>>>>>>>>> the fence is when you want to change the members protected by
>>>>>>>>>>> the reservation.
>>>>>>>>>> No, that's not true. A typical case (or the only case)
>>>>>>>>>> is where you want to do a map with no_wait semantics. You will
>>>>>>>>>> want
>>>>>>>>>> to be able to access a buffer's results even if the eviction code
>>>>>>>>>> is about to schedule an unbind from the GPU, and have the buffer
>>>>>>>>>> reserved?
>>>>>>>>> Well either block on reserve or return -EBUSY if reserved,
>>>>>>>>> presumably the
>>>>>>>>> former..
>>>>>>>>>
>>>>>>>>> ttm_bo_vm_fault does the latter already, anyway
>>>>>>>> ttm_bo_vm_fault only trylocks to avoid a deadlock with mmap_sem,
>>>>>>>> it's really
>>>>>>>> a waiting reserve but for different reasons. Typically a
>>>>>>>> user-space app will keep
>>>>>>>> asynchronous maps to TTM during a buffer lifetime, and the
>>>>>>>> buffer lifetime may
>>>>>>>> be long as user-space apps keep caches.
>>>>>>>> That's not the same as accessing a buffer after the GPU is done
>>>>>>>> with it.
>>>>>>> Ah indeed.
>>>>>>>>> You don't need to hold the reservation while performing the
>>>>>>>>> wait itself though,
>>>>>>>>> you could check if ttm_bo_wait(no_wait_gpu = true) returns
>>>>>>>>> -EBUSY, and if so
>>>>>>>>> take a ref to the sync_obj member and then wait after
>>>>>>>>> unreserving. You won't
>>>>>>>>> reset sync_obj member to NULL in that case, but that should be
>>>>>>>>> harmless.
>>>>>>>>> This will allow you to keep the reservations fast and short.
>>>>>>>>> Maybe a helper
>>>>>>>>> would be appropriate for this since radeon and nouveau both
>>>>>>>>> seem to do this.
>>>>>>>>>
>>>>>>>> The problem is that as long as other users are waiting for idle
>>>>>>>> with reservation
>>>>>>>> held, for example to switch GPU engine or to delete a GPU bind,
>>>>>>>> waiting
>>>>>>>> for reserve will in many cases mean waiting for the GPU.
>>>>>>> This example sounds inefficient. I know nouveau can do this, but
>>>>>>> this essentially just stalls the GPU entirely. I guess you mean
>>>>>>> functions
>>>>>>> like nouveau_gem_object_close?
>>>>>>> It wouldn't surprise me if performance in nouveau could be
>>>>>>> improved simply by
>>>>>>> fixing those cases up instead, since it also stalls the application
>>>>>>> completely for other uses.
>>>>>>>
>>>>>> Please see the Nouveau cpu_prep patch as well.
>>>>>>
>>>>>> While there are a number of cases that can be fixed up, also in
>>>>>> Radeon, there's no way around the fact that reservation is a
>>>>>> heavyweight lock that will be held while waiting for idle,
>>>>>> particularly on simpler hardware without support for fence ordering
>>>>>> with barriers and / or "semaphores" and accelerated eviction.
>>>>>>
>>>>>> As such, it is unsuitable to protect read access to the fence
>>>>>> pointer. If the intention is to keep a single fence pointer
>>>>>> I think the best solution is to allow reading the fence pointer
>>>>>> outside reservation, but make sure this can be done
>>>>>> atomically. If the intention is to protect an array or list of fence
>>>>>> pointers, I think a spinlock is the better solution.
>>>>>>
>>>>>> /Thomas
>>>>> Just wanted to point out that, like Thomas, I am concerned about having
>>>>> to have the object reserved when waiting on its associated fence. I fear
>>>>> it will hurt us somehow.
>>>>>
>>>>> I will try to spend a couple of days stress testing your branch on radeon
>>>>> to see if it hurts performance in any way with current use cases.
>>>>>
>>>> I've been trying to figure out what to do with Maarten's patches going
>>>> forward, since they are essential for all kinds of SoC people.
>>>>
>>>> However, I'm still not 100% sure the problem is anything more than a
>>>> micro-optimisation, and possibly a premature one at that. This needs
>>>> someone to start suggesting benchmarks we can run and a driver set to
>>>> run them on; otherwise I'm starting to tend towards the view that we
>>>> are talking about an optimisation we can fix later.
>>>>
>>>> The other option is to somehow merge this stuff behind a command line
>>>> option that allows us to test it, but I don't think that is sane either.
>>>>
>>>> So Maarten, Jerome, Thomas, please start discussing actual real-world
>>>> tests you think merging this code will affect, and then I can give it
>>>> better consideration.
>>>>
>>>> Dave.
>>> Dave,
>>>
>>> This is IMHO a fundamental design discussion, not a micro-optimization.
>>>
>>> I'm pretty sure all sorts of horrendous things could be done to the DRM
>>> design without affecting real world application performance.
>>>
>>> In TTM data protection is primarily done with spinlocks: This serves two
>>> purposes.
>>>
>>> a) You can't unnecessarily hold a data protection lock while waiting for
>>> GPU, which is typically a very stupid thing to do (perhaps not so in
>>> this particular case) but when the sleep while atomic or locking
>>> inversion kernel warning turns up, that should at least
>>> ring a bell. Trying to implement dma-buf fencing using the TTM fencing
>>> callbacks will AFAICT cause a locking inversion.
>>>
>>> b) Spinlocks essentially go away on UP systems. The whole reservation
>>> path was essentially lock-free on UP systems pre dma-buf integration,
>>> and with very few atomic locking operations even on SMP systems. It was
>>> all prompted by a test some years ago (by Eric Anholt IIRC) showing that
>>> locking operations turned up quite high on Intel driver profiling.
>>>
>>> If we start protecting data with reservations, while we also export the
>>> reservation locks so that people find them a convenient "serialize work
>>> on this buffer" lock, all kinds of interesting things might happen, and
>>> we might end up in a situation similar to protecting everything with
>>> struct_mutex.
>>>
>>> This is why I dislike this change. In particular when there is a very
>>> simple remedy.
>>>
>>> But if I can get an ACK to convert the reservation object fence pointers
>>> and associated operations on them to be rcu-safe when I have some time
>>> left, I'd be prepared to accept this patch series in its current state.
>> RCU could be a useful way to deal with this. But in my case I've shown
>> there are very few places where it's needed; core TTM does not need it
>> at all.
>> This is why I wanted to leave it to individual drivers to implement it.
>>
>> I think kfree_rcu for freeing in the driver itself should be enough,
>> and obtaining a reference in the driver would require something like
>> this, similar to what's in rcuref.txt:
>>
>> rcu_read_lock();
>> f = rcu_dereference(bo->fence);
>> if (f && !kref_get_unless_zero(&f->kref)) f = NULL;
>> rcu_read_unlock();
>>
>> if (f) {
>> // do stuff here
>> ...
>>
>> // free fence
>> kref_put(&f->kref, fence_put_with_kfree_rcu);
>> }
>>
>> Am I wrong, or is something like this enough to make sure the fence is
>> still alive?
> No, you're correct.
>
> And a bo check for idle would amount to the following (since we don't
> care whether the fence refcount is zero):
>
> rcu_read_lock();
> f = rcu_dereference(bo->fence);
> signaled = !f || f->signaled;
> rcu_read_unlock();
>
> /Thomas
>
This is very easy to implement when there is only one fence slot, with bo->fence
being equal to reservation_object->fence_excl.
It appears to break down slightly when I implement it on top of shared fences.
My diff is against the for-next-fences branch of git://git.linaro.org/people/sumit.semwal/linux-3.x.git
shared_max_fence is kept in reservation_object to avoid reallocating in
reservation_object_reserve_shared_fence every time the same reservation_object
gets more than 4 shared fences; if it happens once, it's likely to happen again
on the same object.
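
For illustration (not part of the patch, untested, and the helper name is made
up), a lockless idle check on top of this layout would look roughly like the
sketch below, following the kref_get_unless_zero pattern quoted above and
assuming fence_is_signaled() from fence.h:

static inline bool
reservation_object_test_signaled_rcu(struct reservation_object *obj)
{
	struct reservation_object_shared *shared;
	struct fence *fence_excl, *fence;
	bool signaled = true;
	unsigned i, count;

	rcu_read_lock();

	fence_excl = rcu_dereference(obj->fence_excl);
	if (fence_excl && kref_get_unless_zero(&fence_excl->refcount)) {
		signaled = fence_is_signaled(fence_excl);
		fence_put(fence_excl);
	}

	shared = rcu_dereference(obj->shared);
	count = shared ? ACCESS_ONCE(shared->count) : 0;

	for (i = 0; signaled && i < count; ++i) {
		fence = ACCESS_ONCE(shared->fence[i]);

		/*
		 * A zero refcount means the fence is already being released;
		 * skip it, kfree_rcu keeps the memory valid until the grace
		 * period ends.
		 */
		if (!fence || !kref_get_unless_zero(&fence->refcount))
			continue;

		signaled = fence_is_signaled(fence);
		fence_put(fence);
	}

	rcu_read_unlock();
	return signaled;
}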
Anyhow, does the patch below look sane to you? It has only been compile-tested;
I haven't checked whether it boots.
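
For reference, this is roughly how I would expect the write side to be used
from a driver with the patch below (again only a sketch, not part of the patch,
untested, and ignoring the missing __rcu annotations just like the diff does):

static int driver_add_shared_fence(struct reservation_object *resv,
				   struct fence *fence)
{
	int ret;

	ww_mutex_lock(&resv->lock, NULL);

	/* make room for the new fence before publishing it */
	ret = reservation_object_reserve_shared_fence(resv);
	if (!ret)
		reservation_object_add_shared_fence(resv, fence);

	ww_mutex_unlock(&resv->lock);
	return ret;
}

static void driver_set_exclusive_fence(struct reservation_object *resv,
				       struct fence *fence)
{
	struct fence *old;

	ww_mutex_lock(&resv->lock, NULL);

	old = resv->fence_excl;
	fence_get(fence);
	rcu_assign_pointer(resv->fence_excl, fence);

	/* an exclusive fence orders after all previously added shared fences */
	reservation_object_clear_shared(resv);

	ww_mutex_unlock(&resv->lock);

	if (old)
		fence_put(old);
}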
---
commit c73b87d33fd08f7c1b0a381b08b3626128f8f3b8
Author: Maarten Lankhorst <maarten.lankhorst at canonical.com>
Date: Tue Mar 25 15:10:21 2014 +0100
add rcu to fence HACK WIP DO NOT USE KILLS YOUR POT PLANTS PETS AND FAMILY
diff --git a/drivers/base/dma-buf.c b/drivers/base/dma-buf.c
index 96338bf7f457..0a88c10b3db9 100644
--- a/drivers/base/dma-buf.c
+++ b/drivers/base/dma-buf.c
@@ -134,7 +134,10 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
{
struct dma_buf *dmabuf;
struct reservation_object *resv;
+ struct reservation_object_shared *shared;
+ struct fence *fence_excl;
unsigned long events;
+ unsigned shared_count;
dmabuf = file->private_data;
if (!dmabuf || !dmabuf->resv)
@@ -148,14 +151,18 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
if (!events)
return 0;
- ww_mutex_lock(&resv->lock, NULL);
+ rcu_read_lock();
- if (resv->fence_excl && (!(events & POLLOUT) ||
- resv->fence_shared_count == 0)) {
+ shared = rcu_dereference(resv->shared);
+ fence_excl = rcu_dereference(resv->fence_excl);
+ shared_count = ACCESS_ONCE(shared->count);
+
+ if (fence_excl && (!(events & POLLOUT) ||
+ (!shared || shared_count == 0))) {
struct dma_buf_poll_cb_t *dcb = &dmabuf->cb_excl;
unsigned long pevents = POLLIN;
- if (resv->fence_shared_count == 0)
+ if (!shared || shared_count == 0)
pevents |= POLLOUT;
spin_lock_irq(&dmabuf->poll.lock);
@@ -167,19 +174,26 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
spin_unlock_irq(&dmabuf->poll.lock);
if (events & pevents) {
- if (!fence_add_callback(resv->fence_excl,
- &dcb->cb, dma_buf_poll_cb))
+ if (!kref_get_unless_zero(&fence_excl->refcount)) {
+ /* force a recheck */
+ events &= ~pevents;
+ dma_buf_poll_cb(NULL, &dcb->cb);
+ } else if (!fence_add_callback(fence_excl, &dcb->cb,
+ dma_buf_poll_cb)) {
events &= ~pevents;
- else
+ fence_put(fence_excl);
+ } else {
/*
* No callback queued, wake up any additional
* waiters.
*/
+ fence_put(fence_excl);
dma_buf_poll_cb(NULL, &dcb->cb);
+ }
}
}
- if ((events & POLLOUT) && resv->fence_shared_count > 0) {
+ if ((events & POLLOUT) && shared && shared_count > 0) {
struct dma_buf_poll_cb_t *dcb = &dmabuf->cb_shared;
int i;
@@ -194,20 +208,34 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
if (!(events & POLLOUT))
goto out;
- for (i = 0; i < resv->fence_shared_count; ++i)
- if (!fence_add_callback(resv->fence_shared[i],
- &dcb->cb, dma_buf_poll_cb)) {
+ for (i = 0; i < shared_count; ++i) {
+ struct fence *fence = ACCESS_ONCE(shared->fence[i]);
+ if (!kref_get_unless_zero(&fence->refcount)) {
+ /*
+ * fence refcount dropped to zero, this means
+ * that the shared object has been freed from under us.
+ * call dma_buf_poll_cb and force a recheck!
+ */
events &= ~POLLOUT;
+ dma_buf_poll_cb(NULL, &dcb->cb);
break;
}
+ if (!fence_add_callback(fence, &dcb->cb,
+ dma_buf_poll_cb)) {
+ fence_put(fence);
+ events &= ~POLLOUT;
+ break;
+ }
+ fence_put(fence);
+ }
/* No callback queued, wake up any additional waiters. */
- if (i == resv->fence_shared_count)
+ if (i == shared_count)
dma_buf_poll_cb(NULL, &dcb->cb);
}
out:
- ww_mutex_unlock(&resv->lock);
+ rcu_read_unlock();
return events;
}
diff --git a/drivers/base/fence.c b/drivers/base/fence.c
index 8fff13fb86cf..be03a9e8ad8b 100644
--- a/drivers/base/fence.c
+++ b/drivers/base/fence.c
@@ -170,7 +170,7 @@ void release_fence(struct kref *kref)
if (fence->ops->release)
fence->ops->release(fence);
else
- kfree(fence);
+ kfree_rcu(fence, rcu);
}
EXPORT_SYMBOL(release_fence);
diff --git a/include/linux/fence.h b/include/linux/fence.h
index 65f2a01ee7e4..d19620405c34 100644
--- a/include/linux/fence.h
+++ b/include/linux/fence.h
@@ -40,6 +40,7 @@ struct fence_cb;
* struct fence - software synchronization primitive
* @refcount: refcount for this fence
* @ops: fence_ops associated with this fence
+ * @rcu: used for releasing fence with kfree_rcu
* @cb_list: list of all callbacks to call
* @lock: spin_lock_irqsave used for locking
* @context: execution context this fence belongs to, returned by
@@ -73,6 +74,7 @@ struct fence_cb;
struct fence {
struct kref refcount;
const struct fence_ops *ops;
+ struct rcu_head rcu;
struct list_head cb_list;
spinlock_t *lock;
unsigned context, seqno;
diff --git a/include/linux/reservation.h b/include/linux/reservation.h
index f3f57460a205..91576fabafdb 100644
--- a/include/linux/reservation.h
+++ b/include/linux/reservation.h
@@ -42,6 +42,7 @@
#include <linux/ww_mutex.h>
#include <linux/fence.h>
#include <linux/slab.h>
+#include <linux/rcupdate.h>
extern struct ww_class reservation_ww_class;
@@ -49,8 +50,12 @@ struct reservation_object {
struct ww_mutex lock;
struct fence *fence_excl;
- struct fence **fence_shared;
- u32 fence_shared_count, fence_shared_max;
+ u32 shared_max_fence;
+ struct reservation_object_shared {
+ struct rcu_head rcu;
+ u32 count;
+ struct fence *fence[];
+ } *shared;
};
static inline void
@@ -58,8 +63,8 @@ reservation_object_init(struct reservation_object *obj)
{
ww_mutex_init(&obj->lock, &reservation_ww_class);
- obj->fence_shared_count = obj->fence_shared_max = 0;
- obj->fence_shared = NULL;
+ obj->shared = NULL;
+ obj->shared_max_fence = 4;
obj->fence_excl = NULL;
}
@@ -70,11 +75,97 @@ reservation_object_fini(struct reservation_object *obj)
if (obj->fence_excl)
fence_put(obj->fence_excl);
- for (i = 0; i < obj->fence_shared_count; ++i)
- fence_put(obj->fence_shared[i]);
- kfree(obj->fence_shared);
+ if (obj->shared) {
+ for (i = 0; i < obj->shared->count; ++i)
+ fence_put(obj->shared->fence[i]);
+
+ /*
+ * This object should be dead and all references must have
+ * been released to it, so no need to free with rcu.
+ */
+ kfree(obj->shared);
+ }
ww_mutex_destroy(&obj->lock);
}
+/*
+ * Reserve space to add a shared fence to a reservation_object,
+ * must be called with obj->lock held.
+ */
+static inline int
+reservation_object_reserve_shared_fence(struct reservation_object *obj)
+{
+ struct reservation_object_shared *shared, *old;
+ u32 max = obj->shared_max_fence;
+
+ if (obj->shared) {
+ if (obj->shared->count < max)
+ return 0;
+ max *= 2;
+ }
+
+ shared = kmalloc(offsetof(typeof(*shared), fence[max]), GFP_KERNEL);
+ if (!shared)
+ return -ENOMEM;
+ old = obj->shared;
+
+ if (old) {
+ shared->count = old->count;
+ memcpy(shared->fence, old->fence, old->count * sizeof(*old->fence));
+ } else {
+ shared->count = 0;
+ }
+ rcu_assign_pointer(obj->shared, shared);
+ obj->shared_max_fence = max;
+ kfree_rcu(old, rcu);
+ return 0;
+}
+
+/*
+ * Add a fence to a shared slot, obj->lock must be held, and
+ * reservation_object_reserve_shared_fence has been called.
+ */
+
+static inline void
+reservation_object_add_shared_fence(struct reservation_object *obj,
+ struct fence *fence)
+{
+ unsigned i;
+
+ BUG_ON(obj->shared->count == obj->shared_max_fence);
+ fence_get(fence);
+
+ for (i = 0; i < obj->shared->count; ++i)
+ if (obj->shared->fence[i]->context == fence->context) {
+ struct fence *old = obj->shared->fence[i];
+ rcu_assign_pointer(obj->shared->fence[i], fence);
+ fence_put(old);
+ return;
+ }
+
+ obj->shared->fence[obj->shared->count] = fence;
+ smp_wmb();
+ obj->shared->count++;
+}
+
+/*
+ * May be called after adding an exclusive to wipe all shared fences.
+ */
+
+static inline void
+reservation_object_clear_shared(struct reservation_object *obj)
+{
+ struct reservation_object_shared *old = obj->shared;
+ unsigned i;
+
+ if (!old)
+ return;
+
+ rcu_assign_pointer(obj->shared, NULL);
+ for (i = 0; i < old->count; ++i)
+ fence_put(old->fence[i]);
+ kfree_rcu(old, rcu);
+}
+
#endif /* _LINUX_RESERVATION_H */