[PATCH] drm/ttm: remove fence_lock

Thomas Hellstrom thomas at shipmail.org
Fri Mar 21 06:04:15 PDT 2014


On 03/21/2014 01:12 PM, Maarten Lankhorst wrote:
> Hey,
>
> op 21-03-14 09:27, Thomas Hellstrom schreef:
>> On 03/21/2014 12:55 AM, Dave Airlie wrote:
>>> On Fri, Oct 19, 2012 at 3:04 AM, Jerome Glisse <j.glisse at gmail.com>
>>> wrote:
>>>> On Thu, Oct 18, 2012 at 06:43:40PM +0200, Thomas Hellstrom wrote:
>>>>> On 10/18/2012 04:45 PM, Maarten Lankhorst wrote:
>>>>>> Op 18-10-12 13:55, Thomas Hellstrom schreef:
>>>>>>> On 10/18/2012 01:38 PM, Maarten Lankhorst wrote:
>>>>>>>> Op 18-10-12 13:02, Thomas Hellstrom schreef:
>>>>>>>>> On 10/18/2012 10:37 AM, Maarten Lankhorst wrote:
>>>>>>>>>> Hey,
>>>>>>>>>>
>>>>>>>>>> Op 18-10-12 09:59, Thomas Hellstrom schreef:
>>>>>>>>>>> On 10/18/2012 09:28 AM, Thomas Hellstrom wrote:
>>>>>>>>>>>> Hi, Maarten,
>>>>>>>>>>>>
>>>>>>>>>>>> As you know I have been having my doubts about this change.
>>>>>>>>>>>> To me it seems insane to be forced to read the fence
>>>>>>>>>>>> pointer under a
>>>>>>>>>>>> reserved lock, simply because when you take the reserve
>>>>>>>>>>>> lock, another
>>>>>>>>>>>> process may have it and there is a substantial chance that
>>>>>>>>>>>> that process
>>>>>>>>>>>> will also be waiting for idle while holding the reserve lock.
>>>>>>>>>> I think it makes perfect sense, the only times you want to
>>>>>>>>>> read the fence
>>>>>>>>>> is when you want to change the members protected by the
>>>>>>>>>> reservation.
>>>>>>>>> No, that's not true. A typical case (or the only case) is where
>>>>>>>>> you want to do a map with no_wait semantics. You will want to be
>>>>>>>>> able to access a buffer's results even if the eviction code is
>>>>>>>>> about to schedule an unbind from the GPU and holds the buffer
>>>>>>>>> reserved.
>>>>>>>> Well, either block on reserve or return -EBUSY if reserved;
>>>>>>>> presumably the former.
>>>>>>>>
>>>>>>>> ttm_bo_vm_fault does the latter already, anyway
>>>>>>> ttm_bo_vm_fault only trylocks to avoid a deadlock with mmap_sem,
>>>>>>> it's really
>>>>>>> a waiting reserve but for different reasons. Typically a
>>>>>>> user-space app will keep
>>>>>>> asynchronous maps to TTM during a buffer lifetime, and the
>>>>>>> buffer lifetime may
>>>>>>> be long as user-space apps keep caches.
>>>>>>> That's not the same as accessing a buffer after the GPU is done
>>>>>>> with it.
>>>>>> Ah indeed.
>>>>>>>> You don't need to hold the reservation while performing the
>>>>>>>> wait itself though,
>>>>>>>> you could check if ttm_bo_wait(no_wait_gpu = true) returns
>>>>>>>> -EBUSY, and if so
>>>>>>>> take a ref to the sync_obj member and then wait after
>>>>>>>> unreserving. You won't
>>>>>>>> reset sync_obj member to NULL in that case, but that should be
>>>>>>>> harmless.
>>>>>>>> This will allow you to keep the reservations fast and short.
>>>>>>>> Maybe a helper
>>>>>>>> would be appropriate for this since radeon and nouveau both
>>>>>>>> seem to do this.
>>>>>>>>
>>>>>>> The problem is that as long as other users are waiting for idle
>>>>>>> with the reservation held, for example to switch GPU engine or to
>>>>>>> delete a GPU bind, waiting for reserve will in many cases mean
>>>>>>> waiting for the GPU.
>>>>>> This example sounds inefficient. I know nouveau can do this, but
>>>>>> this essentially just stalls the gpu entirely. I guess you mean
>>>>>> functions like nouveau_gem_object_close? It wouldn't surprise me
>>>>>> if performance in nouveau could be improved simply by fixing those
>>>>>> cases up instead, since it stalls the application completely too
>>>>>> for other uses.
>>>>>>
>>>>> Please see the Nouveau cpu_prep patch as well.
>>>>>
>>>>> While there are a number of cases that can be fixed up, also in
>>>>> Radeon, there's no way around the fact that reservation is a
>>>>> heavyweight lock that, particularly on simpler hardware without
>>>>> support for fence ordering with barriers and/or "semaphores" and
>>>>> accelerated eviction, will be held while waiting for idle.
>>>>>
>>>>> As such, it is unsuitable to protect read access to the fence
>>>>> pointer. If the intention is to keep a single fence pointer
>>>>> I think the best solution is to allow reading the fence pointer
>>>>> outside reservation, but make sure this can be done
>>>>> atomically. If the intention is to protect an array or list of fence
>>>>> pointers, I think a spinlock is the better solution.
>>>>>
>>>>> /Thomas
>>>> Just wanted to point out that, like Thomas, I am concerned about
>>>> having to have the object reserved when waiting on its associated
>>>> fence. I fear it will hurt us somehow.
>>>>
>>>> I will try to spend a couple of days stress testing your branch on
>>>> radeon, trying to see if it hurts performance in any way with
>>>> current use cases.
>>>>
>>> I've been trying to figure out what to do with Maarten's patches going
>>> forward since they are essential for all kinds of SoC people,
>>>
>>> However, I'm still not 100% sure the problem is anything more than a
>>> micro-optimisation, and possibly a premature one at that. This needs
>>> someone to start suggesting benchmarks we can run and a driver set to
>>> run them on; otherwise I'm starting to tend towards the view that we
>>> are talking about an optimisation we can fix later,
>>>
>>> The other option is to somehow merge this stuff under an option that
>>> allows us to test it using a command line option, but I don't think
>>> that is sane either,
>>>
>>> So Maarten, Jerome, Thomas, please start discussing actual real world
>>> tests you think merging this code will affect and then I can make a
>>> better consideration.
>>>
>>> Dave.
>> Dave,
>>
>> This is IMHO a fundamental design discussion, not a micro-optimization.
>>
>> I'm pretty sure all sorts of horrendous things could be done to the DRM
>> design without affecting real world application performance.
>>
>> In TTM data protection is primarily done with spinlocks: This serves two
>> purposes.
>>
>> a) You can't unnecessarily hold a data protection lock while waiting
>> for the GPU, which is typically a very stupid thing to do (perhaps not
>> so in this particular case), but when the sleep-while-atomic or locking
>> inversion kernel warning turns up, that should at least ring a bell.
>> Trying to implement dma-buf fencing using the TTM fencing callbacks
>> will AFAICT cause a locking inversion.
>>
>> b) Spinlocks essentially go away on UP systems. The whole reservation
>> path was essentially lock-free on UP systems pre dma-buf integration,
>> and with very few atomic locking operations even on SMP systems. It was
>> all prompted by a test some years ago (by Eric Anholt IIRC) showing that
>> locking operations turned up quite high on Intel driver profiling.
>>
>> If we start protecting data with reservations, while we also export
>> the reservation locks so that people find them a convenient "serialize
>> work on this buffer" lock, all kinds of interesting things might
>> happen, and we might end up in a situation similar to protecting
>> everything with struct_mutex.
>>
>> This is why I dislike this change. In particular when there is a very
>> simple remedy.
>>
>> But if I can get an ACK to convert the reservation object fence pointers
>> and associated operations on them to be rcu-safe when I have some time
>> left, I'd be prepared to accept this patch series in its current state.
> RCU could be a useful way to deal with this. But in my case I've shown
> there are very few places where it's needed; core ttm does not need it
> at all. This is why I wanted to leave it to individual drivers to
> implement it.
>
> I think kfree_rcu for freeing in the driver itself should be enough,
> and obtaining a reference in the driver would require something like
> this, similar to what's in rcuref.txt:
>
> rcu_read_lock();
> f = rcu_dereference(bo->fence);
> if (f && !kref_get_unless_zero(&f->kref))
>         f = NULL;
> rcu_read_unlock();
>
> if (f) {
>         /* do stuff here */
>         ...
>
>         /* free fence */
>         kref_put(&f->kref, fence_put_with_kfree_rcu);
> }
>
> Am I wrong, or is something like this enough to make sure the fence is
> still alive?

No, you're correct.

And a bo idle check would amount to the following (here we don't care
whether the fence refcount is zero):

rcu_read_lock();
f = rcu_dereference(bo->fence);
signaled = !f || f->signaled;
rcu_read_unlock();

/Thomas

> There might still be a small bug when bo->fence's refcount is
> decreased before bo->fence is set to NULL. I haven't checked core ttm,
> so that might need fixing.
>
> I added some more people to CC. It might be worthwhile having this in
> the core fence code and delete all fences with rcu, but I'm not
> completely certain about that yet.
>
> ~Maarten
>

More information about the dri-devel mailing list