[PATCH] drm/ttm: remove fence_lock

Thomas Hellstrom thomas at shipmail.org
Fri Mar 21 01:27:48 PDT 2014


On 03/21/2014 12:55 AM, Dave Airlie wrote:
> On Fri, Oct 19, 2012 at 3:04 AM, Jerome Glisse <j.glisse at gmail.com> wrote:
>> On Thu, Oct 18, 2012 at 06:43:40PM +0200, Thomas Hellstrom wrote:
>>> On 10/18/2012 04:45 PM, Maarten Lankhorst wrote:
>>>> Op 18-10-12 13:55, Thomas Hellstrom schreef:
>>>>> On 10/18/2012 01:38 PM, Maarten Lankhorst wrote:
>>>>>> Op 18-10-12 13:02, Thomas Hellstrom schreef:
>>>>>>> On 10/18/2012 10:37 AM, Maarten Lankhorst wrote:
>>>>>>>> Hey,
>>>>>>>>
>>>>>>>> Op 18-10-12 09:59, Thomas Hellstrom schreef:
>>>>>>>>> On 10/18/2012 09:28 AM, Thomas Hellstrom wrote:
>>>>>>>>>> Hi, Maarten,
>>>>>>>>>>
>>>>>>>>>> As you know I have been having my doubts about this change.
>>>>>>>>>> To me it seems insane to be forced to read the fence pointer under a
>>>>>>>>>> reserved lock, simply because when you take the reserve lock, another
>>>>>>>>>> process may have it and there is a substantial chance that that process
>>>>>>>>>> will also be waiting for idle while holding the reserve lock.
>>>>>>>> I think it makes perfect sense, the only times you want to read the fence
>>>>>>>> is when you want to change the members protected by the reservation.
>>>>>>> No, that's not true. A typical case (or the only case)
>>>>>>> is where you want to do a map with no_wait semantics. You will want
>>>>>>> to be able to access a buffer's results even if the eviction code
>>>>>>> is about to schedule an unbind from the GPU, and have the buffer
>>>>>>> reserved?
>>>>>> Well either block on reserve or return -EBUSY if reserved, presumably the
>>>>>> former..
>>>>>>
>>>>>> ttm_bo_vm_fault does the latter already, anyway
>>>>> ttm_bo_vm_fault only trylocks to avoid a deadlock with mmap_sem, it's really
>>>>> a waiting reserve but for different reasons. Typically a user-space app will keep
>>>>> asynchronous maps to TTM during a buffer lifetime, and the buffer lifetime may
>>>>> be long as user-space apps keep caches.
>>>>> That's not the same as accessing a buffer after the GPU is done with it.
>>>> Ah indeed.
>>>>>> You don't need to hold the reservation while performing the wait itself though,
>>>>>> you could check if ttm_bo_wait(no_wait_gpu = true) returns -EBUSY, and if so
>>>>>> take a ref to the sync_obj member and then wait after unreserving. You won't
>>>>>> reset sync_obj member to NULL in that case, but that should be harmless.
>>>>>> This will allow you to keep the reservations fast and short. Maybe a helper
>>>>>> would be appropriate for this since radeon and nouveau both seem to do this.
>>>>>>
>>>>> The problem is that as long as other users are waiting for idle with reservation
>>>>> held, for example to switch GPU engines or to delete a GPU binding, waiting
>>>>> for reserve will in many cases mean waiting for the GPU.
>>>> This example sounds inefficient. I know nouveau can do this, but it essentially
>>>> just stalls the gpu entirely. I guess you mean functions like nouveau_gem_object_close?
>>>> It wouldn't surprise me if performance in nouveau could be improved simply by
>>>> fixing those cases up instead, since it stalls the application completely too for other uses.
>>>>
>>> Please see the Nouveau cpu_prep patch as well.
>>>
>>> While there are a number of cases that can be fixed up, also in
>>> Radeon, there's no way around the fact that reservation is a
>>> heavyweight lock that will be held while waiting for idle,
>>> particularly on simpler hardware without support for fence ordering
>>> with barriers and / or "semaphores" and accelerated eviction.
>>>
>>> As such, it is unsuitable to protect read access to the fence
>>> pointer. If the intention is to keep a single fence pointer
>>> I think the best solution is to allow reading the fence pointer
>>> outside reservation, but make sure this can be done
>>> atomically. If the intention is to protect an array or list of fence
>>> pointers, I think a spinlock is the better solution.
>>>
>>> /Thomas
>> Just wanted to point out that, like Thomas, I am concerned about having
>> to have the object reserved when waiting on its associated fence. I fear
>> it will hurt us somehow.
>>
>> I will try to spend a couple of days stress-testing your branch on radeon,
>> trying to see if it hurts performance in any current use case.
>>
> I've been trying to figure out what to do with Maarten's patches going
> forward since they are essential for all kinds of SoC people,
>
> However I'm still not 100% sure I believe that the problem is anything
> more than a micro-optimisation, and possibly a premature one at that.
> This needs someone to start suggesting benchmarks we can run and a
> driver set to run them on; otherwise I'm starting to tend towards the
> view that we are talking about an optimisation we can fix later.
>
> The other option is to somehow merge this stuff under an option that
> allows us to test it using a command-line option, but I don't think
> that is sane either.
>
> So Maarten, Jerome, Thomas, please start discussing actual real world
> tests you think merging this code will affect and then I can make a
> better consideration.
>
> Dave.

Dave,

This is IMHO a fundamental design discussion, not a micro-optimization.

I'm pretty sure all sorts of horrendous things could be done to the DRM
design without affecting real world application performance.

In TTM, data protection is primarily done with spinlocks. This serves two
purposes.

a) You can't unnecessarily hold a data protection lock while waiting for
the GPU, which is typically a very stupid thing to do (perhaps not so in
this particular case), but when the sleep-while-atomic or locking
inversion kernel warning turns up, that should at least
ring a bell. Trying to implement dma-buf fencing using the TTM fencing
callbacks will AFAICT cause a locking inversion.

b) Spinlocks essentially go away on UP systems. The whole reservation
path was essentially lock-free on UP systems before dma-buf integration,
and with very few atomic locking operations even on SMP systems. It was
all prompted by a test some years ago (by Eric Anholt IIRC) showing that
locking operations turned up quite high in Intel driver profiling.

If we start protecting data with reservations while we also export the
reservation locks, so that people find them a convenient "serialize work
on this buffer" lock, all kinds of interesting things might happen, and
we might end up in a situation
similar to protecting everything with struct_mutex.

This is why I dislike this change. In particular when there is a very
simple remedy.

But if I can get an ACK to convert the reservation object fence pointers
and associated operations on them to be RCU-safe when I have some time
left, I'd be prepared to accept this patch series in its current state.

/Thomas





