TTM Locking order of bo::reserve -> vm::mmap_sem

Mon Oct 21 15:21:28 CEST 2013

op 21-10-13 13:10, Thomas Hellstrom schreef:
> On 10/21/2013 12:24 PM, Maarten Lankhorst wrote:
>> op 21-10-13 12:10, Thomas Hellstrom schreef:
>>> On 10/21/2013 11:48 AM, Maarten Lankhorst wrote:
>>>> op 21-10-13 11:37, Thomas Hellstrom schreef:
>>>>> On 10/21/2013 11:01 AM, Maarten Lankhorst wrote:
>>>>>> op 21-10-13 10:48, Thomas Hellstrom schreef:
>>>>>>> Hi!
>>>>>>>
>>>>>>> As discussed previously the current locking order in TTM of these locks is bo::reserve -> vm::mmap_sem. This leads to a hack in
>>>>>>> the TTM fault() handle to try and revert the locking order. If a tryreserve failed, we tried to have the vm code release the mmap_sem() and then schedule, to give the holder of bo::reserve a chance to release the lock. This solution is no longer legal, since we've been more or less kindly asked to remove the set_need_resched() call.
>>>>>>>
>>>>>>> Maarten has proposed to invert the locking order. I've previously said I had no strong preference. The current locking order dates back from the time when TTM wasn't using unmap_mapping_range() but walked the page tables itself, updating PTEs as needed. Furthermore it was needed for user bos that used get_user_pages() in the TTM populate and swap-in methods. User-bos were removed some time ago but I'm looking at re-adding them. They would suite the VMware model of cached-only pages very well. I see uses both in the gallium API, XA's DMA functionality and openCL.
>>>>>>>
>>>>>>> We would then need a somewhat nicer way to invert the locking order. I've attached a solution that ups the mmap_sem and then reserves, but due to how the fault API is done, we then need to release the reserve and retry the fault. This of course opens up for starvation, but I don't think starvation at this point is very likely: One thread being refused to write or read from a buffer object because the GPU is continously busy with it. If this *would* become a problem, it's probably possible to modify the fault code to allow us to hold locks until the retried fault, but that would be a bit invasive, since it touches the arch code....
>>>>>>>
>>>>>>> Basically I'm proposing to keep the current locking order.
>>>>>> I'm not sure why we have to worry about mmap_sem lock being taken before bo::reserve. If we already hold mmap_sem,
>>>>>> no extra locking is needed for get_user_pages.
>>>>> Typically, they are populated outside of fault, as part of execbuf, where we don't hold and don't want to hold mmap_sem(). In fact,
>>>>> user bo's should not be remappable through the TTM VM system. Anyway, we need to grab the mmap_sem inside ttm_populate for user buffers.
>>>> If we don't allow mmapping user bo's through TTM, we can use special lockdep annotation when user-bo's are used. Normal bo's would have
>>>> mmap_sem outer lock, bo::reserve inner lock, while those bo's would have the other way around.
>>>>
>>>> This might complicate validation a little, since you would have to reserve and validate all user-bo's before any normal bo's are reserved. But since this
>>>> is meant to be a vmwgfx specific optimization I think it might be worth it.
>>> Would that work (lockdep-wise) with user BO swapout as part of a normal BO validation? During user BO swapout, we don't need mmap_sem, but the BO needs to be reserved. But I guess it would work, since we use tryreserve when walking the LRU lists?
>> Correct.
>>
>>>>>>     Releasing it is a bit silly. I think we should keep mmap_sem as outer
>>>>>> lock, and have bo::reserve as inner, even if it might complicate support for user-bo's. I'm not sure what you can do
>>>>>> with user-bo's that can't be done by allocating the same bo from kernel first and map + populate it.
>>>>>>
>>>>>> ~Maarten
>>>>> Using DMA API analogy, user BOs correspond to using streaming DMA whereas normal BOs correspond to alloced DMA memory buffers.
>>>>> We boost performance and save resources.
>>>> Yeah but it's vmwgfx specific. Nouveau and radeon have dedicated copy engines that can be used. Flushing the vm and stalling is probably more
>>>> expensive than performing a memcpy
>>> In the end, I'm not sure it will be vmwgfx specific once there is a working example, and there are user-space APIs that will benefit from it. There are other examples out there today that uses streaming DMA to feed the DMA engines, although not through TTM, see for example via_dmablit.c.
>>>
>>> Also if we need a separate locking order for User BOs, what would be the big benefit of having the locking order mmap_sem()->bo_reserve() ?
>> Deterministic locking. I fear about livelocks that could happen otherwise with the right timing. Especially but not exclusively on -rt kernels.
> But livelocks wouldn't be an issue anymore since we use a waiting reserve in the retry path, right? The only issue we might theoretically be facing is starvation in the fault path.
>
> With a two-step validation scheme I think we're up to real issues, like bouncing ordinary BOs in and out of swap during a single execbuf...
I would need to see some code, but I think it can be avoided by having a slowpath similar to i915 if it fails.

Anyway have 2 threads fault on the same bo in different offsets, so fault handler is called twice on the same bo.

You could end up a loop, first one grabbed the bo lock without mmap_sem, second one does trylock, which fails then unlocks mmap_sem, and acquires the bo lock again. At which point the first thread acquired mmap_sem and does a trylock. So the same thing can theoretically happen infinitely over and over again.

~Maarten