Deadlock on PTEs update for HMM

Thu Dec 5 08:17:53 UTC 2019

Hi guys,

I was wondering why the heck Felix wants to add another lock for over an 
hour before I realized that I accidentally forget to send out patch #4 
and #5.

And yes what you describe is exactly what my patch #4 is doing and patch 
#5 is then a prove of concept to use the new interface for evictions of 
shared BOs.

Regards,
Christian.

Am 05.12.19 um 02:27 schrieb Felix Kuehling:
> [adding the mailing list back]
>
> Christian sent a 3-patch series that I just reviewed and commented on. 
> That removes the need to lock the page directory reservation for 
> updating page tables. That solves a part of the problem, but not all 
> of it.
>
> We still need to implement that driver lock for HMM, and add a 
> vm_resident flag in the vm structure that could be checked in the new 
> amdgpu_vm_evictable function Christian added. That function still runs 
> in atomic context (under the ttm_bo_glob.lru_lock spinlock). So we 
> can't lock a mutex there. But we can use trylock and return "not 
> evictable" if we fail to take the mutex.
>
> If we take the mutex successfully, we can update the vm_resident flag 
> to false and allow the eviction.
>
> The rest should work just the way I proposed earlier.
>
> Regards,
> Â  Felix
>
> On 2019-12-04 8:01 p.m., Sierra Guiza, Alejandro (Alex) wrote:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> Hi Christian,
>> I wonder if you had have time to check on this implementation?
>> Please let me know if there's something that I could help you with
>>
>> Regards,
>> Alex
>>
>> -----Original Message-----
>> From: Christian KÃ¶nig <ckoenig.leichtzumerken at gmail.com>
>> Sent: Friday, November 29, 2019 1:41 AM
>> To: Kuehling, Felix <Felix.Kuehling at amd.com>; Koenig, Christian 
>> <Christian.Koenig at amd.com>; Sierra Guiza, Alejandro (Alex) 
>> <Alex.Sierra at amd.com>
>> Cc: amd-gfx at lists.freedesktop.org
>> Subject: Re: Deadlock on PTEs update for HMM
>>
>> [CAUTION: External Email]
>>
>> Hi Felix,
>>
>> yes that is exactly my thinking as well. Problem is that getting this 
>> to work was much harder than I thought.
>>
>> We can't use a mutex cause TTM is calling the eviction callback in 
>> atomic context. A spinlock doesn't looked like a good idea either 
>> because we potentially need to wait for the hardware with a fixed IB 
>> pool.
>>
>> Because of this I've started to rewrite the TTM handling to not call 
>> the driver in an atomic context any more, but that took me way longer 
>> than expected as well.
>>
>> I'm currently experimenting with using a trylock driver mutex, that 
>> at least that should work for now until we got something better.
>>
>> Regards,
>> Christian.
>>
>> Am 28.11.19 um 21:30 schrieb Felix Kuehling:
>>> Hi Christian,
>>>
>>> I'm thinking about this problem, trying to come up with a solution.
>>> The fundamental problem is that we need low-overhead access to the
>>> page table in the MMU notifier, without much memory management or
>>> locking.
>>>
>>> There is one "driver lock" that we're supposed to take in the MMU
>>> notifier as well as when we update page tables that is prescribed by
>>> the HMM documentation (Documentation/vm/hmm.rst). I don't currently
>>> see such a lock in amdgpu. We'll probably need to add that anyway,
>>> with all the usual precautions about lock dependencies around MMU
>>> notifiers. Then we could use that lock to protect page table residency
>>> state, in addition to the reservation of the top-level page directory.
>>>
>>> We don't want to block eviction of page tables unconditionally, so the
>>> MMU notifier must be able to deal with the situation that page tables
>>> are not resident at the moment. But the lock can delay page tables
>>> from being evicted while an MMU notifier is in progress and protect us
>>> from race conditions between MMU notifiers invalidating PTEs, and page
>>> tables getting evicted.
>>>
>>> amdgpu_vm_bo_invalidate could detect when a page table is being
>>> evicted, and update a new "vm_resident" flag inside the amdgpu_vm
>>> while holding the "HMM driver lock". If an MMU notifier is in
>>> progress, trying to take the "HMM driver lock" will delay the eviction
>>> long enough for any pending PTE invalidation to complete.
>>>
>>> In the case that page tables are not resident (vm_resident flag is
>>> false), it means the GPU is currently not accessing any memory in that
>>> amdgpu_vm address space. So we don't need to invalidate the PTEs right
>>> away. I think we could implement a deferred invalidation mechanism for
>>> this case, that delays the invalidation until the next time the page
>>> tables are made resident. amdgpu_amdkfd_gpuvm_restore_process_bos
>>> would replay any deferred PTE invalidations after validating the page
>>> tables but before restarting the user mode queues for the process. If
>>> graphics ever implements page-fault-based memory management, you'd
>>> need to do something similar in amdgpu_cs.
>>>
>>> Once all that is in place, we should be able to update PTEs in MMU
>>> notifiers without reserving the page tables.
>>>
>>> If we use SDMA for updating page tables, we may need a pre-allocated
>>> IB for use in MMU notifiers. And there is problably other details to
>>> be worked out about exactly how we implement the PTE invalidation in
>>> MMU notifiers and reflecting that in the state of the amdgpu_vm and
>>> amdgpu_bo_va_mapping.
>>>
>>> Does this idea sound reasonable to you? Can you think of a simpler
>>> solution?
>>>
>>> Thanks,
>>> Â Â  Felix
>>>
>>> On 2019-11-27 10:02 a.m., Christian KÃ¶nig wrote:
>>>> Hi Alejandro,
>>>>
>>>> yes I'm very aware of this issue, but unfortunately can't give an
>>>> easy solution either.
>>>>
>>>> I'm working for over a year now on getting this fixed, but
>>>> unfortunately it turned out that this problem is much bigger than
>>>> initially thought.
>>>>
>>>> Setting the appropriate GFP flags for the job allocation is actually
>>>> the trivial part.
>>>>
>>>> The really really hard thing is that we somehow need to add a lock to
>>>> prevent the page tables from being evicted. And as you also figured
>>>> out that lock can't be taken easily anywhere else.
>>>>
>>>> I've already wrote a prototype for this, but didn't had time to
>>>> hammer it into shape for upstreaming yet.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 27.11.19 um 15:55 schrieb Sierra Guiza, Alejandro (Alex):
>>>>> Hi Christian,
>>>>>
>>>>> As you know, weâ€™re working on the HMM enablement. Im working on the
>>>>> dGPU page table entries invalidation on the userptr mapping case.
>>>>> Currently, the MMU notifiers handle stops all user mode queues,
>>>>> schedule a delayed worker to re-validate userptr mappings and
>>>>> restart the queues.
>>>>>
>>>>> Part of the HMM functionality, we need to invalidate the page table
>>>>> entries instead of stopping the queues. At the same time we need to
>>>>> move the revalidation of the userptr mappings into the page fault
>>>>> handler.
>>>>>
>>>>> Weâ€™re seeing a deadlock warning after we try to invalidate the PTEs
>>>>> inside the MMU notifier handler. More specific, when we try to
>>>>> update the BOs to invalidate PTEs using amdgpu_vm_bo_update. This
>>>>> uses kmalloc on the amdgpu_job_alloc which seems to be causing this
>>>>> problem.
>>>>>
>>>>> Based on @Kuehling, Felix <mailto:Felix.Kuehling at amd.com> comments,
>>>>> kmalloc without any special flags can cause memory reclaim. Doing
>>>>> that inside an MMU notifier is problematic, because an MMU notifier
>>>>> may be called inside a memory-reclaim operation itself. That would
>>>>> result in recursion. Also, reclaim shouldn't be done while holding a
>>>>> lock that can be taken in an MMU notifier for the same reason. If
>>>>> you cause a reclaim while holding that lock, then an MMU notifier
>>>>> called by the reclaim can deadlock trying to take the same lock.
>>>>>
>>>>> Please let us know if you have any advice to enable this the right
>>>>> way
>>>>>
>>>>> Thanks in advanced,
>>>>>
>>>>> Alejandro
>>>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx at lists.freedesktop.org
>>> https://list
>>> s.freedesktop.org/mailman/listinfo/amd-gfx
>>> ex.Sierra%40amd.com%7Cc6488f54fb204dcb398508d7749f70c1%7C3dd8961fe4884
>>> e608e11a82d994e183d%7C0%7C0%7C637106100468561307&sdata=toPB8T89gK%
>>> 2BUesFOq%2F7kofY5o3%2FH5aDMegklBj4gzqs%3D&reserved=0
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx