Deadlock on PTEs update for HMM

Thu Dec 5 01:27:27 UTC 2019

[adding the mailing list back]

Christian sent a 3-patch series that I just reviewed and commented on. 
That removes the need to lock the page directory reservation for 
updating page tables. That solves a part of the problem, but not all of it.

We still need to implement that driver lock for HMM, and add a 
vm_resident flag in the vm structure that could be checked in the new 
amdgpu_vm_evictable function Christian added. That function still runs 
in atomic context (under the ttm_bo_glob.lru_lock spinlock). So we can't 
lock a mutex there. But we can use trylock and return "not evictable" if 
we fail to take the mutex.

If we take the mutex successfully, we can update the vm_resident flag to 
false and allow the eviction.

The rest should work just the way I proposed earlier.

Regards,
 Â  Felix

On 2019-12-04 8:01 p.m., Sierra Guiza, Alejandro (Alex) wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi Christian,
> I wonder if you had have time to check on this implementation?
> Please let me know if there's something that I could help you with
>
> Regards,
> Alex
>
> -----Original Message-----
> From: Christian KÃ¶nig <ckoenig.leichtzumerken at gmail.com>
> Sent: Friday, November 29, 2019 1:41 AM
> To: Kuehling, Felix <Felix.Kuehling at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Sierra Guiza, Alejandro (Alex) <Alex.Sierra at amd.com>
> Cc: amd-gfx at lists.freedesktop.org
> Subject: Re: Deadlock on PTEs update for HMM
>
> [CAUTION: External Email]
>
> Hi Felix,
>
> yes that is exactly my thinking as well. Problem is that getting this to work was much harder than I thought.
>
> We can't use a mutex cause TTM is calling the eviction callback in atomic context. A spinlock doesn't looked like a good idea either because we potentially need to wait for the hardware with a fixed IB pool.
>
> Because of this I've started to rewrite the TTM handling to not call the driver in an atomic context any more, but that took me way longer than expected as well.
>
> I'm currently experimenting with using a trylock driver mutex, that at least that should work for now until we got something better.
>
> Regards,
> Christian.
>
> Am 28.11.19 um 21:30 schrieb Felix Kuehling:
>> Hi Christian,
>>
>> I'm thinking about this problem, trying to come up with a solution.
>> The fundamental problem is that we need low-overhead access to the
>> page table in the MMU notifier, without much memory management or
>> locking.
>>
>> There is one "driver lock" that we're supposed to take in the MMU
>> notifier as well as when we update page tables that is prescribed by
>> the HMM documentation (Documentation/vm/hmm.rst). I don't currently
>> see such a lock in amdgpu. We'll probably need to add that anyway,
>> with all the usual precautions about lock dependencies around MMU
>> notifiers. Then we could use that lock to protect page table residency
>> state, in addition to the reservation of the top-level page directory.
>>
>> We don't want to block eviction of page tables unconditionally, so the
>> MMU notifier must be able to deal with the situation that page tables
>> are not resident at the moment. But the lock can delay page tables
>> from being evicted while an MMU notifier is in progress and protect us
>> from race conditions between MMU notifiers invalidating PTEs, and page
>> tables getting evicted.
>>
>> amdgpu_vm_bo_invalidate could detect when a page table is being
>> evicted, and update a new "vm_resident" flag inside the amdgpu_vm
>> while holding the "HMM driver lock". If an MMU notifier is in
>> progress, trying to take the "HMM driver lock" will delay the eviction
>> long enough for any pending PTE invalidation to complete.
>>
>> In the case that page tables are not resident (vm_resident flag is
>> false), it means the GPU is currently not accessing any memory in that
>> amdgpu_vm address space. So we don't need to invalidate the PTEs right
>> away. I think we could implement a deferred invalidation mechanism for
>> this case, that delays the invalidation until the next time the page
>> tables are made resident. amdgpu_amdkfd_gpuvm_restore_process_bos
>> would replay any deferred PTE invalidations after validating the page
>> tables but before restarting the user mode queues for the process. If
>> graphics ever implements page-fault-based memory management, you'd
>> need to do something similar in amdgpu_cs.
>>
>> Once all that is in place, we should be able to update PTEs in MMU
>> notifiers without reserving the page tables.
>>
>> If we use SDMA for updating page tables, we may need a pre-allocated
>> IB for use in MMU notifiers. And there is problably other details to
>> be worked out about exactly how we implement the PTE invalidation in
>> MMU notifiers and reflecting that in the state of the amdgpu_vm and
>> amdgpu_bo_va_mapping.
>>
>> Does this idea sound reasonable to you? Can you think of a simpler
>> solution?
>>
>> Thanks,
>>    Felix
>>
>> On 2019-11-27 10:02 a.m., Christian KÃ¶nig wrote:
>>> Hi Alejandro,
>>>
>>> yes I'm very aware of this issue, but unfortunately can't give an
>>> easy solution either.
>>>
>>> I'm working for over a year now on getting this fixed, but
>>> unfortunately it turned out that this problem is much bigger than
>>> initially thought.
>>>
>>> Setting the appropriate GFP flags for the job allocation is actually
>>> the trivial part.
>>>
>>> The really really hard thing is that we somehow need to add a lock to
>>> prevent the page tables from being evicted. And as you also figured
>>> out that lock can't be taken easily anywhere else.
>>>
>>> I've already wrote a prototype for this, but didn't had time to
>>> hammer it into shape for upstreaming yet.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 27.11.19 um 15:55 schrieb Sierra Guiza, Alejandro (Alex):
>>>> Hi Christian,
>>>>
>>>> As you know, weâ€™re working on the HMM enablement. Im working on the
>>>> dGPU page table entries invalidation on the userptr mapping case.
>>>> Currently, the MMU notifiers handle stops all user mode queues,
>>>> schedule a delayed worker to re-validate userptr mappings and
>>>> restart the queues.
>>>>
>>>> Part of the HMM functionality, we need to invalidate the page table
>>>> entries instead of stopping the queues. At the same time we need to
>>>> move the revalidation of the userptr mappings into the page fault
>>>> handler.
>>>>
>>>> Weâ€™re seeing a deadlock warning after we try to invalidate the PTEs
>>>> inside the MMU notifier handler. More specific, when we try to
>>>> update the BOs to invalidate PTEs using amdgpu_vm_bo_update. This
>>>> uses kmalloc on the amdgpu_job_alloc which seems to be causing this
>>>> problem.
>>>>
>>>> Based on @Kuehling, Felix <mailto:Felix.Kuehling at amd.com> comments,
>>>> kmalloc without any special flags can cause memory reclaim. Doing
>>>> that inside an MMU notifier is problematic, because an MMU notifier
>>>> may be called inside a memory-reclaim operation itself. That would
>>>> result in recursion. Also, reclaim shouldn't be done while holding a
>>>> lock that can be taken in an MMU notifier for the same reason. If
>>>> you cause a reclaim while holding that lock, then an MMU notifier
>>>> called by the reclaim can deadlock trying to take the same lock.
>>>>
>>>> Please let us know if you have any advice to enable this the right
>>>> way
>>>>
>>>> Thanks in advanced,
>>>>
>>>> Alejandro
>>>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://list
>> s.freedesktop.org/mailman/listinfo/amd-gfx
>> ex.Sierra%40amd.com%7Cc6488f54fb204dcb398508d7749f70c1%7C3dd8961fe4884
>> e608e11a82d994e183d%7C0%7C0%7C637106100468561307&sdata=toPB8T89gK%
>> 2BUesFOq%2F7kofY5o3%2FH5aDMegklBj4gzqs%3D&reserved=0