Another circular lock dependency involving fs-reclaim and drm_buddy

Christian König christian.koenig at amd.com
Thu Dec 8 12:32:23 UTC 2022


Hi Felix,

digging through the code I think I now know how we can solve this.

The lock which needs to protect the validity of the pages is the 
vm->eviction_lock, because that is what serializes the updates of our 
page tables.

This means that the amdgpu_vm_update_range() function needs to be 
extended further with enough information to check the HMM sequence after 
grabbing the eviction lock.
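
Something like the sketch below is what I have in mind. To be clear, the 
extra notifier/notifier_seq parameters, the -EAGAIN retry path and the 
elided arguments are purely illustrative, not a finished patch:

    /* Caller passes down the interval notifier and the sequence number it
     * got before hmm_range_fault(), so it can be re-checked here. */
    int amdgpu_vm_update_range(struct amdgpu_device *adev, struct amdgpu_vm *vm,
                               ..., struct mmu_interval_notifier *notifier,
                               unsigned long notifier_seq, ...)
    {
            int r;

            amdgpu_vm_eviction_lock(vm);

            /* The eviction lock serializes page table updates, so checking
             * the sequence here closes the race with a concurrent CPU
             * invalidation without holding the notifier lock across the
             * page table allocations. */
            if (notifier && mmu_interval_read_retry(notifier, notifier_seq)) {
                    amdgpu_vm_eviction_unlock(vm);
                    return -EAGAIN; /* caller goes back to hmm_range_fault() */
            }

            /* existing PT walk/update, may allocate page table BOs */
            r = amdgpu_vm_ptes_update(...);

            amdgpu_vm_eviction_unlock(vm);
            return r;
    }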

Or am I missing something here?

Regards,
Christian.

On 06.12.22 at 16:57, Christian König wrote:
> On 06.12.22 at 16:14, Felix Kuehling wrote:
>> On 2022-12-06 at 03:20, Christian König wrote:
>>> Hi Felix,
>>>
>>> to be honest, I think the whole approach you are taking here of 
>>> unmapping the range with the lock held is highly problematic.
>>>
>>> Essentially unmapping is something which can run concurrently with 
>>> other operations and should never be blocked by such a lock. In 
>>> other words it is perfectly valid for an unmap to run while another 
>>> processor tries to map the same range.
>>
>> I strongly disagree. When an MMU notifier invalidates a virtual 
>> address range, it must guarantee that the device will not access the 
>> current physical location of the invalidated range any more. Without 
>> the lock, another processor could easily overwrite the invalidated 
>> PTEs with valid (but outdated) addresses that the MMU notifier just 
>> removed.
>>
>> This is the relevant paragraph in the HMM doc:
>>
>>    [example of driver updating its page table ...]
>>           take_lock(driver->update);
>>           if (mmu_interval_read_retry(&ni, range.notifier_seq)) {
>>               release_lock(driver->update);
>>               goto again;
>>           }
>>
>>           /* Use pfns array content to update device page table,
>>            * under the update lock */
>>
>>           release_lock(driver->update);
>>           return 0;
>>      }
>>
>>    The driver->update lock is the same lock that the driver takes inside
>>    its invalidate() callback. That lock must be held before calling
>>    mmu_interval_read_retry() to avoid any race with a concurrent CPU
>>    page table update.
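>>
>> For reference, the matching invalidate() side follows the same pattern.
>> A minimal sketch against the mmu_interval_notifier API (driver_invalidate
>> and the take_lock/release_lock placeholders just mirror the doc example
>> above; everything else is illustrative):
>>
>>           static bool driver_invalidate(struct mmu_interval_notifier *mni,
>>                                         const struct mmu_notifier_range *range,
>>                                         unsigned long cur_seq)
>>           {
>>               if (!mmu_notifier_range_blockable(range))
>>                   return false;
>>
>>               take_lock(driver->update);
>>               /* Bump the sequence under the same lock, so a concurrent
>>                * mmu_interval_read_retry() in the fault path is guaranteed
>>                * to see this invalidation. */
>>               mmu_interval_set_seq(mni, cur_seq);
>>               /* ... zap the device PTEs for [range->start, range->end) ... */
>>               release_lock(driver->update);
>>               return true;
>>           }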
>
>
> There is something massively missing here since that won't work at all.
>
> The problem is that we can't allocate memory or fault pages while 
> holding any lock which is also taken inside the invalidation callback 
> of the MMU notifier.
>
> Updating the PTEs will always need memory, e.g. we can't guarantee 
> that we won't allocate page tables. Otherwise dynamically allocating 
> page tables won't work any more.
>
>>>
>>> That's why the same page fault can occur multiple times until the 
>>> page tables are up to date.
>>>
>>> When you get an MMU notification that range A->B is invalidated, you 
>>> trigger invalidation of range A->B inside the GPUVM address space as 
>>> well, and that's completely independent of what's mapped there.
>>
>> Not exactly. Because we're in a shared virtual address space, we know 
>> exactly what's mapped there.
>
> Yeah, but that's irrelevant. We are not allowed to use this 
> information, because otherwise we run exactly into the described issues 
> again.
>
>> If the MMU notifier invalidates it, we have to coordinate with it to 
>> make sure we don't write incorrect addresses into those PTEs. In the 
>> example above, holding the lock guarantees that a concurrent MMU 
>> notifier will not return until after it has overwritten the PTEs we 
>> just updated. Alternatively mmu_interval_read_retry fails, which 
>> means the MMU notifier got there first and we have to get updated 
>> addresses with hmm_range_fault. Either way, the page table ends up 
>> with PTEs with V=0 before the MMU notifier returns.
>>
>> Without holding the lock while writing the page table, we could end 
>> up with a mix of valid (V=1, but outdated address) and invalid PTEs 
>> when there is a concurrent validation and invalidation of the same 
>> virtual address range.
>
> Yeah, I see the problem now as well. But I can't come up with any way 
> to solve this.
>
> Regards,
> Christian.
>
>>
>> Regards,
>>   Felix
>>
>>
>>>
>>> Invalidating the mappings and eventually scheduling their re-creation 
>>> is a separate step which should happen independently of this, if I'm 
>>> not completely mistaken.
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 06.12.22 at 01:04, Felix Kuehling wrote:
>>>> We fixed a similar issue with Philip's patch "drm/amdgpu: Drop 
>>>> eviction lock when allocating PT BO", but there was another one 
>>>> hiding underneath that (see the log below). The problem is that 
>>>> we're still allocating page tables while holding the prange->lock 
>>>> in the kfd_svm code, which is also held in MMU notifiers. This 
>>>> creates a lock dependency between the vram_mgr->lock and fs-reclaim.
>>>>
>>>> There are three ways around this:
>>>>
>>>>  * Do the page table mapping outside prange->lock (creates a race,
>>>>    contradicts advice in Documentation/vm/hmm.rst)
>>>>  * Avoid all page table allocation in amdgpu_vm_update_range (i.e.
>>>>    allocate page tables ahead of time somehow)
>>>>  * Wrap vram_mgr->lock with memalloc_noreclaim_save/restore (may cause
>>>>    memory allocations to fail in drm_buddy under memory pressure)
>>>>
>>>> I only mention the first one for completeness. It's not a valid 
>>>> solution. Neither of the remaining two options is particularly 
>>>> appealing either.
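>>>>
>>>> For illustration, the third option would look roughly like this inside
>>>> amdgpu_vram_mgr_new() (only a sketch; needs linux/sched/mm.h, and the
>>>> variable names around the drm_buddy call are approximate):
>>>>
>>>>     unsigned int noreclaim_flags;
>>>>
>>>>     /* PF_MEMALLOC: allocations made by drm_buddy while we hold
>>>>      * mgr->lock won't recurse into direct reclaim, at the cost of
>>>>      * letting those allocations fail under memory pressure. */
>>>>     noreclaim_flags = memalloc_noreclaim_save();
>>>>     mutex_lock(&mgr->lock);
>>>>
>>>>     r = drm_buddy_alloc_blocks(mm, fpfn, lpfn, size, min_block_size,
>>>>                                &vres->blocks, flags);
>>>>
>>>>     mutex_unlock(&mgr->lock);
>>>>     memalloc_noreclaim_restore(noreclaim_flags);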
>>>>
>>>> Regards,
>>>>   Felix
>>>>
>>>>
>>>> [   83.979486] ======================================================
>>>> [   83.986583] WARNING: possible circular locking dependency detected
>>>> [   83.993643] 5.19.0-kfd-fkuehlin #75 Not tainted
>>>> [   83.999044] ------------------------------------------------------
>>>> [   84.006088] kfdtest/3453 is trying to acquire lock:
>>>> [   84.011820] ffff9a998561e210 (&prange->lock){+.+.}-{3:3}, at: 
>>>> svm_range_cpu_invalidate_pagetables+0x90/0x5e0 [amdgpu]
>>>> [   84.023911]
>>>>                but task is already holding lock:
>>>> [   84.031608] ffffffffbcd929c0 
>>>> (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: 
>>>> unmap_vmas+0x5/0x170
>>>> [   84.041992]
>>>>                which lock already depends on the new lock.
>>>>
>>>> [   84.052785]
>>>>                the existing dependency chain (in reverse order) is:
>>>> [   84.061993]
>>>>                -> #3 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}:
>>>> [   84.071548]        fs_reclaim_acquire+0x6d/0xd0
>>>> [   84.076941]        kmem_cache_alloc_trace+0x34/0x760
>>>> [   84.082766]        alloc_workqueue_attrs+0x1b/0x50
>>>> [   84.088411]        workqueue_init+0x88/0x318
>>>> [   84.093533]        kernel_init_freeable+0x134/0x28f
>>>> [   84.099258]        kernel_init+0x16/0x130
>>>> [   84.104107]        ret_from_fork+0x1f/0x30
>>>> [   84.109038]
>>>>                -> #2 (fs_reclaim){+.+.}-{0:0}:
>>>> [   84.116348]        fs_reclaim_acquire+0xa1/0xd0
>>>> [   84.121697]        kmem_cache_alloc+0x2c/0x760
>>>> [   84.126948]        drm_block_alloc.isra.0+0x27/0x50 [drm_buddy]
>>>> [   84.133679]        split_block+0x4d/0x140 [drm_buddy]
>>>> [   84.139539]        drm_buddy_alloc_blocks+0x385/0x580 [drm_buddy]
>>>> [   84.146435]        amdgpu_vram_mgr_new+0x213/0x4f0 [amdgpu]
>>>> [   84.153399]        ttm_resource_alloc+0x31/0x80 [ttm]
>>>> [   84.159366]        ttm_bo_mem_space+0x8f/0x230 [ttm]
>>>> [   84.165169]        ttm_bo_validate+0xc5/0x170 [ttm]
>>>> [   84.170872]        ttm_bo_init_reserved+0x1a6/0x230 [ttm]
>>>> [   84.177075]        amdgpu_bo_create+0x1a0/0x510 [amdgpu]
>>>> [   84.183600]        amdgpu_bo_create_reserved+0x188/0x1e0 [amdgpu]
>>>> [   84.190803]        amdgpu_bo_create_kernel_at+0x64/0x200 [amdgpu]
>>>> [   84.197994]        amdgpu_ttm_init+0x420/0x4c0 [amdgpu]
>>>> [   84.204301]        gmc_v10_0_sw_init+0x33a/0x530 [amdgpu]
>>>> [   84.210813]        amdgpu_device_init.cold+0x10d4/0x17a1 [amdgpu]
>>>> [   84.218077]        amdgpu_driver_load_kms+0x15/0x110 [amdgpu]
>>>> [   84.224919]        amdgpu_pci_probe+0x142/0x350 [amdgpu]
>>>> [   84.231313]        local_pci_probe+0x40/0x80
>>>> [   84.236437]        work_for_cpu_fn+0x10/0x20
>>>> [   84.241500]        process_one_work+0x270/0x5a0
>>>> [   84.246805]        worker_thread+0x203/0x3c0
>>>> [   84.251828]        kthread+0xea/0x110
>>>> [   84.256229]        ret_from_fork+0x1f/0x30
>>>> [   84.261061]
>>>>                -> #1 (&mgr->lock){+.+.}-{3:3}:
>>>> [   84.268156]        __mutex_lock+0x9a/0xf30
>>>> [   84.272967]        amdgpu_vram_mgr_new+0x14a/0x4f0 [amdgpu]
>>>> [   84.279752]        ttm_resource_alloc+0x31/0x80 [ttm]
>>>> [   84.285602]        ttm_bo_mem_space+0x8f/0x230 [ttm]
>>>> [   84.291321]        ttm_bo_validate+0xc5/0x170 [ttm]
>>>> [   84.296939]        ttm_bo_init_reserved+0xe2/0x230 [ttm]
>>>> [   84.302969]        amdgpu_bo_create+0x1a0/0x510 [amdgpu]
>>>> [   84.309297]        amdgpu_bo_create_vm+0x2e/0x80 [amdgpu]
>>>> [   84.315656]        amdgpu_vm_pt_create+0xf5/0x270 [amdgpu]
>>>> [   84.322090]        amdgpu_vm_ptes_update+0x6c4/0x8f0 [amdgpu]
>>>> [   84.328793]        amdgpu_vm_update_range+0x29b/0x730 [amdgpu]
>>>> [   84.335537]        svm_range_validate_and_map+0xc78/0x1390 [amdgpu]
>>>> [   84.342734]        svm_range_set_attr+0xc74/0x1170 [amdgpu]
>>>> [   84.349222]        kfd_ioctl+0x189/0x5c0 [amdgpu]
>>>> [   84.354808]        __x64_sys_ioctl+0x80/0xb0
>>>> [   84.359738]        do_syscall_64+0x35/0x80
>>>> [   84.364481]        entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>>> [   84.370687]
>>>>                -> #0 (&prange->lock){+.+.}-{3:3}:
>>>> [   84.377864]        __lock_acquire+0x12ed/0x27e0
>>>> [   84.383027]        lock_acquire+0xca/0x2e0
>>>> [   84.387759]        __mutex_lock+0x9a/0xf30
>>>> [   84.392491]        svm_range_cpu_invalidate_pagetables+0x90/0x5e0 [amdgpu]
>>>> [   84.400345]        __mmu_notifier_invalidate_range_start+0x1d3/0x230
>>>> [   84.407410]        unmap_vmas+0x162/0x170
>>>> [   84.412126]        unmap_region+0xa8/0x110
>>>> [   84.416905]        __do_munmap+0x209/0x4f0
>>>> [   84.421680]        __vm_munmap+0x78/0x120
>>>> [   84.426365]        __x64_sys_munmap+0x17/0x20
>>>> [   84.431392]        do_syscall_64+0x35/0x80
>>>> [   84.436164]        entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>>> [   84.442405]
>>>>                other info that might help us debug this:
>>>>
>>>> [   84.452431] Chain exists of:
>>>>                  &prange->lock --> fs_reclaim --> mmu_notifier_invalidate_range_start
>>>>
>>>> [   84.466395]  Possible unsafe locking scenario:
>>>>
>>>> [   84.473720]        CPU0                    CPU1
>>>> [   84.479020]        ----                    ----
>>>> [   84.484296]   lock(mmu_notifier_invalidate_range_start);
>>>> [   84.490333]                                lock(fs_reclaim);
>>>> [   84.496705]                                lock(mmu_notifier_invalidate_range_start);
>>>> [   84.505246]   lock(&prange->lock);
>>>> [   84.509361]
>>>>                 *** DEADLOCK ***
>>>>
>>>> [   84.517360] 2 locks held by kfdtest/3453:
>>>> [   84.522060]  #0: ffff9a99821ec4a8 
>>>> (&mm->mmap_lock#2){++++}-{3:3}, at: __do_munmap+0x417/0x4f0
>>>> [   84.531287]  #1: ffffffffbcd929c0 
>>>> (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: 
>>>> unmap_vmas+0x5/0x170
>>>> [   84.541896]
>>>>                stack backtrace:
>>>> [   84.547630] CPU: 3 PID: 3453 Comm: kfdtest Not tainted 
>>>> 5.19.0-kfd-fkuehlin #75
>>>> [   84.555537] Hardware name: To Be Filled By O.E.M. To Be Filled 
>>>> By O.E.M./EPYCD8-2T, BIOS P2.60 04/10/2020
>>>> [   84.565788] Call Trace:
>>>> [   84.568925]  <TASK>
>>>> [   84.571702]  dump_stack_lvl+0x45/0x59
>>>> [   84.576034]  check_noncircular+0xfe/0x110
>>>> [   84.580715]  ? kernel_text_address+0x6d/0xe0
>>>> [   84.585652]  __lock_acquire+0x12ed/0x27e0
>>>> [   84.590340]  lock_acquire+0xca/0x2e0
>>>> [   84.594595]  ? svm_range_cpu_invalidate_pagetables+0x90/0x5e0 
>>>> [amdgpu]
>>>> [   84.602338]  __mutex_lock+0x9a/0xf30
>>>> [   84.606714]  ? svm_range_cpu_invalidate_pagetables+0x90/0x5e0 
>>>> [amdgpu]
>>>> [   84.614262]  ? svm_range_cpu_invalidate_pagetables+0x90/0x5e0 
>>>> [amdgpu]
>>>> [   84.621806]  ? svm_range_cpu_invalidate_pagetables+0x90/0x5e0 
>>>> [amdgpu]
>>>> [   84.629353]  svm_range_cpu_invalidate_pagetables+0x90/0x5e0 [amdgpu]
>>>> [   84.636742]  ? lock_release+0x139/0x2b0
>>>> [   84.641374]  __mmu_notifier_invalidate_range_start+0x1d3/0x230
>>>> [   84.647976]  unmap_vmas+0x162/0x170
>>>> [   84.652203]  unmap_region+0xa8/0x110
>>>> [   84.656503]  __do_munmap+0x209/0x4f0
>>>> [   84.660792]  __vm_munmap+0x78/0x120
>>>> [   84.664977]  __x64_sys_munmap+0x17/0x20
>>>> [   84.669499]  do_syscall_64+0x35/0x80
>>>> [   84.673755]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>>> [   84.679485] RIP: 0033:0x7f32872eb97b
>>>> [   84.683738] Code: 8b 15 19 35 0d 00 f7 d8 64 89 02 48 c7 c0 ff 
>>>> ff ff ff eb 89 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 0b 
>>>> 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e5 34 0d 00 f7 
>>>> d8 64 89 01 48
>>>> [   84.703915] RSP: 002b:00007fffb06c4508 EFLAGS: 00000246 
>>>> ORIG_RAX: 000000000000000b
>>>> [   84.712205] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 
>>>> 00007f32872eb97b
>>>> [   84.720072] RDX: 0000000004000000 RSI: 0000000004000000 RDI: 
>>>> 00007f32831ae000
>>>> [   84.727944] RBP: 00007fffb06c4750 R08: 00007fffb06c4548 R09: 
>>>> 000055e7570ad230
>>>> [   84.735809] R10: 000055e757088010 R11: 0000000000000246 R12: 
>>>> 000055e75453cefa
>>>> [   84.743688] R13: 0000000000000000 R14: 0000000000000021 R15: 
>>>> 0000000000000000
>>>> [   84.751584]  </TASK>
>>>>
>>>>
>>>
>


