Another circular lock dependency involving fs-reclaim and drm_buddy

Tue Dec 6 15:14:33 UTC 2022

Am 2022-12-06 um 03:20 schrieb Christian König:
> Hi Felix,
>
> to be honest I think the whole approach you take here and unmapping 
> the range with the lock held is highly problematic.
>
> Essentially unmapping is something which can run concurrently with 
> other operations and should never be blocked by such a lock. In other 
> words it is perfectly valid for an unmap to run while another 
> processor tries to map the same range.

I strongly disagree. When an MMU notifier invalidates a virtual address 
range, it must guarantee that the device will not access the current 
physical location of the invalidated range any more. Without the lock, 
another processor could easily overwrite the invalidated PTEs with valid 
(but outdated) addresses that the MMU notifier just removed.

This is the relevant paragraph in the HMM doc:

    [example of driver updating its page table ...]
           take_lock(driver->update);
           if (mmu_interval_read_retry(&ni, range.notifier_seq) {
               release_lock(driver->update);
               goto again;
           }

           /* Use pfns array content to update device page table,
            * under the update lock */

           release_lock(driver->update);
           return 0;
      }

    The driver->update lock is the same lock that the driver takes
    inside its
    invalidate() callback. That lock must be held before calling
    mmu_interval_read_retry() to avoid any race with a concurrent CPU
    page table
    update.

>
> That's why the same page fault can occur multiple times until the page 
> tables are up to date.
>
> When you get an MMU notification that range A->B is invalidated you 
> trigger invalidation of range A->B inside the GPUVM address space as 
> well and that's completely independent what's mapped there.

Not exactly. Because we're in a shared virtual address space, we know 
exactly what's mapped there. If the MMU notifier invalidates it, we have 
to coordinate with it to make sure we don't write incorrect addresses 
into those PTEs. In the example above, holding the lock guarantees that 
a concurrent MMU notifier will not return until after it has overwritten 
the PTEs we just updated. Alternatively mmu_interval_read_retry fails, 
which means the MMU notifier got there first and we have to get updated 
addresses with hmm_range_fault. Either way, the page table ends up with 
PTEs with V=0 before the MMU notifier returns.

Without holding the lock while writing the page table, we could end up 
with a mix of valid (V=1, but outdated address) and invalid PTEs when 
there is a concurrent validation and invalidation of the same virtual 
address range.

Regards,
   Felix

>
> Invalidating the mappings and eventually scheduling that they are 
> re-created is a separate step which should come independent of this if 
> I'm not completely mistaken.
>
> Regards,
> Christian.
>
> Am 06.12.22 um 01:04 schrieb Felix Kuehling:
>> We fixed a similar issue with Philip's patch "drm/amdgpu: Drop 
>> eviction lock when allocating PT BO", but there was another one 
>> hiding underneath that (see the log below). The problem is, that 
>> we're still allocating page tables while holding the prange->lock in 
>> the kfd_svm code, which is also held in MMU notifiers. This creates a 
>> lock dependency between the vram_mgr->lock and fs-reclaim.
>>
>> There are three ways around this:
>>
>>  * Do the page table mapping outside prange->lock (creates a race,
>>    contradicts advice in Documentation/vm/hmm.rst)
>>  * Avoid all page table allocation in amdgpu_vm_update_range (i.e.
>>    allocate page tables ahead of time somehow)
>>  * Wrap vram_mgr->lock with memalloc_noreclaim_save/restore (may cause
>>    memory allocations to fail in drm_buddy under memory pressure)
>>
>> I only mention the first one for completeness. It's not a valid 
>> solution. Neither of the remaining two options are particularly 
>> appealing either.
>>
>> Regards,
>>   Felix
>>
>>
>> [   83.979486] ======================================================
>> [   83.986583] WARNING: possible circular locking dependency detected
>> [   83.993643] 5.19.0-kfd-fkuehlin #75 Not tainted
>> [   83.999044] ------------------------------------------------------
>> [   84.006088] kfdtest/3453 is trying to acquire lock:
>> [   84.011820] ffff9a998561e210 (&prange->lock){+.+.}-{3:3}, at: 
>> svm_range_cpu_invalidate_pagetables+0x90/0x5e0 [amdgpu]
>> [   84.023911]
>>                but task is already holding lock:
>> [   84.031608] ffffffffbcd929c0 
>> (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: 
>> unmap_vmas+0x5/0x170
>> [   84.041992]
>>                which lock already depends on the new lock.
>>
>> [   84.052785]
>>                the existing dependency chain (in reverse order) is:
>> [   84.061993]
>>                -> #3 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}:
>> [   84.071548]        fs_reclaim_acquire+0x6d/0xd0
>> [   84.076941]        kmem_cache_alloc_trace+0x34/0x760
>> [   84.082766]        alloc_workqueue_attrs+0x1b/0x50
>> [   84.088411]        workqueue_init+0x88/0x318
>> [   84.093533]        kernel_init_freeable+0x134/0x28f
>> [   84.099258]        kernel_init+0x16/0x130
>> [   84.104107]        ret_from_fork+0x1f/0x30
>> [   84.109038]
>>                -> #2 (fs_reclaim){+.+.}-{0:0}:
>> [   84.116348]        fs_reclaim_acquire+0xa1/0xd0
>> [   84.121697]        kmem_cache_alloc+0x2c/0x760
>> [   84.126948]        drm_block_alloc.isra.0+0x27/0x50 [drm_buddy]
>> [   84.133679]        split_block+0x4d/0x140 [drm_buddy]
>> [   84.139539]        drm_buddy_alloc_blocks+0x385/0x580 [drm_buddy]
>> [   84.146435]        amdgpu_vram_mgr_new+0x213/0x4f0 [amdgpu]
>> [   84.153399]        ttm_resource_alloc+0x31/0x80 [ttm]
>> [   84.159366]        ttm_bo_mem_space+0x8f/0x230 [ttm]
>> [   84.165169]        ttm_bo_validate+0xc5/0x170 [ttm]
>> [   84.170872]        ttm_bo_init_reserved+0x1a6/0x230 [ttm]
>> [   84.177075]        amdgpu_bo_create+0x1a0/0x510 [amdgpu]
>> [   84.183600]        amdgpu_bo_create_reserved+0x188/0x1e0 [amdgpu]
>> [   84.190803]        amdgpu_bo_create_kernel_at+0x64/0x200 [amdgpu]
>> [   84.197994]        amdgpu_ttm_init+0x420/0x4c0 [amdgpu]
>> [   84.204301]        gmc_v10_0_sw_init+0x33a/0x530 [amdgpu]
>> [   84.210813]        amdgpu_device_init.cold+0x10d4/0x17a1 [amdgpu]
>> [   84.218077]        amdgpu_driver_load_kms+0x15/0x110 [amdgpu]
>> [   84.224919]        amdgpu_pci_probe+0x142/0x350 [amdgpu]
>> [   84.231313]        local_pci_probe+0x40/0x80
>> [   84.236437]        work_for_cpu_fn+0x10/0x20
>> [   84.241500]        process_one_work+0x270/0x5a0
>> [   84.246805]        worker_thread+0x203/0x3c0
>> [   84.251828]        kthread+0xea/0x110
>> [   84.256229]        ret_from_fork+0x1f/0x30
>> [   84.261061]
>>                -> #1 (&mgr->lock){+.+.}-{3:3}:
>> [   84.268156]        __mutex_lock+0x9a/0xf30
>> [   84.272967]        amdgpu_vram_mgr_new+0x14a/0x4f0 [amdgpu]
>> [   84.279752]        ttm_resource_alloc+0x31/0x80 [ttm]
>> [   84.285602]        ttm_bo_mem_space+0x8f/0x230 [ttm]
>> [   84.291321]        ttm_bo_validate+0xc5/0x170 [ttm]
>> [   84.296939]        ttm_bo_init_reserved+0xe2/0x230 [ttm]
>> [   84.302969]        amdgpu_bo_create+0x1a0/0x510 [amdgpu]
>> [   84.309297]        amdgpu_bo_create_vm+0x2e/0x80 [amdgpu]
>> [   84.315656]        amdgpu_vm_pt_create+0xf5/0x270 [amdgpu]
>> [   84.322090]        amdgpu_vm_ptes_update+0x6c4/0x8f0 [amdgpu]
>> [   84.328793]        amdgpu_vm_update_range+0x29b/0x730 [amdgpu]
>> [   84.335537]        svm_range_validate_and_map+0xc78/0x1390 [amdgpu]
>> [   84.342734]        svm_range_set_attr+0xc74/0x1170 [amdgpu]
>> [   84.349222]        kfd_ioctl+0x189/0x5c0 [amdgpu]
>> [   84.354808]        __x64_sys_ioctl+0x80/0xb0
>> [   84.359738]        do_syscall_64+0x35/0x80
>> [   84.364481]        entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> [   84.370687]
>>                -> #0 (&prange->lock){+.+.}-{3:3}:
>> [   84.377864]        __lock_acquire+0x12ed/0x27e0
>> [   84.383027]        lock_acquire+0xca/0x2e0
>> [   84.387759]        __mutex_lock+0x9a/0xf30
>> [   84.392491] svm_range_cpu_invalidate_pagetables+0x90/0x5e0 [amdgpu]
>> [   84.400345] __mmu_notifier_invalidate_range_start+0x1d3/0x230
>> [   84.407410]        unmap_vmas+0x162/0x170
>> [   84.412126]        unmap_region+0xa8/0x110
>> [   84.416905]        __do_munmap+0x209/0x4f0
>> [   84.421680]        __vm_munmap+0x78/0x120
>> [   84.426365]        __x64_sys_munmap+0x17/0x20
>> [   84.431392]        do_syscall_64+0x35/0x80
>> [   84.436164]        entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> [   84.442405]
>>                other info that might help us debug this:
>>
>> [   84.452431] Chain exists of:
>>                  &prange->lock --> fs_reclaim --> 
>> mmu_notifier_invalidate_range_start
>>
>> [   84.466395]  Possible unsafe locking scenario:
>>
>> [   84.473720]        CPU0                    CPU1
>> [   84.479020]        ----                    ----
>> [   84.484296]   lock(mmu_notifier_invalidate_range_start);
>> [   84.490333]                                lock(fs_reclaim);
>> [   84.496705] lock(mmu_notifier_invalidate_range_start);
>> [   84.505246]   lock(&prange->lock);
>> [   84.509361]
>>                 *** DEADLOCK ***
>>
>> [   84.517360] 2 locks held by kfdtest/3453:
>> [   84.522060]  #0: ffff9a99821ec4a8 (&mm->mmap_lock#2){++++}-{3:3}, 
>> at: __do_munmap+0x417/0x4f0
>> [   84.531287]  #1: ffffffffbcd929c0 
>> (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: 
>> unmap_vmas+0x5/0x170
>> [   84.541896]
>>                stack backtrace:
>> [   84.547630] CPU: 3 PID: 3453 Comm: kfdtest Not tainted 
>> 5.19.0-kfd-fkuehlin #75
>> [   84.555537] Hardware name: To Be Filled By O.E.M. To Be Filled By 
>> O.E.M./EPYCD8-2T, BIOS P2.60 04/10/2020
>> [   84.565788] Call Trace:
>> [   84.568925]  <TASK>
>> [   84.571702]  dump_stack_lvl+0x45/0x59
>> [   84.576034]  check_noncircular+0xfe/0x110
>> [   84.580715]  ? kernel_text_address+0x6d/0xe0
>> [   84.585652]  __lock_acquire+0x12ed/0x27e0
>> [   84.590340]  lock_acquire+0xca/0x2e0
>> [   84.594595]  ? svm_range_cpu_invalidate_pagetables+0x90/0x5e0 
>> [amdgpu]
>> [   84.602338]  __mutex_lock+0x9a/0xf30
>> [   84.606714]  ? svm_range_cpu_invalidate_pagetables+0x90/0x5e0 
>> [amdgpu]
>> [   84.614262]  ? svm_range_cpu_invalidate_pagetables+0x90/0x5e0 
>> [amdgpu]
>> [   84.621806]  ? svm_range_cpu_invalidate_pagetables+0x90/0x5e0 
>> [amdgpu]
>> [   84.629353]  svm_range_cpu_invalidate_pagetables+0x90/0x5e0 [amdgpu]
>> [   84.636742]  ? lock_release+0x139/0x2b0
>> [   84.641374] __mmu_notifier_invalidate_range_start+0x1d3/0x230
>> [   84.647976]  unmap_vmas+0x162/0x170
>> [   84.652203]  unmap_region+0xa8/0x110
>> [   84.656503]  __do_munmap+0x209/0x4f0
>> [   84.660792]  __vm_munmap+0x78/0x120
>> [   84.664977]  __x64_sys_munmap+0x17/0x20
>> [   84.669499]  do_syscall_64+0x35/0x80
>> [   84.673755]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> [   84.679485] RIP: 0033:0x7f32872eb97b
>> [   84.683738] Code: 8b 15 19 35 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff 
>> ff ff eb 89 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 0b 00 00 
>> 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e5 34 0d 00 f7 d8 64 
>> 89 01 48
>> [   84.703915] RSP: 002b:00007fffb06c4508 EFLAGS: 00000246 ORIG_RAX: 
>> 000000000000000b
>> [   84.712205] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 
>> 00007f32872eb97b
>> [   84.720072] RDX: 0000000004000000 RSI: 0000000004000000 RDI: 
>> 00007f32831ae000
>> [   84.727944] RBP: 00007fffb06c4750 R08: 00007fffb06c4548 R09: 
>> 000055e7570ad230
>> [   84.735809] R10: 000055e757088010 R11: 0000000000000246 R12: 
>> 000055e75453cefa
>> [   84.743688] R13: 0000000000000000 R14: 0000000000000021 R15: 
>> 0000000000000000
>> [   84.751584]  </TASK>
>>
>>
>