[PATCH] drm/amdgpu: Fix recursive locking warning
Felix Kuehling
felix.kuehling at amd.com
Fri Feb 4 16:23:07 UTC 2022
On 2022-02-04 at 02:13, Christian König wrote:
> On 04.02.22 at 04:11, Rajneesh Bhardwaj wrote:
>> Noticed the below warning while running a pytorch workload on vega10
>> GPUs. Change to trylock to avoid conflicts with already held reservation
>> locks.
>>
>> [ +0.000003] WARNING: possible recursive locking detected
>> [ +0.000003] 5.13.0-kfd-rajneesh #1030 Not tainted
>> [ +0.000004] --------------------------------------------
>> [ +0.000002] python/4822 is trying to acquire lock:
>> [ +0.000004] ffff932cd9a259f8 (reservation_ww_class_mutex){+.+.}-{3:3},
>> at: amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>> [ +0.000203]
>> but task is already holding lock:
>> [ +0.000003] ffff932cbb7181f8 (reservation_ww_class_mutex){+.+.}-{3:3},
>> at: ttm_eu_reserve_buffers+0x270/0x470 [ttm]
>> [ +0.000017]
>> other info that might help us debug this:
>> [ +0.000002] Possible unsafe locking scenario:
>>
>> [ +0.000003] CPU0
>> [ +0.000002] ----
>> [ +0.000002] lock(reservation_ww_class_mutex);
>> [ +0.000004] lock(reservation_ww_class_mutex);
>> [ +0.000003]
>> *** DEADLOCK ***
>>
>> [ +0.000002] May be due to missing lock nesting notation
>>
>> [ +0.000003] 7 locks held by python/4822:
>> [ +0.000003] #0: ffff932c4ac028d0 (&process->mutex){+.+.}-{3:3}, at:
>> kfd_ioctl_map_memory_to_gpu+0x10b/0x320 [amdgpu]
>> [ +0.000232] #1: ffff932c55e830a8 (&info->lock#2){+.+.}-{3:3}, at:
>> amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x64/0xf60 [amdgpu]
>> [ +0.000241] #2: ffff932cc45b5e68 (&(*mem)->lock){+.+.}-{3:3}, at:
>> amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0xdf/0xf60 [amdgpu]
>> [ +0.000236] #3: ffffb2b35606fd28
>> (reservation_ww_class_acquire){+.+.}-{0:0}, at:
>> amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x232/0xf60 [amdgpu]
>> [ +0.000235] #4: ffff932cbb7181f8
>> (reservation_ww_class_mutex){+.+.}-{3:3}, at:
>> ttm_eu_reserve_buffers+0x270/0x470 [ttm]
>> [ +0.000015] #5: ffffffffc045f700 (*(sspp++)){....}-{0:0}, at:
>> drm_dev_enter+0x5/0xa0 [drm]
>> [ +0.000038] #6: ffff932c52da7078 (&vm->eviction_lock){+.+.}-{3:3},
>> at: amdgpu_vm_bo_update_mapping+0xd5/0x4f0 [amdgpu]
>> [ +0.000195]
>> stack backtrace:
>> [ +0.000003] CPU: 11 PID: 4822 Comm: python Not tainted
>> 5.13.0-kfd-rajneesh #1030
>> [ +0.000005] Hardware name: GIGABYTE MZ01-CE0-00/MZ01-CE0-00, BIOS F02
>> 08/29/2018
>> [ +0.000003] Call Trace:
>> [ +0.000003] dump_stack+0x6d/0x89
>> [ +0.000010] __lock_acquire+0xb93/0x1a90
>> [ +0.000009] lock_acquire+0x25d/0x2d0
>> [ +0.000005] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>> [ +0.000184] ? lock_is_held_type+0xa2/0x110
>> [ +0.000006] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>> [ +0.000184] __ww_mutex_lock.constprop.17+0xca/0x1060
>> [ +0.000007] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>> [ +0.000183] ? lock_release+0x13f/0x270
>> [ +0.000005] ? lock_is_held_type+0xa2/0x110
>> [ +0.000006] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>> [ +0.000183] amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>> [ +0.000185] ttm_bo_release+0x4c6/0x580 [ttm]
>> [ +0.000010] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
>> [ +0.000183] amdgpu_vm_free_table+0x76/0xa0 [amdgpu]
>> [ +0.000189] amdgpu_vm_free_pts+0xb8/0xf0 [amdgpu]
>> [ +0.000189] amdgpu_vm_update_ptes+0x411/0x770 [amdgpu]
>> [ +0.000191] amdgpu_vm_bo_update_mapping+0x324/0x4f0 [amdgpu]
>> [ +0.000191] amdgpu_vm_bo_update+0x251/0x610 [amdgpu]
>> [ +0.000191] update_gpuvm_pte+0xcc/0x290 [amdgpu]
>> [ +0.000229] ? amdgpu_vm_bo_map+0xd7/0x130 [amdgpu]
>> [ +0.000190] amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x912/0xf60
>> [amdgpu]
>> [ +0.000234] kfd_ioctl_map_memory_to_gpu+0x182/0x320 [amdgpu]
>> [ +0.000218] kfd_ioctl+0x2b9/0x600 [amdgpu]
>> [ +0.000216] ? kfd_ioctl_unmap_memory_from_gpu+0x270/0x270 [amdgpu]
>> [ +0.000216] ? lock_release+0x13f/0x270
>> [ +0.000006] ? __fget_files+0x107/0x1e0
>> [ +0.000007] __x64_sys_ioctl+0x8b/0xd0
>> [ +0.000007] do_syscall_64+0x36/0x70
>> [ +0.000004] entry_SYSCALL_64_after_hwframe+0x44/0xae
>> [ +0.000007] RIP: 0033:0x7fbff90a7317
>> [ +0.000004] Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00
>> 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f
>> 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
>> [ +0.000005] RSP: 002b:00007fbe301fe648 EFLAGS: 00000246 ORIG_RAX:
>> 0000000000000010
>> [ +0.000006] RAX: ffffffffffffffda RBX: 00007fbcc402d820 RCX:
>> 00007fbff90a7317
>> [ +0.000003] RDX: 00007fbe301fe690 RSI: 00000000c0184b18 RDI:
>> 0000000000000004
>> [ +0.000003] RBP: 00007fbe301fe690 R08: 0000000000000000 R09:
>> 00007fbcc402d880
>> [ +0.000003] R10: 0000000002001000 R11: 0000000000000246 R12:
>> 00000000c0184b18
>> [ +0.000003] R13: 0000000000000004 R14: 00007fbf689593a0 R15:
>> 00007fbcc402d820
>>
>> Cc: Christian König <christian.koenig at amd.com>
>> Cc: Felix Kuehling <Felix.Kuehling at amd.com>
>> Cc: Alex Deucher <Alexander.Deucher at amd.com>
>>
>> Fixes: 627b92ef9d7c ("drm/amdgpu: Wipe all VRAM on free when RAS is
>> enabled")
>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj at amd.com>
>
> The fixes tag is not necessarily correct, I would remove that.
>
> But apart from that the patch is Reviewed-by: Christian König
> <christian.koenig at amd.com>.
I suggested the Fixes tag because it was my patch that introduced the
problem. Without my patch, page table BOs wouldn't be cleared here, and
this code path wouldn't trigger the recursive locking warning.
Either way, the patch is also
Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>
>
> Thanks,
> Christian.
>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> index 36bb41b027ec..6ccd2be685f5 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> @@ -1306,7 +1306,8 @@ void amdgpu_bo_release_notify(struct
>> ttm_buffer_object *bo)
>> !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
>> return;
>> - dma_resv_lock(bo->base.resv, NULL);
>> + if (WARN_ON_ONCE(!dma_resv_trylock(bo->base.resv)))
>> + return;
>> r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv,
>> &fence);
>> if (!WARN_ON(r)) {
>
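
For readers following the thread, the shape of the change in the hunk above
(blocking lock replaced by trylock-and-bail) can be sketched in userspace.
This is only an illustration under stated assumptions: a pthread mutex stands
in for the dma_resv lock, and struct fake_bo, fake_bo_init() and
fake_bo_release_notify() are hypothetical names, not amdgpu or TTM code. It
does not model lockdep lock classes or ww_mutex acquire contexts, and it
returns false where the patch would hit WARN_ON_ONCE and return.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a buffer object and its reservation lock. */
struct fake_bo {
	pthread_mutex_t resv;	/* plays the role of bo->base.resv */
	bool wiped;		/* set once the release-time wipe has run */
};

static void fake_bo_init(struct fake_bo *bo)
{
	int rc = pthread_mutex_init(&bo->resv, NULL);

	assert(rc == 0);
	(void)rc;
	bo->wiped = false;
}

/*
 * Release-time cleanup that must never block on the reservation lock.
 * Returns true if the wipe ran, false if the lock was already held and
 * the wipe was skipped.
 */
static bool fake_bo_release_notify(struct fake_bo *bo)
{
	/* trylock returns non-zero immediately instead of waiting */
	if (pthread_mutex_trylock(&bo->resv) != 0)
		return false;

	bo->wiped = true;	/* stands in for the VRAM fill */
	pthread_mutex_unlock(&bo->resv);
	return true;
}
```

With the lock free, fake_bo_release_notify() wipes and returns true. If the
caller already holds the lock, it returns false right away instead of
blocking, which is the trade-off the patch makes: a skipped wipe plus a
one-time warning rather than a potential deadlock.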