[PATCH] drm/amdgpu: Fix recursive locking warning
Christian König
christian.koenig at amd.com
Fri Feb 4 16:25:20 UTC 2022
On 04.02.22 at 17:23, Felix Kuehling wrote:
>
> On 2022-02-04 at 02:13, Christian König wrote:
>> On 04.02.22 at 04:11, Rajneesh Bhardwaj wrote:
>>> Noticed the warning below while running a PyTorch workload on Vega10
>>> GPUs. Change to trylock to avoid conflicts with already-held
>>> reservation locks.
>>>
>>> [ +0.000003] WARNING: possible recursive locking detected
>>> [ +0.000003] 5.13.0-kfd-rajneesh #1030 Not tainted
>>> [ +0.000004] --------------------------------------------
>>> [ +0.000002] python/4822 is trying to acquire lock:
>>> [ +0.000004] ffff932cd9a259f8
>>> (reservation_ww_class_mutex){+.+.}-{3:3},
>>> at: amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>>> [ +0.000203]
>>> but task is already holding lock:
>>> [ +0.000003] ffff932cbb7181f8
>>> (reservation_ww_class_mutex){+.+.}-{3:3},
>>> at: ttm_eu_reserve_buffers+0x270/0x470 [ttm]
>>> [ +0.000017]
>>> other info that might help us debug this:
>>> [ +0.000002] Possible unsafe locking scenario:
>>>
>>> [ +0.000003] CPU0
>>> [ +0.000002] ----
>>> [ +0.000002] lock(reservation_ww_class_mutex);
>>> [ +0.000004] lock(reservation_ww_class_mutex);
>>> [ +0.000003]
>>> *** DEADLOCK ***
>>>
>>> [ +0.000002] May be due to missing lock nesting notation
>>>
>>> [ +0.000003] 7 locks held by python/4822:
>>> [ +0.000003] #0: ffff932c4ac028d0 (&process->mutex){+.+.}-{3:3}, at:
>>> kfd_ioctl_map_memory_to_gpu+0x10b/0x320 [amdgpu]
>>> [ +0.000232] #1: ffff932c55e830a8 (&info->lock#2){+.+.}-{3:3}, at:
>>> amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x64/0xf60 [amdgpu]
>>> [ +0.000241] #2: ffff932cc45b5e68 (&(*mem)->lock){+.+.}-{3:3}, at:
>>> amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0xdf/0xf60 [amdgpu]
>>> [ +0.000236] #3: ffffb2b35606fd28
>>> (reservation_ww_class_acquire){+.+.}-{0:0}, at:
>>> amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x232/0xf60 [amdgpu]
>>> [ +0.000235] #4: ffff932cbb7181f8
>>> (reservation_ww_class_mutex){+.+.}-{3:3}, at:
>>> ttm_eu_reserve_buffers+0x270/0x470 [ttm]
>>> [ +0.000015] #5: ffffffffc045f700 (*(sspp++)){....}-{0:0}, at:
>>> drm_dev_enter+0x5/0xa0 [drm]
>>> [ +0.000038] #6: ffff932c52da7078 (&vm->eviction_lock){+.+.}-{3:3},
>>> at: amdgpu_vm_bo_update_mapping+0xd5/0x4f0 [amdgpu]
>>> [ +0.000195]
>>> stack backtrace:
>>> [ +0.000003] CPU: 11 PID: 4822 Comm: python Not tainted
>>> 5.13.0-kfd-rajneesh #1030
>>> [ +0.000005] Hardware name: GIGABYTE MZ01-CE0-00/MZ01-CE0-00, BIOS F02
>>> 08/29/2018
>>> [ +0.000003] Call Trace:
>>> [ +0.000003] dump_stack+0x6d/0x89
>>> [ +0.000010] __lock_acquire+0xb93/0x1a90
>>> [ +0.000009] lock_acquire+0x25d/0x2d0
>>> [ +0.000005] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>>> [ +0.000184] ? lock_is_held_type+0xa2/0x110
>>> [ +0.000006] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>>> [ +0.000184] __ww_mutex_lock.constprop.17+0xca/0x1060
>>> [ +0.000007] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>>> [ +0.000183] ? lock_release+0x13f/0x270
>>> [ +0.000005] ? lock_is_held_type+0xa2/0x110
>>> [ +0.000006] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>>> [ +0.000183] amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>>> [ +0.000185] ttm_bo_release+0x4c6/0x580 [ttm]
>>> [ +0.000010] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
>>> [ +0.000183] amdgpu_vm_free_table+0x76/0xa0 [amdgpu]
>>> [ +0.000189] amdgpu_vm_free_pts+0xb8/0xf0 [amdgpu]
>>> [ +0.000189] amdgpu_vm_update_ptes+0x411/0x770 [amdgpu]
>>> [ +0.000191] amdgpu_vm_bo_update_mapping+0x324/0x4f0 [amdgpu]
>>> [ +0.000191] amdgpu_vm_bo_update+0x251/0x610 [amdgpu]
>>> [ +0.000191] update_gpuvm_pte+0xcc/0x290 [amdgpu]
>>> [ +0.000229] ? amdgpu_vm_bo_map+0xd7/0x130 [amdgpu]
>>> [ +0.000190] amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x912/0xf60
>>> [amdgpu]
>>> [ +0.000234] kfd_ioctl_map_memory_to_gpu+0x182/0x320 [amdgpu]
>>> [ +0.000218] kfd_ioctl+0x2b9/0x600 [amdgpu]
>>> [ +0.000216] ? kfd_ioctl_unmap_memory_from_gpu+0x270/0x270 [amdgpu]
>>> [ +0.000216] ? lock_release+0x13f/0x270
>>> [ +0.000006] ? __fget_files+0x107/0x1e0
>>> [ +0.000007] __x64_sys_ioctl+0x8b/0xd0
>>> [ +0.000007] do_syscall_64+0x36/0x70
>>> [ +0.000004] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> [ +0.000007] RIP: 0033:0x7fbff90a7317
>>> [ +0.000004] Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00
>>> 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f
>>> 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
>>> [ +0.000005] RSP: 002b:00007fbe301fe648 EFLAGS: 00000246 ORIG_RAX:
>>> 0000000000000010
>>> [ +0.000006] RAX: ffffffffffffffda RBX: 00007fbcc402d820 RCX:
>>> 00007fbff90a7317
>>> [ +0.000003] RDX: 00007fbe301fe690 RSI: 00000000c0184b18 RDI:
>>> 0000000000000004
>>> [ +0.000003] RBP: 00007fbe301fe690 R08: 0000000000000000 R09:
>>> 00007fbcc402d880
>>> [ +0.000003] R10: 0000000002001000 R11: 0000000000000246 R12:
>>> 00000000c0184b18
>>> [ +0.000003] R13: 0000000000000004 R14: 00007fbf689593a0 R15:
>>> 00007fbcc402d820
>>>
>>> Cc: Christian König <christian.koenig at amd.com>
>>> Cc: Felix Kuehling <Felix.Kuehling at amd.com>
>>> Cc: Alex Deucher <Alexander.Deucher at amd.com>
>>>
>>> Fixes: 627b92ef9d7c ("drm/amdgpu: Wipe all VRAM on free when RAS is
>>> enabled")
>>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj at amd.com>
>>
>> The fixes tag is not necessarily correct, I would remove that.
>>
>> But apart from that the patch is Reviewed-by: Christian König
>> <christian.koenig at amd.com>.
>
> I suggested the Fixes tag since it was my patch that introduced the
> problem. Without my patch, page table BOs wouldn't be cleared here,
> and it wouldn't get that recursive lock warning.

Yeah, but the problem existed before that. For example, it can happen
that we drop the last reference during validation as well.

So this is valuable to backport even without your patch.

Regards,
Christian.

>
> Either way, the patch is also
>
> Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>
>
>
>>
>> Thanks,
>> Christian.
>>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> index 36bb41b027ec..6ccd2be685f5 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> @@ -1306,7 +1306,8 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
>>>  	    !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
>>>  		return;
>>>
>>> -	dma_resv_lock(bo->base.resv, NULL);
>>> +	if (WARN_ON_ONCE(!dma_resv_trylock(bo->base.resv)))
>>> +		return;
>>>
>>>  	r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence);
>>>  	if (!WARN_ON(r)) {
>>
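
The quoted patch replaces a blocking dma_resv_lock() with
dma_resv_trylock() inside a release callback, because the last
reference can be dropped while the caller already holds other
reservation locks. As a rough userspace illustration of that pattern
only (this is not the amdgpu code), the sketch below assumes a
hypothetical struct fake_bo whose reservation lock is modeled with a
C11 atomic_flag. Every name starting with fake_ is made up for this
example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a buffer object; not a kernel type. */
struct fake_bo {
	atomic_flag resv;	/* models the per-object reservation lock */
	int refcount;
	bool wiped;		/* set once the "wipe on release" work ran */
};

void fake_bo_init(struct fake_bo *bo)
{
	atomic_flag_clear(&bo->resv);
	bo->refcount = 1;
	bo->wiped = false;
}

/* Non-blocking acquire: returns true only if the lock was free. */
bool fake_resv_trylock(struct fake_bo *bo)
{
	return !atomic_flag_test_and_set(&bo->resv);
}

void fake_resv_unlock(struct fake_bo *bo)
{
	atomic_flag_clear(&bo->resv);
}

/*
 * Release callback. It may run in a context that already holds other
 * locks of the same kind, so it must never wait: if the lock is busy,
 * report it and skip the optional wipe instead of blocking.
 */
bool fake_bo_release_notify(struct fake_bo *bo)
{
	if (!fake_resv_trylock(bo)) {
		fprintf(stderr, "fake_bo: reservation busy, skipping wipe\n");
		return false;
	}
	bo->wiped = true;
	fake_resv_unlock(bo);
	return true;
}

/* Drop one reference; the release callback runs on the last one. */
bool fake_bo_unref(struct fake_bo *bo)
{
	assert(bo->refcount > 0);
	if (--bo->refcount > 0)
		return false;
	return fake_bo_release_notify(bo);
}
```

With the lock free, fake_bo_unref() on the last reference performs the
wipe and returns true. If the lock is already taken when the last
reference is dropped, the callback returns false without waiting, which
is the behavior the trylock change is after. The trade-off in this
sketch is that the wipe is skipped in the contended case.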