[PATCH] drm/amd/amdgpu: fix a potential deadlock in gpu reset

Wed May 19 11:39:29 UTC 2021

On 2021-05-17 6:55 a.m., Christian König wrote:
> Am 17.05.21 um 12:52 schrieb Lang Yu:
>> When amdgpu_ib_ring_tests() fails, the reset logic calls
>> amdgpu_device_ip_suspend() twice, and the second call deadlocks
>> on adev->dm.dc_lock.
>>
>> Deadlock log:
>> [  805.655192] amdgpu 0000:04:00.0: amdgpu: ib ring test failed (-110).
>> [  806.011571] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach 
>> value 0x00000001 != 0x00000002
>> [  806.280139] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach 
>> value 0x00000001 != 0x00000002
>> [  806.290952] [drm] free PSP TMR buffer
>>
>> [  806.319406] ============================================
>> [  806.320315] WARNING: possible recursive locking detected
>> [  806.321225] 5.11.0-custom #1 Tainted: G        W  OEL
>> [  806.322135] --------------------------------------------
>> [  806.323043] cat/2593 is trying to acquire lock:
>> [  806.323825] ffff888136b1cdc8 (&adev->dm.dc_lock){+.+.}-{3:3}, at: 
>> dm_suspend+0xb8/0x1d0 [amdgpu]
>> [  806.325668]
>>                 but task is already holding lock:
>> [  806.326664] ffff888136b1cdc8 (&adev->dm.dc_lock){+.+.}-{3:3}, at: 
>> dm_suspend+0xb8/0x1d0 [amdgpu]
>> [  806.328430]
>>                 other info that might help us debug this:
>> [  806.329539]  Possible unsafe locking scenario:
>>
>> [  806.330549]        CPU0
>> [  806.330983]        ----
>> [  806.331416]   lock(&adev->dm.dc_lock);
>> [  806.332086]   lock(&adev->dm.dc_lock);
>> [  806.332738]
>>                  *** DEADLOCK ***
>>
>> [  806.333747]  May be due to missing lock nesting notation
>>
>> [  806.334899] 3 locks held by cat/2593:
>> [  806.335537]  #0: ffff888100d3f1b8 (&attr->mutex){+.+.}-{3:3}, at: 
>> simple_attr_read+0x4e/0x110
>> [  806.337009]  #1: ffff888136b1fd78 (&adev->reset_sem){++++}-{3:3}, 
>> at: amdgpu_device_lock_adev+0x42/0x94 [amdgpu]
>> [  806.339018]  #2: ffff888136b1cdc8 (&adev->dm.dc_lock){+.+.}-{3:3}, 
>> at: dm_suspend+0xb8/0x1d0 [amdgpu]
>> [  806.340869]
>>                 stack backtrace:
>> [  806.341621] CPU: 6 PID: 2593 Comm: cat Tainted: G        W  OEL    
>> 5.11.0-custom #1
>> [  806.342921] Hardware name: AMD Celadon-CZN/Celadon-CZN, BIOS 
>> WLD0C23N_Weekly_20_12_2 12/23/2020
>> [  806.344413] Call Trace:
>> [  806.344849]  dump_stack+0x93/0xbd
>> [  806.345435]  __lock_acquire.cold+0x18a/0x2cf
>> [  806.346179]  lock_acquire+0xca/0x390
>> [  806.346807]  ? dm_suspend+0xb8/0x1d0 [amdgpu]
>> [  806.347813]  __mutex_lock+0x9b/0x930
>> [  806.348454]  ? dm_suspend+0xb8/0x1d0 [amdgpu]
>> [  806.349434]  ? amdgpu_device_indirect_rreg+0x58/0x70 [amdgpu]
>> [  806.350581]  ? _raw_spin_unlock_irqrestore+0x47/0x50
>> [  806.351437]  ? dm_suspend+0xb8/0x1d0 [amdgpu]
>> [  806.352437]  ? rcu_read_lock_sched_held+0x4f/0x80
>> [  806.353252]  ? rcu_read_lock_sched_held+0x4f/0x80
>> [  806.354064]  mutex_lock_nested+0x1b/0x20
>> [  806.354747]  ? mutex_lock_nested+0x1b/0x20
>> [  806.355457]  dm_suspend+0xb8/0x1d0 [amdgpu]
>> [  806.356427]  ? soc15_common_set_clockgating_state+0x17d/0x19 [amdgpu]
>> [  806.357736]  amdgpu_device_ip_suspend_phase1+0x78/0xd0 [amdgpu]
>> [  806.360394]  amdgpu_device_ip_suspend+0x21/0x70 [amdgpu]
>> [  806.362926]  amdgpu_device_pre_asic_reset+0xb3/0x270 [amdgpu]
>> [  806.365560]  amdgpu_device_gpu_recover.cold+0x679/0x8eb [amdgpu]
>> [  806.368331]  ? __pm_runtime_resume+0x60/0x80
>> [  806.370509]  gpu_recover_get+0x2e/0x60 [amdgpu]
>> [  806.372887]  simple_attr_read+0x6d/0x110
>> [  806.374966]  debugfs_attr_read+0x49/0x70
>> [  806.377046]  full_proxy_read+0x5f/0x90
>> [  806.379054]  vfs_read+0xa3/0x190
>> [  806.380969]  ksys_read+0x70/0xf0
>> [  806.382833]  __x64_sys_read+0x1a/0x20
>> [  806.384803]  do_syscall_64+0x38/0x90
>> [  806.386743]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [  806.388946] RIP: 0033:0x7fb084ea1142
>> [  806.390914] Code: c0 e9 c2 fe ff ff 50 48 8d 3d 3a ca 0a 00 e8 f5 
>> 19 02 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 
>> 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 
>> 54 24
>> [  806.395496] RSP: 002b:00007fffde50ee08 EFLAGS: 00000246 ORIG_RAX: 
>> 0000000000000000
>> [  806.398298] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 
>> 00007fb084ea1142
>> [  806.401063] RDX: 0000000000020000 RSI: 00007fb0844ff000 RDI: 
>> 0000000000000003
>> [  806.403793] RBP: 00007fb0844ff000 R08: 00007fb0844fe010 R09: 
>> 0000000000000000
>> [  806.406516] R10: 0000000000000022 R11: 0000000000000246 R12: 
>> 0000555d3d3b51f0
>> [  806.409246] R13: 0000000000000003 R14: 0000000000020000 R15: 
>> 0000000000020000
> 
> I think we should shorten the backtrace here a bit.
> 
>>
>> Signed-off-by: Lang Yu <Lang.Yu at amd.com>
> 
> Looks sane to me, but Andrey should probably also take a look.
> 
> Acked-by: Christian König <christian.koenig at amd.com>

Yes, seems like a typo...

Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>

Andrey

> 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 -
>>   1 file changed, 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 7c6c435e5d02..ff341154394e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4476,7 +4476,6 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>>               r = amdgpu_ib_ring_tests(tmp_adev);
>>               if (r) {
>>                   dev_err(tmp_adev->dev, "ib ring test failed (%d).\n", r);
>> -                r = amdgpu_device_ip_suspend(tmp_adev);
>>                   need_full_reset = true;
>>                   r = -EAGAIN;
>>                   goto end;
>
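
For readers following the thread, here is a minimal userspace sketch of the failure mode in the lockdep report above: a suspend-like step returns with a non-recursive mutex held and is then called a second time on the same thread with no resume in between. This is an analogy only, not amdgpu code. The fake_* names and suspend_twice() are hypothetical, and a POSIX error-checking mutex is assumed so that the second lock attempt returns EDEADLK instead of hanging.

```c
/* Userspace analogy only; every name below is hypothetical. */
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Stand-in for adev->dm.dc_lock. */
static pthread_mutex_t fake_dc_lock;

/* Create an error-checking mutex so that relocking on the same
 * thread reports EDEADLK rather than blocking forever. */
static int fake_lock_init(void)
{
	pthread_mutexattr_t attr;
	int r = pthread_mutexattr_init(&attr);

	if (r)
		return r;
	r = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (!r)
		r = pthread_mutex_init(&fake_dc_lock, &attr);
	pthread_mutexattr_destroy(&attr);
	return r;
}

/* Stand-in for the suspend step: returns with the lock held. */
static int fake_suspend(void)
{
	return pthread_mutex_lock(&fake_dc_lock);
}

/* Stand-in for the resume step: drops the lock. */
static int fake_resume(void)
{
	return pthread_mutex_unlock(&fake_dc_lock);
}

/* Call the suspend step twice and return the result of the second call.
 * resume_in_between == 0 models two back-to-back suspends. */
static int suspend_twice(int resume_in_between)
{
	int first, second;

	if (fake_lock_init())
		return -1;

	first = fake_suspend();
	assert(first == 0);

	if (resume_in_between)
		fake_resume();

	second = fake_suspend();

	/* Exactly one acquisition is outstanding on either path. */
	fake_resume();
	pthread_mutex_destroy(&fake_dc_lock);
	return second;
}
```

suspend_twice(0) mirrors the double-suspend path and returns EDEADLK; suspend_twice(1) returns 0 because the lock is released before the second suspend. With a plain kernel mutex there is no error return, so the second acquisition simply blocks, which is why dropping the extra suspend call is the fix discussed here.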