[PATCH] drm/amd/dispaly: fix deadlock issue in amdgpu reset

Tue Mar 23 11:38:48 UTC 2021

+ Harry and Nick

On 2021-03-22 9:42 p.m., Yu, Lang wrote:
> [AMD Official Use Only - Internal Distribution Only]
> 
> 
> 
> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
> Sent: Monday, March 22, 2021 11:01 PM
> To: Yu, Lang <Lang.Yu at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Huang, Ray <Ray.Huang at amd.com>
> Subject: Re: [PATCH] drm/amd/dispaly: fix deadlock issue in amdgpu reset
> 
> 
> 
> On 2021-03-22 4:11 a.m., Lang Yu wrote:
>> In amdggpu reset, while dm.dc_lock is held by dm_suspend,
>> handle_hpd_rx_irq tries to acquire it. Deadlock occurred!
>>
>> Deadlock log:
>>
>> [  104.528304] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
>>
>> [  104.640084] ======================================================
>> [  104.640092] WARNING: possible circular locking dependency detected
>> [  104.640099] 5.11.0-custom #1 Tainted: G        W   E
>> [  104.640107] ------------------------------------------------------
>> [  104.640114] cat/1158 is trying to acquire lock:
>> [  104.640120] ffff88810a09ce00
>> ((work_completion)(&lh->work)){+.+.}-{0:0}, at: __flush_work+0x2e3/0x450 [  104.640144]
>>                  but task is already holding lock:
>> [  104.640151] ffff88810a09cc70 (&adev->dm.dc_lock){+.+.}-{3:3}, at:
>> dm_suspend+0xb2/0x1d0 [amdgpu] [  104.640581]
>>                  which lock already depends on the new lock.
>>
>> [  104.640590]
>>                  the existing dependency chain (in reverse order) is:
>> [  104.640598]
>>                  -> #2 (&adev->dm.dc_lock){+.+.}-{3:3}:
>> [  104.640611]        lock_acquire+0xca/0x390
>> [  104.640623]        __mutex_lock+0x9b/0x930
>> [  104.640633]        mutex_lock_nested+0x1b/0x20
>> [  104.640640]        handle_hpd_rx_irq+0x9b/0x1c0 [amdgpu]
>> [  104.640959]        dm_irq_work_func+0x4e/0x60 [amdgpu]
>> [  104.641264]        process_one_work+0x2a7/0x5b0
>> [  104.641275]        worker_thread+0x4a/0x3d0
>> [  104.641283]        kthread+0x125/0x160
>> [  104.641290]        ret_from_fork+0x22/0x30
>> [  104.641300]
>>                  -> #1 (&aconnector->hpd_lock){+.+.}-{3:3}:
>> [  104.641312]        lock_acquire+0xca/0x390
>> [  104.641321]        __mutex_lock+0x9b/0x930
>> [  104.641328]        mutex_lock_nested+0x1b/0x20
>> [  104.641336]        handle_hpd_rx_irq+0x67/0x1c0 [amdgpu]
>> [  104.641635]        dm_irq_work_func+0x4e/0x60 [amdgpu]
>> [  104.641931]        process_one_work+0x2a7/0x5b0
>> [  104.641940]        worker_thread+0x4a/0x3d0
>> [  104.641948]        kthread+0x125/0x160
>> [  104.641954]        ret_from_fork+0x22/0x30
>> [  104.641963]
>>                  -> #0 ((work_completion)(&lh->work)){+.+.}-{0:0}:
>> [  104.641975]        check_prev_add+0x94/0xbf0
>> [  104.641983]        __lock_acquire+0x130d/0x1ce0
>> [  104.641992]        lock_acquire+0xca/0x390
>> [  104.642000]        __flush_work+0x303/0x450
>> [  104.642008]        flush_work+0x10/0x20
>> [  104.642016]        amdgpu_dm_irq_suspend+0x93/0x100 [amdgpu]
>> [  104.642312]        dm_suspend+0x181/0x1d0 [amdgpu]
>> [  104.642605]        amdgpu_device_ip_suspend_phase1+0x8a/0x100 [amdgpu]
>> [  104.642835]        amdgpu_device_ip_suspend+0x21/0x70 [amdgpu]
>> [  104.643066]        amdgpu_device_pre_asic_reset+0x1bd/0x1d2 [amdgpu]
>> [  104.643403]        amdgpu_device_gpu_recover.cold+0x5df/0xa9d [amdgpu]
>> [  104.643715]        gpu_recover_get+0x2e/0x60 [amdgpu]
>> [  104.643951]        simple_attr_read+0x6d/0x110
>> [  104.643960]        debugfs_attr_read+0x49/0x70
>> [  104.643970]        full_proxy_read+0x5f/0x90
>> [  104.643979]        vfs_read+0xa3/0x190
>> [  104.643986]        ksys_read+0x70/0xf0
>> [  104.643992]        __x64_sys_read+0x1a/0x20
>> [  104.643999]        do_syscall_64+0x38/0x90
>> [  104.644007]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [  104.644017]
>>                  other info that might help us debug this:
>>
>> [  104.644026] Chain exists of:
>>                    (work_completion)(&lh->work) -->
>> &aconnector->hpd_lock --> &adev->dm.dc_lock
>>
>> [  104.644043]  Possible unsafe locking scenario:
>>
>> [  104.644049]        CPU0                    CPU1
>> [  104.644055]        ----                    ----
>> [  104.644060]   lock(&adev->dm.dc_lock);
>> [  104.644066]                                lock(&aconnector->hpd_lock);
>> [  104.644075]                                lock(&adev->dm.dc_lock);
>> [  104.644083]   lock((work_completion)(&lh->work));
>> [  104.644090]
>>                   *** DEADLOCK ***
>>
>> [  104.644096] 3 locks held by cat/1158:
>> [  104.644103]  #0: ffff88810d0e4eb8 (&attr->mutex){+.+.}-{3:3}, at:
>> simple_attr_read+0x4e/0x110 [  104.644119]  #1: ffff88810a0a1600
>> (&adev->reset_sem){++++}-{3:3}, at: amdgpu_device_lock_adev+0x42/0x94
>> [amdgpu] [  104.644489]  #2: ffff88810a09cc70
>> (&adev->dm.dc_lock){+.+.}-{3:3}, at: dm_suspend+0xb2/0x1d0 [amdgpu]
>>
>> Signed-off-by: Lang Yu <Lang.Yu at amd.com>
>> ---
>>    drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 6 ++++--
>>    1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> index e176ea84d75b..8727488df769 100644
>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> @@ -2657,13 +2657,15 @@ static void handle_hpd_rx_irq(void *param)
>>    		}
>>    	}
>>    
>> -	mutex_lock(&adev->dm.dc_lock);
>> +	if (!amdgpu_in_reset(adev))
>> +		mutex_lock(&adev->dm.dc_lock);
>>    #ifdef CONFIG_DRM_AMD_DC_HDCP
>>    	result = dc_link_handle_hpd_rx_irq(dc_link, &hpd_irq_data, NULL);
>>    #else
>>    	result = dc_link_handle_hpd_rx_irq(dc_link, NULL, NULL);
>>    #endif
>> -	mutex_unlock(&adev->dm.dc_lock);
>> +	if (!amdgpu_in_reset(adev))
>> +		mutex_unlock(&adev->dm.dc_lock);
> 
> Why is it ok to stop locking if in GPU reset ?
> 
> Andrey
> 
> If stopping locking in GPU reset, GPU reset works, though it's not safe.
> If locking in GPU reset, deadlock occurred, GPU reset never works.
> Display guys should give a perfect fix. It has been blocked for a long time,
> it's time to take some actions.
> 
> Regards,
> Lang

Harry, Nick - In dm_suspend, would it be possible to drop dm->dc_lock 
before calling amdgpu_dm_irq_suspend and then locking again ? Problem
here is that amdgpu_dm_irq_suspend is flushing hpd_rx work while holding
dc_lock but as far as I remember you only need to grab dc_lock when
operating on streams and  planes/surfaces and I don't see any of this
in amdgpu_dm_irq_suspend.

Andrey

>>    
>>    out:
>>    	if (result && !is_mst_root_connector) {
>>