[PATCH v3] drm/amdgpu: annotate a false positive recursive locking

Tue Aug 11 08:01:28 UTC 2020

[AMD Public Use]

Hi Christian,

Since it's out of amdgpu driver scope, so I just attached my patch in this email thread instead of sending it by git sent-email.

Regards,
Guchun

-----Original Message-----
From: Koenig, Christian <Christian.Koenig at amd.com> 
Sent: Tuesday, August 11, 2020 3:52 PM
To: Chen, Guchun <Guchun.Chen at amd.com>; Li, Dennis <Dennis.Li at amd.com>; amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>; Kuehling, Felix <Felix.Kuehling at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; daniel at ffwll.ch
Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking

Nice catch, it's not often that we stumble over something like this :)

Can you provide a patch to fix this? Problem is that we probably won't be able to push it to the AMD servers, but I can probably merge it through drm-misc-next.

Until this is fixed feel free to commit the patch with the {} in place.

Thanks,
Christian.

Am 11.08.20 um 09:44 schrieb Chen, Guchun:
> [AMD Public Use]
>
> # define down_write_nest_lock(sem, nest_lock)			\
> do {								\
> 	typecheck(struct lockdep_map *, &(nest_lock)->dep_map);	\
> 	_down_write_nest_lock(sem, &(nest_lock)->dep_map);	\
> } while (0);
>
> Looks the ';' after while (0) is the error point. It should be dropped.
>
> Regards,
> Guchun
>
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig at amd.com>
> Sent: Tuesday, August 11, 2020 3:41 PM
> To: Chen, Guchun <Guchun.Chen at amd.com>; Li, Dennis 
> <Dennis.Li at amd.com>; amd-gfx at lists.freedesktop.org; Deucher, Alexander 
> <Alexander.Deucher at amd.com>; Kuehling, Felix <Felix.Kuehling at amd.com>; 
> Zhang, Hawking <Hawking.Zhang at amd.com>; daniel at ffwll.ch
> Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive 
> recursive locking
>
> Am 11.08.20 um 09:36 schrieb Chen, Guchun:
>> [AMD Public Use]
>>
>>> -	down_write(&adev->reset_sem);
>>> +	if (hive) {
>>> +		down_write_nest_lock(&adev->reset_sem, &hive->hive_lock);
>>> +	} else
>>> +		down_write(&adev->reset_sem);
>> Coding style nit pick: You should drop the {} here.
>>
>> {} could not be dropped here, as down_write_nest_lock Is one macro with multiple lines, otherwise, build error would say 'else' missed one previous 'if'.
>> Instead of dropping, another {} should be added to else to include down_write(&adev->reset_sem), which makes the braces been balanced.
> Interesting bug, that is something which should not happen with macros in the Linux kernel. They protection by "do { ... } while(0)" is mandatory here.
>
> Let me take a look,
> Christian.
>
>> Regards,
>> Guchun
>>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of 
>> Christian König
>> Sent: Tuesday, August 11, 2020 2:53 PM
>> To: Li, Dennis <Dennis.Li at amd.com>; amd-gfx at lists.freedesktop.org; 
>> Deucher, Alexander <Alexander.Deucher at amd.com>; Kuehling, Felix 
>> <Felix.Kuehling at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; 
>> daniel at ffwll.ch
>> Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive 
>> recursive locking
>>
>> Am 11.08.20 um 04:12 schrieb Dennis Li:
>>> [  584.110304] ============================================
>>> [  584.110590] WARNING: possible recursive locking detected
>>> [  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
>>> [  584.111164] --------------------------------------------
>>> [  584.111456] kworker/38:1/553 is trying to acquire lock:
>>> [  584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at:
>>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.112112]
>>>                   but task is already holding lock:
>>> [  584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at:
>>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.113068]
>>>                   other info that might help us debug this:
>>> [  584.113689]  Possible unsafe locking scenario:
>>>
>>> [  584.114350]        CPU0
>>> [  584.114685]        ----
>>> [  584.115014]   lock(&adev->reset_sem);
>>> [  584.115349]   lock(&adev->reset_sem);
>>> [  584.115678]
>>>                    *** DEADLOCK ***
>>>
>>> [  584.116624]  May be due to missing lock nesting notation
>>>
>>> [  584.117284] 4 locks held by kworker/38:1/553:
>>> [  584.117616]  #0: ffff9ad635c1d348 ((wq_completion)events){+.+.},
>>> at: process_one_work+0x21f/0x630 [  584.117967]  #1: 
>>> ffffac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at:
>>> process_one_work+0x21f/0x630 [  584.118358]  #2: ffffffffc1c2a5d0 
>>> (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] [  584.118786]  #3: ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.119222]
>>>                   stack backtrace:
>>> [  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G           OE     5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
>>> [  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, 
>>> BIOS 3.1 05/23/2019 [  584.121223] Workqueue: events 
>>> amdgpu_ras_do_recovery [amdgpu] [  584.121638] Call Trace:
>>> [  584.122050]  dump_stack+0x98/0xd5 [  584.122499]  
>>> __lock_acquire+0x1139/0x16e0 [  584.122931]  ?
>>> trace_hardirqs_on+0x3b/0xf0 [  584.123358]  ?
>>> cancel_delayed_work+0xa6/0xc0 [  584.123771]  
>>> lock_acquire+0xb8/0x1c0 [  584.124197]  ? 
>>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.124599]  down_write+0x49/0x120 [  584.125032]  ?
>>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125472]
>>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125910]  ?
>>> amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [  584.126367]
>>> amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [  584.126789]
>>> process_one_work+0x29e/0x630 [  584.127208]  
>>> worker_thread+0x3c/0x3f0 [  584.127621]  ? 
>>> __kthread_parkme+0x61/0x90 [  584.128014]
>>> kthread+0x12f/0x150 [  584.128402]  ? process_one_work+0x630/0x630 [
>>> 584.128790]  ? kthread_park+0x90/0x90 [  584.129174]
>>> ret_from_fork+0x3a/0x50
>>>
>>> Each adev has owned lock_class_key to avoid false positive recursive 
>>> locking.
>>>
>>> v2:
>>> 1. register adev->lock_key into lockdep, otherwise lockdep will 
>>> report the below warning
>>>
>>> [ 1216.705820] BUG: key ffff890183b647d0 has not been registered!
>>> [ 1216.705924] ------------[ cut here ]------------ [ 1216.705972]
>>> DEBUG_LOCKS_WARN_ON(1) [ 1216.705997] WARNING: CPU: 20 PID: 541 at
>>> kernel/locking/lockdep.c:3743 lockdep_init_map+0x150/0x210
>>>
>>> v3:
>>> change to use down_write_nest_lock to annotate the false dead-lock 
>>> warning.
>>>
>>> Signed-off-by: Dennis Li <Dennis.Li at amd.com>
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 62ecac97fbd2..8a55b0bc044a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -4145,12 +4145,15 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
>>>     	return r;
>>>     }
>>>     
>>> -static bool amdgpu_device_lock_adev(struct amdgpu_device *adev)
>>> +static bool amdgpu_device_lock_adev(struct amdgpu_device *adev, 
>>> +struct amdgpu_hive_info *hive)
>>>     {
>>>     	if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
>>>     		return false;
>>>     
>>> -	down_write(&adev->reset_sem);
>>> +	if (hive) {
>>> +		down_write_nest_lock(&adev->reset_sem, &hive->hive_lock);
>>> +	} else
>>> +		down_write(&adev->reset_sem);
>> Coding style nit pick: You should drop the {} here.
>>
>> Apart from that the patch is Reviewed-by: Christian König 
>> <christian.koenig at amd.com>
>>
>>>     
>>>     	atomic_inc(&adev->gpu_reset_counter);
>>>     	switch (amdgpu_asic_reset_method(adev)) { @@ -4312,7 +4315,7 @@ 
>>> int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>     
>>>     	/* block all schedulers and reset given job's ring */
>>>     	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
>>> -		if (!amdgpu_device_lock_adev(tmp_adev)) {
>>> +		if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
>>>     			DRM_INFO("Bailing on TDR for s_job:%llx, as another already in progress",
>>>     				  job ? job->base.id : -1);
>>>     			r = 0;
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flis
>> t 
>> s.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7Cg
>> u
>> chun.chen%40amd.com%7C0a3fd06c0e774002a0e908d83dc337a0%7C3dd8961fe488
>> 4
>> e608e11a82d994e183d%7C0%7C0%7C637327255952065510&sdata=rJ%2BbWsWJ
>> 6
>> NfPUinwUHxnRfVxYi%2Ft4JQCTtamMATrLws%3D&reserved=0
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-locking-rwsem-remove-redundant-semicolon-of-down_wri.patch
Type: application/octet-stream
Size: 901 bytes
Desc: 0001-locking-rwsem-remove-redundant-semicolon-of-down_wri.patch
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20200811/bdd2cd22/attachment.obj>