[PATCH] drm/amdgpu: annotate a false positive recursive locking

Fri Aug 7 07:08:05 UTC 2020

Am 2020-08-07 um 2:57 a.m. schrieb Christian König:
[snip]
> That's a really good argument, but I still hesitate to merge this
> patch. How severe is the lockdep splat?

I argued before that any lockdep splat is bad, because it disables
further lockdep checking and can hide other lockdep problems down the
road. The longer you live with it, the more potential locking problems
you can introduce unknowingly, that you'll need to work through one at a
time, as you uncover them later.

Regards,
  Felix

>
> At bare minimum we need a "/* TODO: ...." comment why we do this and
> how to remove the workaround again.
>
> Regards,
> Christian.
>
>>
>> Best Regards
>> Dennis lI
>>> Regards,
>>> Christian.
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 06.08.20 um 11:15 schrieb Li, Dennis:
>>>>> [AMD Official Use Only - Internal Distribution Only]
>>>>>
>>>>> Hi, Christian,
>>>>>          For this case, it is safe to use separated lock key.
>>>>> Please see the definition of init_rwsem as the below. Every
>>>>> init_rwsem calling will use a new static key, but devices of  the
>>>>> hive share the same code to do initialization, so their
>>>>> lock_class_key are the same. I think it is a limitation of
>>>>> init_rwsem.  In our case, it should be correct that reset_sem of
>>>>> each adev has different  lock_class_key. BTW, this change doesn't
>>>>> effect dead-lock detection and just correct it.
>>>>>
>>>>> #define init_rwsem(sem)                    \
>>>>> do {                                \
>>>>>     static struct lock_class_key __key;            \
>>>>>                                 \
>>>>>     __init_rwsem((sem), #sem, &__key);            \
>>>>> } while (0)
>>>>>
>>>>> Best Regards
>>>>> Dennis Li
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>> Sent: Thursday, August 6, 2020 4:13 PM
>>>>> To: Li, Dennis <Dennis.Li at amd.com>; amd-gfx at lists.freedesktop.org;
>>>>> Deucher, Alexander <Alexander.Deucher at amd.com>; Kuehling, Felix
>>>>> <Felix.Kuehling at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>
>>>>> Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive
>>>>> locking
>>>>>
>>>>> Preventing locking problems during implementation is obviously a
>>>>> good approach, but lockdep has proven to be massively useful for
>>>>> finding and fixing problems.
>>>>>
>>>>> Disabling lockdep splat by annotating lock with separate classes
>>>>> is usually a no-go and only allowed if there is no other potential
>>>>> approach.
>>>>>
>>>>> In this case here we should really clean things up instead.
>>>>>
>>>>> Christian.
>>>>>
>>>>> Am 06.08.20 um 09:44 schrieb Li, Dennis:
>>>>>> [AMD Official Use Only - Internal Distribution Only]
>>>>>>
>>>>>> Hi, Christian,
>>>>>>            I agree with your concern. However we shouldn't rely
>>>>>> on system to detect dead-lock, and should consider this when
>>>>>> doing code implementation. In fact, dead-lock detection isn't
>>>>>> enabled by default.
>>>>>>            For your proposal to remove reset_sem into the hive
>>>>>> structure, we can open a new topic to discuss it. Currently we
>>>>>> couldn't make sure which is the best solution. For example, with
>>>>>> your proposal, we must wait for all devices resuming successfully
>>>>>> before resubmit an old task in one device, which will effect
>>>>>> performance.
>>>>>>
>>>>>> Best Regards
>>>>>> Dennis Li
>>>>>> -----Original Message-----
>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
>>>>>> Christian König
>>>>>> Sent: Thursday, August 6, 2020 3:08 PM
>>>>>> To: Li, Dennis <Dennis.Li at amd.com>; amd-gfx at lists.freedesktop.org;
>>>>>> Deucher, Alexander <Alexander.Deucher at amd.com>; Kuehling, Felix
>>>>>> <Felix.Kuehling at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>
>>>>>> Subject: Re: [PATCH] drm/amdgpu: annotate a false positive
>>>>>> recursive locking
>>>>>>
>>>>>> Am 06.08.20 um 09:02 schrieb Dennis Li:
>>>>>>> [  584.110304] ============================================
>>>>>>> [  584.110590] WARNING: possible recursive locking detected
>>>>>>> [  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted:
>>>>>>> G           OE
>>>>>>> [  584.111164] --------------------------------------------
>>>>>>> [  584.111456] kworker/38:1/553 is trying to acquire lock:
>>>>>>> [  584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at:
>>>>>>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.112112]
>>>>>>>                      but task is already holding lock:
>>>>>>> [  584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at:
>>>>>>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.113068]
>>>>>>>                      other info that might help us debug this:
>>>>>>> [  584.113689]  Possible unsafe locking scenario:
>>>>>>>
>>>>>>> [  584.114350]        CPU0
>>>>>>> [  584.114685]        ----
>>>>>>> [  584.115014]   lock(&adev->reset_sem);
>>>>>>> [  584.115349]   lock(&adev->reset_sem);
>>>>>>> [  584.115678]
>>>>>>>                       *** DEADLOCK ***
>>>>>>>
>>>>>>> [  584.116624]  May be due to missing lock nesting notation
>>>>>>>
>>>>>>> [  584.117284] 4 locks held by kworker/38:1/553:
>>>>>>> [  584.117616]  #0: ffff9ad635c1d348
>>>>>>> ((wq_completion)events){+.+.},
>>>>>>> at: process_one_work+0x21f/0x630 [  584.117967]  #1:
>>>>>>> ffffac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.},
>>>>>>> at:
>>>>>>> process_one_work+0x21f/0x630 [  584.118358]  #2: ffffffffc1c2a5d0
>>>>>>> (&tmp->hive_lock){+.+.}, at:
>>>>>>> amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] [  584.118786] 
>>>>>>> #3: ffff9b1603d247a0 (&adev->reset_sem){++++}, at:
>>>>>>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.119222]
>>>>>>>                      stack backtrace:
>>>>>>> [  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded
>>>>>>> Tainted: G           OE     5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
>>>>>>> [  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT,
>>>>>>> BIOS 3.1 05/23/2019 [  584.121223] Workqueue: events
>>>>>>> amdgpu_ras_do_recovery [amdgpu] [  584.121638] Call Trace:
>>>>>>> [  584.122050]  dump_stack+0x98/0xd5 [  584.122499]
>>>>>>> __lock_acquire+0x1139/0x16e0 [  584.122931]  ?
>>>>>>> trace_hardirqs_on+0x3b/0xf0 [  584.123358]  ?
>>>>>>> cancel_delayed_work+0xa6/0xc0 [  584.123771]
>>>>>>> lock_acquire+0xb8/0x1c0 [  584.124197]  ?
>>>>>>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.124599] 
>>>>>>> down_write+0x49/0x120 [  584.125032]  ?
>>>>>>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125472]
>>>>>>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125910]  ?
>>>>>>> amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [  584.126367]
>>>>>>> amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [  584.126789]
>>>>>>> process_one_work+0x29e/0x630 [  584.127208]
>>>>>>> worker_thread+0x3c/0x3f0 [  584.127621]  ?
>>>>>>> __kthread_parkme+0x61/0x90 [  584.128014]
>>>>>>> kthread+0x12f/0x150 [  584.128402]  ? process_one_work+0x630/0x630
>>>>>>> kthread+[
>>>>>>> 584.128790]  ? kthread_park+0x90/0x90 [  584.129174]
>>>>>>> ret_from_fork+0x3a/0x50
>>>>>>>
>>>>>>> Each adev has owned lock_class_key to avoid false positive
>>>>>>> recursive locking.
>>>>>> NAK, that is not a false positive but a real problem.
>>>>>>
>>>>>> The issue here is that we have multiple reset semaphores, one for
>>>>>> each device in the hive. If those are not acquired in the correct
>>>>>> order we deadlock.
>>>>>>
>>>>>> The real solution would be to move the reset_sem into the hive
>>>>>> structure and make sure that we lock it only once.
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>>> Signed-off-by: Dennis Li <Dennis.Li at amd.com>
>>>>>>> Change-Id: I7571efeccbf15483982031d00504a353031a854a
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> index e97c088d03b3..766dc8f8c8a0 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> @@ -967,6 +967,7 @@ struct amdgpu_device {
>>>>>>>            atomic_t                        in_gpu_reset;
>>>>>>>            enum pp_mp1_state               mp1_state;
>>>>>>>            struct rw_semaphore    reset_sem;
>>>>>>> +    struct lock_class_key lock_key;
>>>>>>>            struct amdgpu_doorbell_index doorbell_index;
>>>>>>>                   struct mutex            notifier_lock;
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index 6c572db42d92..d78df9312d34 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -3037,6 +3037,7 @@ int amdgpu_device_init(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>            mutex_init(&adev->virt.vf_errors.lock);
>>>>>>>            hash_init(adev->mn_hash);
>>>>>>>            init_rwsem(&adev->reset_sem);
>>>>>>> +    lockdep_set_class(&adev->reset_sem, &adev->lock_key);
>>>>>>>            atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>            mutex_init(&adev->psp.mutex);
>>>>>>>            mutex_init(&adev->notifier_lock);
>>>>>> _______________________________________________
>>>>>> amd-gfx mailing list
>>>>>> amd-gfx at lists.freedesktop.org
>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fl
>>>>>> i
>>>>>> s
>>>>>> t
>>>>>> s.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7
>>>>>> C
>>>>>> D
>>>>>> e
>>>>>> nnis.Li%40amd.com%7C56c95f939ddd441bd10408d839d77c9e%7C3dd8961fe488
>>>>>> 4
>>>>>> e
>>>>>> 6
>>>>>> 08e11a82d994e183d%7C0%7C0%7C637322944985436656&sdata=%2FBoRyEW3
>>>>>> i
>>>>>> K
>>>>>> 5
>>>>>> 9Y52ctLWd4y1lOmi2CInb6lpIgAF88i4%3D&reserved=0
>