[PATCH v3] drm/amdgpu: annotate a false positive recursive locking

Li, Dennis Dennis.Li at amd.com
Tue Aug 11 10:34:09 UTC 2020


[AMD Official Use Only - Internal Distribution Only]


Quick grep says it is protected like that ... can you pls paste the full build error without {}?
-Daniel

[Dennis Li] hi, Daniel, the full build error as the following: 

make: Entering directory '/home/yajunl/workspace/amd/brahma-staging/BUILD/x86_64/linux'
  CC [M]  drivers/gpu/drm/amd/amdgpu/amdgpu_device.o
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c: In function ‘amdgpu_device_lock_adev’:
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:4155:2: error: ‘else’ without a previous ‘if’
  else
  ^~~~
scripts/Makefile.build:267: recipe for target 'drivers/gpu/drm/amd/amdgpu/amdgpu_device.o' failed
make[1]: *** [drivers/gpu/drm/amd/amdgpu/amdgpu_device.o] Error 1
Makefile:1683: recipe for target 'drivers/gpu/drm/amd/amdgpu' failed
make: *** [drivers/gpu/drm/amd/amdgpu] Error 2



>
> Let me take a look,
> Christian.
>
> >
> > Regards,
> > Guchun
> >
> > -----Original Message-----
> > From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of 
> > Christian König
> > Sent: Tuesday, August 11, 2020 2:53 PM
> > To: Li, Dennis <Dennis.Li at amd.com>; amd-gfx at lists.freedesktop.org; 
> > Deucher, Alexander <Alexander.Deucher at amd.com>; Kuehling, Felix 
> > <Felix.Kuehling at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; 
> > daniel at ffwll.ch
> > Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive 
> > recursive locking
> >
> > Am 11.08.20 um 04:12 schrieb Dennis Li:
> >> [  584.110304] ============================================
> >> [  584.110590] WARNING: possible recursive locking detected
> >> [  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
> >> [  584.111164] --------------------------------------------
> >> [  584.111456] kworker/38:1/553 is trying to acquire lock:
> >> [  584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at:
> >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.112112]
> >>                  but task is already holding lock:
> >> [  584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at:
> >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.113068]
> >>                  other info that might help us debug this:
> >> [  584.113689]  Possible unsafe locking scenario:
> >>
> >> [  584.114350]        CPU0
> >> [  584.114685]        ----
> >> [  584.115014]   lock(&adev->reset_sem);
> >> [  584.115349]   lock(&adev->reset_sem);
> >> [  584.115678]
> >>                   *** DEADLOCK ***
> >>
> >> [  584.116624]  May be due to missing lock nesting notation
> >>
> >> [  584.117284] 4 locks held by kworker/38:1/553:
> >> [  584.117616]  #0: ffff9ad635c1d348 ((wq_completion)events){+.+.},
> >> at: process_one_work+0x21f/0x630 [  584.117967]  #1: 
> >> ffffac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at:
> >> process_one_work+0x21f/0x630 [  584.118358]  #2: ffffffffc1c2a5d0 
> >> (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] [  584.118786]  #3: ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.119222]
> >>                  stack backtrace:
> >> [  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G           OE     5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
> >> [  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, 
> >> BIOS 3.1 05/23/2019 [  584.121223] Workqueue: events 
> >> amdgpu_ras_do_recovery [amdgpu] [  584.121638] Call Trace:
> >> [  584.122050]  dump_stack+0x98/0xd5 [  584.122499]  
> >> __lock_acquire+0x1139/0x16e0 [  584.122931]  ?
> >> trace_hardirqs_on+0x3b/0xf0 [  584.123358]  ?
> >> cancel_delayed_work+0xa6/0xc0 [  584.123771]  
> >> lock_acquire+0xb8/0x1c0 [  584.124197]  ? 
> >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.124599]  down_write+0x49/0x120 [  584.125032]  ?
> >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125472]
> >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125910]  ?
> >> amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [  584.126367]
> >> amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [  584.126789]
> >> process_one_work+0x29e/0x630 [  584.127208]  
> >> worker_thread+0x3c/0x3f0 [  584.127621]  ? 
> >> __kthread_parkme+0x61/0x90 [  584.128014]
> >> kthread+0x12f/0x150 [  584.128402]  ? process_one_work+0x630/0x630 
> >> kthread+[
> >> 584.128790]  ? kthread_park+0x90/0x90 [  584.129174]
> >> ret_from_fork+0x3a/0x50
> >>
> >> Each adev has owned lock_class_key to avoid false positive 
> >> recursive locking.
> >>
> >> v2:
> >> 1. register adev->lock_key into lockdep, otherwise lockdep will 
> >> report the below warning
> >>
> >> [ 1216.705820] BUG: key ffff890183b647d0 has not been registered!
> >> [ 1216.705924] ------------[ cut here ]------------ [ 1216.705972]
> >> DEBUG_LOCKS_WARN_ON(1) [ 1216.705997] WARNING: CPU: 20 PID: 541 at
> >> kernel/locking/lockdep.c:3743 lockdep_init_map+0x150/0x210
> >>
> >> v3:
> >> change to use down_write_nest_lock to annotate the false dead-lock 
> >> warning.
> >>
> >> Signed-off-by: Dennis Li <Dennis.Li at amd.com>
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> index 62ecac97fbd2..8a55b0bc044a 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> @@ -4145,12 +4145,15 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
> >>      return r;
> >>    }
> >>
> >> -static bool amdgpu_device_lock_adev(struct amdgpu_device *adev)
> >> +static bool amdgpu_device_lock_adev(struct amdgpu_device *adev, 
> >> +struct amdgpu_hive_info *hive)
> >>    {
> >>      if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
> >>              return false;
> >>
> >> -    down_write(&adev->reset_sem);
> >> +    if (hive) {
> >> +            down_write_nest_lock(&adev->reset_sem, &hive->hive_lock);
> >> +    } else
> >> +            down_write(&adev->reset_sem);
> > Coding style nit pick: You should drop the {} here.
> >
> > Apart from that the patch is Reviewed-by: Christian König 
> > <christian.koenig at amd.com>
> >
> >>
> >>      atomic_inc(&adev->gpu_reset_counter);
> >>      switch (amdgpu_asic_reset_method(adev)) { @@ -4312,7 +4315,7 
> >> @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >>
> >>      /* block all schedulers and reset given job's ring */
> >>      list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
> >> -            if (!amdgpu_device_lock_adev(tmp_adev)) {
> >> +            if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
> >>                      DRM_INFO("Bailing on TDR for s_job:%llx, as another already in progress",
> >>                                job ? job->base.id : -1);
> >>                      r = 0;
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx at lists.freedesktop.org
> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fli
> > sts.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%
> > 7CDennis.Li%40amd.com%7Cee5574efd1f2475112a908d83dca5daa%7C3dd8961fe
> > 4884e608e11a82d994e183d%7C0%7C0%7C637327286674286678&sdata=K5Sx4
> > cO3DIQaxNRSo6q%2B8u%2BscGkCpy0ueQ0cOWyipFA%3D&reserved=0
>


--
Daniel Vetter
Software Engineer, Intel Corporation
https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fblog.ffwll.ch%2F&data=02%7C01%7CDennis.Li%40amd.com%7Cee5574efd1f2475112a908d83dca5daa%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637327286674286678&sdata=7tC0Gakzg7uFgxH6N9pvn5%2B5TCYL3aKK%2FhHTkTyCmtM%3D&reserved=0


More information about the amd-gfx mailing list