[PATCH] drm/amdgpu: annotate a false positive recursive locking

Daniel Vetter daniel at ffwll.ch
Fri Aug 7 12:28:39 UTC 2020


On Fri, Aug 7, 2020 at 1:59 PM Li, Dennis <Dennis.Li at amd.com> wrote:
>
> [AMD Public Use]
>
> Hi, Daniel,
>       Thanks for your review, and I also understand your concern. I guess you missed the description of this issue, so I paste it again below and explain why it happens.
>
>       For example, take an XGMI system with 2 GPU devices, each represented by a device entity named adev. Each adev has a separate reset_sem.
>       // device init function
>       void device_init(adev) {
>         ...
>         init_rwsem(&adev->reset_sem);
>               ...
>       }
>
>      An XGMI system has two GPU devices, so the system calls device_init twice. However, the init_rwsem macro has a limitation: it uses a local static lock_class_key to initialize the rw_semaphore, which causes every adev->reset_sem to share the same lock_class_key.
>
>      #define init_rwsem(sem)                            \
>      do {                                               \
>              static struct lock_class_key __key;        \
>                                                         \
>              __init_rwsem((sem), #sem, &__key);         \
>      } while (0)
>
>      When a GPU hangs, we should do GPU recovery for all devices in the hive. Therefore we should lock each adev->reset_sem to protect the GPU from being accessed by other threads during recovery. The detailed recovery sequence is as follows:
>      // Step 1: lock all GPUs first:
>      for each adev of GPU0 and GPU1 {
>            down_write(adev->reset_sem);
>            do_suspend(adev);
>      }
>
>      // Step 2:
>      do_hive_recovery(hive);
>
>      // Step 3: resume and unlock the GPUs
>      for each adev of GPU0 and GPU1 {
>            do_resume(adev);
>            up_write(adev->reset_sem);
>      }
>
>     Every adev->reset_sem shares the same lock_class_key, so lockdep treats them as the same lock class. Therefore in step 1, when locking GPU1, lockdep pops the warning below.
>
>     I had considered your proposal (using _nest_lock()) before. Just as you said, _nest_lock() hides true positive recursive locking, so I gave up on it in the end.
>
>     After reviewing the lockdep code, I found that lockdep supports dynamic keys, so using separate lock_class_keys has no side effects. In fact, init_rwsem also uses a dynamic key. See the call sequences of init_rwsem and lockdep_set_class below:
>    1) init_rwsem -> __init_rwsem -> lockdep_init_map;
>    2) lockdep_set_class -> lockdep_init_map;
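[Editor's illustration: a per-device key along the lines Dennis describes might look roughly like this. The reset_sem_key field name and the struct layout are assumptions for illustration, not taken from the patch.]

```c
/* Sketch: give each adev->reset_sem its own lock_class_key so lockdep
 * files GPU0's and GPU1's semaphores under distinct classes. */
struct amdgpu_device {
	/* ... */
	struct rw_semaphore reset_sem;
	struct lock_class_key reset_sem_key;	/* hypothetical field */
};

void device_init(struct amdgpu_device *adev)
{
	init_rwsem(&adev->reset_sem);
	/* Swap the static __key embedded in init_rwsem() for a
	 * per-instance key; this goes through lockdep_init_map,
	 * matching call sequence 2) above. */
	lockdep_set_class(&adev->reset_sem, &adev->reset_sem_key);
}
```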
>
>     Finally, back to your concern: you may worry that this change means the deadlock below can no longer be detected. In fact, lockdep is still able to detect the issue as a circular locking dependency; there is just no "recursive locking" warning in this case.
>     Thread A: down_write(adev->reset_sem) for GPU 0 -> down_write(adev->reset_sem) for GPU 1 -> ... -> up_write(adev->reset_sem) for GPU 1 -> up_write(adev->reset_sem) for GPU 0
>     Thread B: down_write(adev->reset_sem) for GPU 1 -> down_write(adev->reset_sem) for GPU 0 -> ... -> up_write(adev->reset_sem) for GPU 0 -> up_write(adev->reset_sem) for GPU 1
>
>     And lockdep can still detect recursive locking for this case:
>     Thread A: down_write(adev->reset_sem) for GPU 0 -> ...-> ...-> down_write(adev->reset_sem) for GPU 0

Yeah, I guessed correctly what you're doing. My recommendation to use
down_write_nest_lock still stands. This is for reset only, so a
kernel-wide reset lock really shouldn't hurt. Or make it a lock per
xgmi hive; I'm assuming that information is known somewhere.
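[Editor's illustration: the per-hive alternative could be as simple as the sketch below. One rwsem lives in the hive object instead of one per device, so there is nothing to nest at all. Structure and field names are assumptions.]

```c
/* Sketch: a single reset_sem per XGMI hive guards the reset of every
 * device in that hive; lockdep sees exactly one lock acquisition. */
struct amdgpu_hive_info {
	/* ... */
	struct rw_semaphore reset_sem;	/* hypothetical field */
};

/* hive reset path: */
down_write(&hive->reset_sem);
do_hive_recovery(hive);
up_write(&hive->reset_sem);
```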

The problem with manual lockdep annotations is that they increase
complexity. You have to keep all the annotations in mind, including
their justifications and which cases they still catch and which they
don't. And there's zero performance justification for a gpu
reset path to create some fancy lockdep special cases.

Locking needs to be
1. maintainable, i.e. every time you need to write a multi-paragraph
explanation like the above it's probably not. This obviously includes
correctness, but it's even more important that people can easily
understand what you're doing.
2. fast enough, where it matters. gpu reset just doesn't.

> [  584.110304] ============================================
> [  584.110590] WARNING: possible recursive locking detected
> [  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
> [  584.111164] --------------------------------------------
> [  584.111456] kworker/38:1/553 is trying to acquire lock:
> [  584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> [  584.112112]
>                but task is already holding lock:
> [  584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> [  584.113068]
>                other info that might help us debug this:
> [  584.113689]  Possible unsafe locking scenario:
>
> [  584.114350]        CPU0
> [  584.114685]        ----
> [  584.115014]   lock(&adev->reset_sem);
> [  584.115349]   lock(&adev->reset_sem);
> [  584.115678]
>                 *** DEADLOCK ***
>
> [  584.116624]  May be due to missing lock nesting notation
>
> [  584.117284] 4 locks held by kworker/38:1/553:
> [  584.117616]  #0: ffff9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630
> [  584.117967]  #1: ffffac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630
> [  584.118358]  #2: ffffffffc1c2a5d0 (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu]
> [  584.118786]  #3: ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]

Uh, so this means this rwsem is in the critical path for dma-fence
signalling. Please make sure this is all correctly primed at driver
load (like we already do for dma_resv/fence and drm_modeset_lock) when
lockdep is enabled. Otherwise it's pretty much guaranteed you'll get
this wrong somewhere.
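[Editor's illustration: such init-time priming could look roughly like this. The function name and placement are assumptions; dma_fence_begin_signalling()/dma_fence_end_signalling() are the dma-fence critical-section annotations referred to.]

```c
/* Sketch: take reset_sem once at driver load and mark a dma-fence
 * signalling critical section inside it, so lockdep records the
 * "reset_sem held around fence signalling" ordering before the first
 * real GPU reset ever happens. */
static void amdgpu_reset_sem_prime(struct amdgpu_device *adev)
{
	bool cookie;

	if (!IS_ENABLED(CONFIG_LOCKDEP))
		return;

	down_write(&adev->reset_sem);
	cookie = dma_fence_begin_signalling();
	dma_fence_end_signalling(cookie);
	up_write(&adev->reset_sem);
}
```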
-Daniel

>
> Best Regards
> Dennis Li
> -----Original Message-----
> From: Daniel Vetter <daniel at ffwll.ch>
> Sent: Friday, August 7, 2020 5:45 PM
> To: Koenig, Christian <Christian.Koenig at amd.com>
> Cc: Li, Dennis <Dennis.Li at amd.com>; amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>; Kuehling, Felix <Felix.Kuehling at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>
> Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking
>
> On Fri, Aug 7, 2020 at 11:34 AM Christian König <christian.koenig at amd.com> wrote:
> >
> > Am 07.08.20 um 11:23 schrieb Daniel Vetter:
> > > On Fri, Aug 7, 2020 at 11:20 AM Daniel Vetter <daniel at ffwll.ch> wrote:
> > >> On Fri, Aug 7, 2020 at 11:08 AM Christian König
> > >> <christian.koenig at amd.com> wrote:
> > >>> [SNIP]
> > >>>>>>> What we should do instead is to make sure we have only a single lock for the complete hive instead.
> > >>>>>>> [Dennis Li] If we use a single lock, users must wait for all devices to resume successfully, even though their tasks are in fact only running on device A. It is not friendly to users.
> > >>>>>> Well that is intentional I would say. We can only restart submissions after all devices have resumed successfully, because otherwise it can happen that a job on device A depends on memory on device B which isn't accessible.
> > >>>>>> [Dennis Li] Yes, you are right. The driver must make sure that the shared resources (such as shared memory) are ready before unlocking each adev's lock one by one. But each device still has private hardware resources such as video and display.
> > >>>>> Yeah, but those are negligible and we have a team working on display support for XGMI. So this will sooner or later become a problem as well.
> > >>>>>
> > >>>>> I still think that a single rwsem for the whole hive is still the best option here.
> > >>>>>
> > >>>>> [Dennis Li] For your above concern, we can open a new thread to discuss it. If we decide to use a single lock after that discussion, we can create another patch for it.
> > >>>> That's a really good argument, but I still hesitate to merge this patch.
> > >>>> How severe is the lockdep splat?
> > >>>>
> > >>>> At bare minimum we need a "/* TODO: ...." comment why we do this and how to remove the workaround again.
> > >>>> [Dennis Li] It is not a workaround. According to the design of lockdep, each rw_semaphore should have a separate lock_class_key. I have explained that init_rwsem has the above limitation, so we must correct it.
> > >>> Yeah, but this explanation is not really sufficient. Again, this is
> > >>> not a limitation of init_rwsem, but intentional behavior.
> > >>>
> > >>> In other words in most cases lockdep should splat if you use rw
> > >>> semaphores like this.
> > >>>
> > >>> That we avoid this by always locking multiple amdgpu devices in
> > >>> the same order is rather questionable driver design.
> > >> Yeah don't give multiple locking classes like that, that's fairly
> > >> terrible.
> >
> > Ok, that is exactly the answer I feared we would get.
> >
> > >> The right fix is mutex_lock_nest_lock, which we're using in
> > >> ww_mutex and which is also used in mm_take_all_locks.
> >
> > Ah! Enlightenment! So in this case we should use down_write_nested()
> > in this special space and all is well?
>
> Nope. There's two kinds of nesting that lockdep supports:
> 1. _nested() variants, where you supply an integer parameter specifying the lockdep subclass. That's like setting a locking subclass at lock creation time, except much worse, because you change the lockdep class at runtime. This can lead to stuff like:
>
> struct mutex A;
> mutex_init(&A);
> mutex_lock(&A);
> mutex_lock_nested(&A, SINGLE_DEPTH_NESTING);
>
> which is a very clear deadlock, but lockdep will think it's all fine.
> Don't use that.
>
> 2. _nest_lock() variants, where you specify a super-lock, which makes sure that all sub-locks cannot be acquired concurrently (or you promise there's some other trick to avoid deadlocks). Lockdep checks that you've acquired the super-lock, and then allows arbitrary nesting.
> This is what ww_mutex uses, and what mm_take_all_locks uses. If the super-lock is an actual mutex, then this is actually deadlock free.
> With ww_mutex it's a fake lockdep_map in the ww_acquire_ctx, but we guarantee deadlock-freeness through the ww algorithm. The fake super lock for ww_mutex is acquired when you get the ticket.
>
> No 2 is the thing you want, not no 1.
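[Editor's illustration: a minimal shape of option 2, with purely illustrative names. The super-lock serialises all takers, so lockdep permits grabbing any number of same-class sub-locks while it is held, as in mm_take_all_locks().]

```c
/* Sketch of the _nest_lock() pattern: hold one super-lock, then
 * lockdep accepts arbitrarily many same-class sub-locks under it. */
static DEFINE_MUTEX(super_lock);

static void take_all(struct mutex *sub, int n)
{
	int i;

	mutex_lock(&super_lock);
	for (i = 0; i < n; i++)
		mutex_lock_nest_lock(&sub[i], &super_lock);
	/* ... critical section over all sub-locks ... */
}
```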
>
> > >>
> > >> If you solve the locking ordering by sorting all the locks you need
> > >> (ugh), then you can also just use a fake lockdep_map for your
> > >> special function only.
> > >>
> > >> Also in general the idea of an rwsem in conjunction with xgmi cross
> > >> device memory handling just plain freaks me out :-) Please make
> > >> sure you prime this lock against dma_resv and dma_fence (and maybe
> > >> also drm_modeset_lock if this lock is outside of dma_resv), and
> > >> ideally this should only ever be used for setup stuff when loading
> > >> drivers. I guess that's why you went with a per-device rwsem. If
> > >> it's really only for setup then just do the simple thing of having
> > >> an xgmi_hive_reconfigure lock, which all the rwsems nest within,
> > >> and no fancy lock ordering or other tricks.
> >
> > Well this is for the group reset of all devices in a hive, and you need
> > to reboot to change the order the devices are in the list.
>
> Ok, that sounds safe enough. I'd go with a xgmi_hive_lock then, and down_write_nest_lock(&adev->xgmi_rwsem, &xgmi_hive_lock); Taking the global lock for device reset shouldn't be a problem, we're not going to reset multiple devices in parallel anyway. Or at least I hope that use-case isn't a perf-critical benchmark for you guys :-)
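[Editor's illustration: spelled out, that suggestion might look like the sketch below. The xgmi_hive_lock name comes from the mail; the list iteration and field names are assumptions based on how the hive tracks its devices.]

```c
/* Sketch of down_write_nest_lock(): a global hive mutex acts as the
 * nest lock for each device's rwsem, so lockdep accepts locking every
 * device in the hive without a recursive-locking splat. */
static DEFINE_MUTEX(xgmi_hive_lock);

mutex_lock(&xgmi_hive_lock);
list_for_each_entry(adev, &hive->device_list, gmc.xgmi.head)
	down_write_nest_lock(&adev->reset_sem, &xgmi_hive_lock);
/* ... do_hive_recovery(hive); then up_write() each sem and
 * mutex_unlock(&xgmi_hive_lock) ... */
```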
>
> Cheers, Daniel
>
> > Regards,
> > Christian.
> >
> > >>
> > >>>>    The core network driver (net/core/dev.c) has a similar use case to ours.
> > >>> Just took a look at that, but this is completely different use
> > >>> case. The lockdep classes are grouped by driver type to note the
> > >>> difference in the network stack.
> > >>>
> > >>> A quick search showed that no other device driver is doing this in
> > >>> the kernel, so I'm not sure such behavior wouldn't be rejected.
> > >>> Adding Daniel to comment on this as well.
> > >>>
> > >>> Felix had some really good arguments for making that an urgent fix,
> > >>> so either we come up with some real rework of this or we at least
> > >>> add a big
> > >>> "/* TODO....*/".
> > >> Aside from "I have questions" I don't think there's any reason to
> > >> not fix this correctly with mutex_lock_nest_lock. It really
> > >> shouldn't be a semantic change from what I understand here.
> > > Also one more with my drm maintainer hat on, as a heads-up: Dave&me
> > > looked at i915 gem code a bit too much, and we're appalled at the
> > > massive over-use of lockdep class hacks and dubious nesting
> > > annotations. Expect crack down on anyone who tries to play clever
> > > tricks here, we have a few too many already :-)
> > >
> > > So no mutex_lock_nested (totally different beast from
> > > mutex_lock_nest_lock, and really unsafe since it's a runtime change
> > > of the lockdep class) and also not any lockdep_set_class trickery or
> > > anything like that. We need locking hierarchies which mere humans
> > > can comprehend and review.
> > >
> > > Cheers, Daniel
> > >
> > >> Cheers, Daniel
> > >>
> > >>> Regards,
> > >>> Christian.
> > >>>
> > >>>> Regards,
> > >>>> Christian.
> > >>>>
> > >>
> > >> --
> > >> Daniel Vetter
> > >> Software Engineer, Intel Corporation
> > >> http://blog.ffwll.ch
> > >
> > >
> >
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

