[PATCH v3] drm/amdgpu: annotate a false positive recursive locking

Tue Aug 11 06:55:28 UTC 2020

On Tue, Aug 11, 2020 at 4:12 AM Dennis Li <Dennis.Li at amd.com> wrote:
>
> [  584.110304] ============================================
> [  584.110590] WARNING: possible recursive locking detected
> [  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
> [  584.111164] --------------------------------------------
> [  584.111456] kworker/38:1/553 is trying to acquire lock:
> [  584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> [  584.112112]
>                but task is already holding lock:
> [  584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> [  584.113068]
>                other info that might help us debug this:
> [  584.113689]  Possible unsafe locking scenario:
>
> [  584.114350]        CPU0
> [  584.114685]        ----
> [  584.115014]   lock(&adev->reset_sem);
> [  584.115349]   lock(&adev->reset_sem);
> [  584.115678]
>                 *** DEADLOCK ***
>
> [  584.116624]  May be due to missing lock nesting notation
>
> [  584.117284] 4 locks held by kworker/38:1/553:
> [  584.117616]  #0: ffff9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630
> [  584.117967]  #1: ffffac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630
> [  584.118358]  #2: ffffffffc1c2a5d0 (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu]
> [  584.118786]  #3: ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> [  584.119222]
>                stack backtrace:
> [  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G           OE     5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
> [  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
> [  584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
> [  584.121638] Call Trace:
> [  584.122050]  dump_stack+0x98/0xd5
> [  584.122499]  __lock_acquire+0x1139/0x16e0
> [  584.122931]  ? trace_hardirqs_on+0x3b/0xf0
> [  584.123358]  ? cancel_delayed_work+0xa6/0xc0
> [  584.123771]  lock_acquire+0xb8/0x1c0
> [  584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> [  584.124599]  down_write+0x49/0x120
> [  584.125032]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> [  584.125472]  amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> [  584.125910]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu]
> [  584.126367]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu]
> [  584.126789]  process_one_work+0x29e/0x630
> [  584.127208]  worker_thread+0x3c/0x3f0
> [  584.127621]  ? __kthread_parkme+0x61/0x90
> [  584.128014]  kthread+0x12f/0x150
> [  584.128402]  ? process_one_work+0x630/0x630
> [  584.128790]  ? kthread_park+0x90/0x90
> [  584.129174]  ret_from_fork+0x3a/0x50
>
> Each adev has owned lock_class_key to avoid false positive
> recursive locking.
>
> v2:
> 1. register adev->lock_key into lockdep, otherwise lockdep will
> report the below warning
>
> [ 1216.705820] BUG: key ffff890183b647d0 has not been registered!
> [ 1216.705924] ------------[ cut here ]------------
> [ 1216.705972] DEBUG_LOCKS_WARN_ON(1)
> [ 1216.705997] WARNING: CPU: 20 PID: 541 at kernel/locking/lockdep.c:3743 lockdep_init_map+0x150/0x210
>
> v3:
> change to use down_write_nest_lock to annotate the false dead-lock
> warning.
>
> Signed-off-by: Dennis Li <Dennis.Li at amd.com>

I'd explain a bit more in writing what the actual deadlock possibility
is, but up to you folks.

Reviewed-by: Daniel Vetter <daniel.vetter at ffwll.ch>

Cheers, Daniel
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 62ecac97fbd2..8a55b0bc044a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4145,12 +4145,15 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
>         return r;
>  }
>
> -static bool amdgpu_device_lock_adev(struct amdgpu_device *adev)
> +static bool amdgpu_device_lock_adev(struct amdgpu_device *adev, struct amdgpu_hive_info *hive)
>  {
>         if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
>                 return false;
>
> -       down_write(&adev->reset_sem);
> +       if (hive) {
> +               down_write_nest_lock(&adev->reset_sem, &hive->hive_lock);
> +       } else
> +               down_write(&adev->reset_sem);
>
>         atomic_inc(&adev->gpu_reset_counter);
>         switch (amdgpu_asic_reset_method(adev)) {
> @@ -4312,7 +4315,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>
>         /* block all schedulers and reset given job's ring */
>         list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
> -               if (!amdgpu_device_lock_adev(tmp_adev)) {
> +               if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
>                         DRM_INFO("Bailing on TDR for s_job:%llx, as another already in progress",
>                                   job ? job->base.id : -1);
>                         r = 0;
> --
> 2.17.1
>

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch