[PATCH] drm/amdkfd: Fix lock dependency warning with srcu

Wed Jan 3 18:59:33 UTC 2024

On 2024-01-03 10:56, Philip Yang wrote:
> ======================================================
> WARNING: possible circular locking dependency detected
> 6.5.0-kfd-yangp #2289 Not tainted
> ------------------------------------------------------
> kworker/0:2/996 is trying to acquire lock:
>          (srcu){.+.+}-{0:0}, at: __synchronize_srcu+0x5/0x1a0
>
> but task is already holding lock:
>          ((work_completion)(&svms->deferred_list_work)){+.+.}-{0:0}, at:
> 	process_one_work+0x211/0x560
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 ((work_completion)(&svms->deferred_list_work)){+.+.}-{0:0}:
>          __flush_work+0x88/0x4f0
>          svm_range_list_lock_and_flush_work+0x3d/0x110 [amdgpu]
>          svm_range_set_attr+0xd6/0x14c0 [amdgpu]
>          kfd_ioctl+0x1d1/0x630 [amdgpu]
>          __x64_sys_ioctl+0x88/0xc0
>
> -> #2 (&info->lock#2){+.+.}-{3:3}:
>          __mutex_lock+0x99/0xc70
>          amdgpu_amdkfd_gpuvm_restore_process_bos+0x54/0x740 [amdgpu]
>          restore_process_helper+0x22/0x80 [amdgpu]
>          restore_process_worker+0x2d/0xa0 [amdgpu]
>          process_one_work+0x29b/0x560
>          worker_thread+0x3d/0x3d0
>
> -> #1 ((work_completion)(&(&process->restore_work)->work)){+.+.}-{0:0}:
>          __flush_work+0x88/0x4f0
>          __cancel_work_timer+0x12c/0x1c0
>          kfd_process_notifier_release_internal+0x37/0x1f0 [amdgpu]
>          __mmu_notifier_release+0xad/0x240
>          exit_mmap+0x6a/0x3a0
>          mmput+0x6a/0x120
>          do_exit+0x322/0xb90
>          do_group_exit+0x37/0xa0
>          __x64_sys_exit_group+0x18/0x20
>          do_syscall_64+0x38/0x80
>
> -> #0 (srcu){.+.+}-{0:0}:
>          __lock_acquire+0x1521/0x2510
>          lock_sync+0x5f/0x90
>          __synchronize_srcu+0x4f/0x1a0
>          __mmu_notifier_release+0x128/0x240
>          exit_mmap+0x6a/0x3a0
>          mmput+0x6a/0x120
>          svm_range_deferred_list_work+0x19f/0x350 [amdgpu]
>          process_one_work+0x29b/0x560
>          worker_thread+0x3d/0x3d0
>
> other info that might help us debug this:
> Chain exists of:
>    srcu --> &info->lock#2 --> (work_completion)(&svms->deferred_list_work)
>
> Possible unsafe locking scenario:
>
>          CPU0                    CPU1
>          ----                    ----
>          lock((work_completion)(&svms->deferred_list_work));
>                          lock(&info->lock#2);
> 			lock((work_completion)(&svms->deferred_list_work));
>          sync(srcu);
>
> Signed-off-by: Philip Yang <Philip.Yang at amd.com>

Reviewed-by: Felix Kuehling <felix.kuehling at amd.com>


> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index 2ac0c7788dfc..6c0e6d654a26 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -2340,8 +2340,10 @@ static void svm_range_deferred_list_work(struct work_struct *work)
>   		mutex_unlock(&svms->lock);
>   		mmap_write_unlock(mm);
>   
> -		/* Pairs with mmget in svm_range_add_list_work */
> -		mmput(mm);
> +		/* Pairs with mmget in svm_range_add_list_work. If dropping the
> +		 * last mm refcount, schedule release work to avoid circular locking
> +		 */
> +		mmput_async(mm);
>   
>   		spin_lock(&svms->deferred_list_lock);
>   	}