[PATCH v2] drm/amdkfd: kfd open return failed if device is locked

Fri Oct 18 19:08:21 UTC 2019

On 2019-10-18 1:36 p.m., Yang, Philip wrote:
> If device is locked for suspend and resume, kfd open should return
> failed -EAGAIN without creating process, otherwise the application exit
> to release the process will hang to wait for resume is done if the suspend
> and resume is stuck somewhere. This is backtrace:
>
> v2: fix processes that were created before suspend/resume got stuck
>
> [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more
> than 120 seconds.
> [Thu Oct 17 16:43:37 2019]       Not tainted
> 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
> [Thu Oct 17 16:43:37 2019] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Thu Oct 17 16:43:37 2019] rocminfo        D    0  3024   2947
> 0x80000000
> [Thu Oct 17 16:43:37 2019] Call Trace:
> [Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0
> [Thu Oct 17 16:43:37 2019]  schedule+0x32/0x70
> [Thu Oct 17 16:43:37 2019]  schedule_preempt_disabled+0xa/0x10
> [Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0
> [Thu Oct 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0
> [Thu Oct 17 16:43:37 2019]  ? process_termination_cpsch+0x24/0x2f0
> [amdgpu]
> [Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0
> [amdgpu]
> [Thu Oct 17 16:43:37 2019]
> kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
> [Thu Oct 17 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220
> [amdgpu]
> [Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0
> [Thu Oct 17 16:43:37 2019]  exit_mmap+0x160/0x1a0
> [Thu Oct 17 16:43:37 2019]  ? __handle_mm_fault+0xba3/0x1200
> [Thu Oct 17 16:43:37 2019]  ? exit_robust_list+0x5a/0x110
> [Thu Oct 17 16:43:37 2019]  mmput+0x4a/0x120
> [Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20
> [Thu Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200
> [Thu Oct 17 16:43:37 2019]  do_group_exit+0x3a/0xa0
> [Thu Oct 17 16:43:37 2019]  __x64_sys_exit_group+0x14/0x20
> [Thu Oct 17 16:43:37 2019]  do_syscall_64+0x4f/0x100
> [Thu Oct 17 16:43:37 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Signed-off-by: Philip Yang <Philip.Yang at amd.com>
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c               | 6 +++---
>   drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 6 ++++++
>   2 files changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> index d9e36dbf13d5..40d75c39f08e 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> @@ -120,13 +120,13 @@ static int kfd_open(struct inode *inode, struct file *filep)
>   		return -EPERM;
>   	}
>   
> +	if (kfd_is_locked())
> +		return -EAGAIN;
> +
>   	process = kfd_create_process(filep);
>   	if (IS_ERR(process))
>   		return PTR_ERR(process);
>   
> -	if (kfd_is_locked())
> -		return -EAGAIN;
> -
>   	dev_dbg(kfd_device, "process %d opened, compat mode (32 bit) - %d\n",
>   		process->pasid, process->is_32bit_user_mode);
>   
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> index 8509814a6ff0..3784013b92a0 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> @@ -128,6 +128,12 @@ void kfd_process_dequeue_from_all_devices(struct kfd_process *p)
>   {
>   	struct kfd_process_device *pdd;
>   
> +	/* If suspend/resume got stuck, dqm_lock is hold,
> +	 * skip process_termination_cpsch to avoid deadlock
> +	 */
> +	if (kfd_is_locked())
> +		return;
> +

Holding the DQM lock during reset has caused other problems (lock 
dependency issues and deadlocks) and I was thinking about getting rid of 
that completely. The intention of holding the DQM lock during reset was 
to prevent the device queue manager from accessing the CP hardware while 
a reset was in progress. However, I think there are smarter ways to 
achieve that. We already get a pre-reset callback (kgd2kfd_pre_reset) 
that executes the kgd2kfd_suspend, which suspends processes and stops 
DQM through kfd->dqm->ops.stop(kfd->dqm). This should take care of most 
of the problem. If there are any places in DQM that try to access the 
devices, they should add conditions to not access HW while DQM is 
stopped. Then we could avoid holding a lock indefinitely while a reset 
is in progress.

The DQM lock is particularly problematic in terms of lock dependencies 
because it can be taken in MMU notifiers. We want to avoid taking any 
other locks while holding the DQM lock. Holding the DQM lock for a long 
time during reset is counterproductive to that objective.

Regards,
   Felix

>   	list_for_each_entry(pdd, &p->per_device_data, per_device_list)
>   		kfd_process_dequeue_from_device(pdd);
>   }