[PATCH v2] drm/amdkfd: kfd open return failed if device is locked
Kuehling, Felix
Felix.Kuehling at amd.com
Fri Oct 18 19:08:21 UTC 2019
On 2019-10-18 1:36 p.m., Yang, Philip wrote:
> If device is locked for suspend and resume, kfd open should return
> failed -EAGAIN without creating process, otherwise the application exit
> to release the process will hang to wait for resume is done if the suspend
> and resume is stuck somewhere. This is backtrace:
>
> v2: fix processes that were created before suspend/resume got stuck
>
> [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more
> than 120 seconds.
> [Thu Oct 17 16:43:37 2019] Not tainted
> 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
> [Thu Oct 17 16:43:37 2019] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Thu Oct 17 16:43:37 2019] rocminfo D 0 3024 2947
> 0x80000000
> [Thu Oct 17 16:43:37 2019] Call Trace:
> [Thu Oct 17 16:43:37 2019] ? __schedule+0x3d9/0x8a0
> [Thu Oct 17 16:43:37 2019] schedule+0x32/0x70
> [Thu Oct 17 16:43:37 2019] schedule_preempt_disabled+0xa/0x10
> [Thu Oct 17 16:43:37 2019] __mutex_lock.isra.9+0x1e3/0x4e0
> [Thu Oct 17 16:43:37 2019] ? __call_srcu+0x264/0x3b0
> [Thu Oct 17 16:43:37 2019] ? process_termination_cpsch+0x24/0x2f0
> [amdgpu]
> [Thu Oct 17 16:43:37 2019] process_termination_cpsch+0x24/0x2f0
> [amdgpu]
> [Thu Oct 17 16:43:37 2019]
> kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
> [Thu Oct 17 16:43:37 2019] kfd_process_notifier_release+0x1be/0x220
> [amdgpu]
> [Thu Oct 17 16:43:37 2019] __mmu_notifier_release+0x3e/0xc0
> [Thu Oct 17 16:43:37 2019] exit_mmap+0x160/0x1a0
> [Thu Oct 17 16:43:37 2019] ? __handle_mm_fault+0xba3/0x1200
> [Thu Oct 17 16:43:37 2019] ? exit_robust_list+0x5a/0x110
> [Thu Oct 17 16:43:37 2019] mmput+0x4a/0x120
> [Thu Oct 17 16:43:37 2019] do_exit+0x284/0xb20
> [Thu Oct 17 16:43:37 2019] ? handle_mm_fault+0xfa/0x200
> [Thu Oct 17 16:43:37 2019] do_group_exit+0x3a/0xa0
> [Thu Oct 17 16:43:37 2019] __x64_sys_exit_group+0x14/0x20
> [Thu Oct 17 16:43:37 2019] do_syscall_64+0x4f/0x100
> [Thu Oct 17 16:43:37 2019] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Signed-off-by: Philip Yang <Philip.Yang at amd.com>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 6 +++---
> drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 6 ++++++
> 2 files changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> index d9e36dbf13d5..40d75c39f08e 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> @@ -120,13 +120,13 @@ static int kfd_open(struct inode *inode, struct file *filep)
> return -EPERM;
> }
>
> + if (kfd_is_locked())
> + return -EAGAIN;
> +
> process = kfd_create_process(filep);
> if (IS_ERR(process))
> return PTR_ERR(process);
>
> - if (kfd_is_locked())
> - return -EAGAIN;
> -
> dev_dbg(kfd_device, "process %d opened, compat mode (32 bit) - %d\n",
> process->pasid, process->is_32bit_user_mode);
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> index 8509814a6ff0..3784013b92a0 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> @@ -128,6 +128,12 @@ void kfd_process_dequeue_from_all_devices(struct kfd_process *p)
> {
> struct kfd_process_device *pdd;
>
> + /* If suspend/resume got stuck, dqm_lock is hold,
> + * skip process_termination_cpsch to avoid deadlock
> + */
> + if (kfd_is_locked())
> + return;
> +
Holding the DQM lock during reset has caused other problems (lock
dependency issues and deadlocks) and I was thinking about getting rid of
that completely. The intention of holding the DQM lock during reset was
to prevent the device queue manager from accessing the CP hardware while
a reset was in progress. However, I think there are smarter ways to
achieve that. We already get a pre-reset callback (kgd2kfd_pre_reset)
that executes the kgd2kfd_suspend, which suspends processes and stops
DQM through kfd->dqm->ops.stop(kfd->dqm). This should take care of most
of the problem. If there are any places in DQM that try to access the
devices, they should add conditions to not access HW while DQM is
stopped. Then we could avoid holding a lock indefinitely while a reset
is in progress.
The DQM lock is particularly problematic in terms of lock dependencies
because it can be taken in MMU notifiers. We want to avoid taking any
other locks while holding the DQM lock. Holding the DQM lock for a long
time during reset is counterproductive to that objective.
Regards,
Felix
> list_for_each_entry(pdd, &p->per_device_data, per_device_list)
> kfd_process_dequeue_from_device(pdd);
> }
More information about the amd-gfx
mailing list