[PATCH v2] drm/amdkfd: kfd open return failed if device is locked
Yang, Philip
Philip.Yang at amd.com
Fri Oct 18 17:36:27 UTC 2019
If the device is locked for suspend and resume, kfd open should return
-EAGAIN without creating a process. Otherwise, if suspend and resume get
stuck somewhere, the application hangs on exit, because releasing the
process waits for resume to finish.

v2: fix processes that were created before suspend/resume got stuck

This is the backtrace:

[Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more than 120 seconds.
[Thu Oct 17 16:43:37 2019] Not tainted 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
[Thu Oct 17 16:43:37 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Oct 17 16:43:37 2019] rocminfo D 0 3024 2947 0x80000000
[Thu Oct 17 16:43:37 2019] Call Trace:
[Thu Oct 17 16:43:37 2019] ? __schedule+0x3d9/0x8a0
[Thu Oct 17 16:43:37 2019] schedule+0x32/0x70
[Thu Oct 17 16:43:37 2019] schedule_preempt_disabled+0xa/0x10
[Thu Oct 17 16:43:37 2019] __mutex_lock.isra.9+0x1e3/0x4e0
[Thu Oct 17 16:43:37 2019] ? __call_srcu+0x264/0x3b0
[Thu Oct 17 16:43:37 2019] ? process_termination_cpsch+0x24/0x2f0 [amdgpu]
[Thu Oct 17 16:43:37 2019] process_termination_cpsch+0x24/0x2f0 [amdgpu]
[Thu Oct 17 16:43:37 2019] kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
[Thu Oct 17 16:43:37 2019] kfd_process_notifier_release+0x1be/0x220 [amdgpu]
[Thu Oct 17 16:43:37 2019] __mmu_notifier_release+0x3e/0xc0
[Thu Oct 17 16:43:37 2019] exit_mmap+0x160/0x1a0
[Thu Oct 17 16:43:37 2019] ? __handle_mm_fault+0xba3/0x1200
[Thu Oct 17 16:43:37 2019] ? exit_robust_list+0x5a/0x110
[Thu Oct 17 16:43:37 2019] mmput+0x4a/0x120
[Thu Oct 17 16:43:37 2019] do_exit+0x284/0xb20
[Thu Oct 17 16:43:37 2019] ? handle_mm_fault+0xfa/0x200
[Thu Oct 17 16:43:37 2019] do_group_exit+0x3a/0xa0
[Thu Oct 17 16:43:37 2019] __x64_sys_exit_group+0x14/0x20
[Thu Oct 17 16:43:37 2019] do_syscall_64+0x4f/0x100
[Thu Oct 17 16:43:37 2019] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Signed-off-by: Philip Yang <Philip.Yang at amd.com>
---
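Note for user space: with this change, open() on the KFD device node
fails with errno set to EAGAIN while the device is locked. The following
is a minimal sketch, not part of this patch, of one way an application
might handle that with a bounded retry. The helper name
open_retry_eagain, the retry count, and the delay are assumptions made
for illustration. A stuck suspend/resume will not clear by retrying, so
the loop gives up after max_tries attempts.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>	/* close(), for callers of the helper */

/*
 * Hypothetical helper (not an existing API): call open() up to
 * max_tries times, sleeping delay_ms milliseconds after each attempt
 * that fails with EAGAIN. Any other result, success or a different
 * error, is returned to the caller immediately.
 */
static int open_retry_eagain(const char *path, int flags,
			     int max_tries, long delay_ms)
{
	assert(path != NULL);
	assert(max_tries > 0 && delay_ms >= 0);

	for (int i = 0; i < max_tries; i++) {
		int fd = open(path, flags);

		/* Success, or an error that retrying cannot fix. */
		if (fd >= 0 || errno != EAGAIN)
			return fd;

		struct timespec ts = {
			.tv_sec = delay_ms / 1000,
			.tv_nsec = (delay_ms % 1000) * 1000000L,
		};
		nanosleep(&ts, NULL);
	}

	/* Still locked after all attempts: report EAGAIN. */
	errno = EAGAIN;
	return -1;
}
```

For example, open_retry_eagain("/dev/kfd", O_RDWR | O_CLOEXEC, 10, 100)
would keep trying for roughly one second before reporting EAGAIN to the
caller.
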
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 6 +++---
drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 6 ++++++
2 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index d9e36dbf13d5..40d75c39f08e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -120,13 +120,13 @@ static int kfd_open(struct inode *inode, struct file *filep)
return -EPERM;
}
+ if (kfd_is_locked())
+ return -EAGAIN;
+
process = kfd_create_process(filep);
if (IS_ERR(process))
return PTR_ERR(process);
- if (kfd_is_locked())
- return -EAGAIN;
-
dev_dbg(kfd_device, "process %d opened, compat mode (32 bit) - %d\n",
process->pasid, process->is_32bit_user_mode);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
index 8509814a6ff0..3784013b92a0 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
@@ -128,6 +128,12 @@ void kfd_process_dequeue_from_all_devices(struct kfd_process *p)
{
struct kfd_process_device *pdd;
+ /* If suspend/resume got stuck, dqm_lock is held,
+ * skip process_termination_cpsch to avoid deadlock
+ */
+ if (kfd_is_locked())
+ return;
+
list_for_each_entry(pdd, &p->per_device_data, per_device_list)
kfd_process_dequeue_from_device(pdd);
}
--
2.17.1