<div class="moz-cite-prefix">On 2024-08-28 18:01, Felix Kuehling
wrote:<br>
</div>
<blockquote type="cite" cite="mid:4d14806d-c6c1-4846-8ac3-459140d7625f@amd.com">
<br>
On 2024-08-23 15:49, Philip Yang wrote:
<br>
<blockquote type="cite">If GPU reset kick in while KFD
restore_process_worker running, this may
<br>
causes different issues, for example below rcu stall warning,
because
<br>
restore work may move BOs and evict queues under VRAM pressure.
<br>
<br>
Fix this race by taking adev reset_domain read semaphore to
prevent GPU
<br>
reset in restore_process_worker, the reset read semaphore can be
taken
<br>
recursively if adev have multiple partitions.
<br>
</blockquote>
>
> Are you sure that an rw_sem can be read-locked recursively in the
> same thread? I can't find any evidence that this is true.

Yes, down_read_trylock(&adev->reset_domain->sem) still returns 1
(success) when called while the same thread already holds the semaphore
from an earlier down_read_trylock(&adev->reset_domain->sem).

This already works in many paths today: for example,
execute_queues_cpsch takes the reset_domain->sem read lock and then
calls unmap_queues_cpsch, which takes the reset_domain->sem read lock
again. A minimal sketch of the pattern follows below.
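
For illustration only, here is a stripped-down sketch of that nested
trylock pattern (outer_path() and inner_path() are made-up stand-ins
for execute_queues_cpsch and unmap_queues_cpsch, not the actual driver
code):

#include "amdgpu.h"
#include "amdgpu_reset.h"

static int inner_path(struct amdgpu_device *adev)
{
	/* Succeeds while the outer read hold is still active, as long
	 * as no writer (GPU reset) queued up in between; a trylock
	 * never sleeps, it just returns 0 on failure.
	 */
	if (!down_read_trylock(&adev->reset_domain->sem))
		return -EIO;

	/* ... work that must not race with GPU reset ... */

	up_read(&adev->reset_domain->sem);
	return 0;
}

static int outer_path(struct amdgpu_device *adev)
{
	int ret;

	if (!down_read_trylock(&adev->reset_domain->sem))
		return -EIO;	/* reset in progress, bail out */

	ret = inner_path(adev);	/* read-locks the same rwsem again */

	up_read(&adev->reset_domain->sem);
	return ret;
}

The key point is that the failure mode is graceful: if a reset (writer)
is pending, the trylock returns 0 instead of deadlocking, and the
caller bails out.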

Regards,
Philip
<blockquote type="cite" cite="mid:4d14806d-c6c1-4846-8ac3-459140d7625f@amd.com">
<br>
Regards,
<br>
Felix
<br>
<br>
<br>
<blockquote type="cite">
<br>
Then there is live locking issue if CP hangs while
<br>
restore_process_worker runs, then GPU reset wait for semaphore
to start
<br>
and restore_process_worker cannot finish to release semaphore.
We need
<br>
signal eviction fence to solve the live locking if evict queue
return
<br>
-ETIMEOUT (for MES path) or -ETIME (for HWS path) because CP
hangs,
<br>
>>
>>  amdgpu 0000:af:00.0: amdgpu: GPU reset(21) succeeded!
>>  rcu: INFO: rcu_sched self-detected stall on CPU
>>
>>  Workqueue: kfd_restore_wq restore_process_worker [amdgpu]
>>  Call Trace:
>>   update_process_times+0x94/0xd0
>>  RIP: 0010:amdgpu_vm_handle_moved+0x9a/0x210 [amdgpu]
>>   amdgpu_amdkfd_gpuvm_restore_process_bos+0x3d6/0x7d0 [amdgpu]
>>   restore_process_helper+0x27/0x80 [amdgpu]
>>
>> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdkfd/kfd_process.c | 56 +++++++++++++++++++++++-
>>  1 file changed, 55 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> index a902950cc060..53a814347522 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> @@ -35,6 +35,7 @@
>>  #include <linux/pm_runtime.h>
>>  #include "amdgpu_amdkfd.h"
>>  #include "amdgpu.h"
>> +#include "amdgpu_reset.h"
>>
>>  struct mm_struct;
>>
>> @@ -1972,8 +1973,14 @@ static void evict_process_worker(struct work_struct *work)
>>  		kfd_process_restore_queues(p);
>>  		pr_debug("Finished evicting pasid 0x%x\n", p->pasid);
>> -	} else
>> +	} else if (ret == -ETIMEDOUT || ret == -ETIME) {
>> +		/* If CP hangs, signal the eviction fence, then restore_bo_worker
>> +		 * can finish to up_read GPU reset semaphore to start GPU reset.
>> +		 */
>> +		signal_eviction_fence(p);
>> +	} else {
>>  		pr_err("Failed to evict queues of pasid 0x%x\n", p->pasid);
>> +	}
>>  }
>>
>>  static int restore_process_helper(struct kfd_process *p)
>> @@ -1997,6 +2004,45 @@ static int restore_process_helper(struct kfd_process *p)
>>  	return ret;
>>  }
>>
>> +/*
>> + * kfd_hold_devices_reset_semaphore
>> + *
>> + * return:
>> + * true : hold reset domain semaphore to prevent device reset
>> + * false: one of the devices is resetting or already reset
>> + *
>> + */
>> +static bool kfd_hold_devices_reset_semaphore(struct kfd_process *p)
>> +{
>> +	struct amdgpu_device *adev;
>> +	int i;
>> +
>> +	for (i = 0; i < p->n_pdds; i++) {
>> +		adev = p->pdds[i]->dev->adev;
>> +		if (!down_read_trylock(&adev->reset_domain->sem))
>> +			goto out_upread;
>> +	}
>> +	return true;
>> +
>> +out_upread:
>> +	while (i--) {
>> +		adev = p->pdds[i]->dev->adev;
>> +		up_read(&adev->reset_domain->sem);
>> +	}
>> +	return false;
>> +}
>> +
>> +static void kfd_unhold_devices_reset_semaphore(struct kfd_process *p)
>> +{
>> +	struct amdgpu_device *adev;
>> +	int i;
>> +
>> +	for (i = 0; i < p->n_pdds; i++) {
>> +		adev = p->pdds[i]->dev->adev;
>> +		up_read(&adev->reset_domain->sem);
>> +	}
>> +}
>> +
>>  static void restore_process_worker(struct work_struct *work)
>>  {
>>  	struct delayed_work *dwork;
>> @@ -2009,6 +2055,12 @@ static void restore_process_worker(struct work_struct *work)
>>  	 * lifetime of this thread, kfd_process p will be valid
>>  	 */
>>  	p = container_of(dwork, struct kfd_process, restore_work);
>> +
>> +	if (!kfd_hold_devices_reset_semaphore(p)) {
>> +		pr_debug("GPU resetting, restore bo and queue skipped\n");
>> +		return;
>> +	}
>> +
>>  	pr_debug("Started restoring pasid 0x%x\n", p->pasid);
>>
>>  	/* Setting last_restore_timestamp before successful restoration.
>> @@ -2031,6 +2083,8 @@ static void restore_process_worker(struct work_struct *work)
>>  				msecs_to_jiffies(PROCESS_RESTORE_TIME_MS)))
>>  			kfd_process_restore_queues(p);
>>  	}
>> +
>> +	kfd_unhold_devices_reset_semaphore(p);
>>  }
>>
>>  void kfd_suspend_all_processes(void)