[PATCH 2/2] drm/amdgpu: Add timeout for sync wait
Emily Deng
Emily.Deng at amd.com
Fri Oct 20 09:59:11 UTC 2023
Issue: A hang happens during GPU recovery. The call sequence is as follows:
amdgpu_device_gpu_recover->amdgpu_amdkfd_pre_reset->flush_delayed_work->
amdgpu_amdkfd_gpuvm_restore_process_bos->amdgpu_sync_wait
This is because amdgpu_sync_wait is waiting for the bad job's fence, which
never signals, so the wait never returns and the recovery cannot continue.
Signed-off-by: Emily Deng <Emily.Deng at amd.com>
---
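Note for reviewers: the change below relies on the return convention of
dma_fence_wait_timeout(), which returns a positive value (remaining jiffies)
when the fence signals, 0 when the wait times out, and a negative errno on
error. The following is a minimal userspace sketch of that logic, not driver
code. struct model_sched, MODEL_DEFAULT_TIMEOUT and both model_* helpers are
hypothetical names used only for illustration, and the timeout values are
abstract units rather than real jiffies. Returning 0 on success is an
assumption about what the caller reports once every fence has signaled.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the scheduler that owns a fence. */
struct model_sched {
	long timeout; /* job timeout, abstract units (jiffies in the kernel) */
};

/* Fallback budget when the fence has no scheduler (10 s in the patch). */
#define MODEL_DEFAULT_TIMEOUT 10000L

/* Pick the wait budget: the scheduler's own timeout if there is one. */
static long model_pick_timeout(const struct model_sched *sched)
{
	return sched ? sched->timeout : MODEL_DEFAULT_TIMEOUT;
}

/*
 * Fold a dma_fence_wait_timeout()-style result into an errno:
 *   r > 0   fence signaled in time   -> 0 (assumed success value)
 *   r == 0  wait timed out           -> -ETIMEDOUT
 *   r < 0   error from the wait      -> passed through unchanged
 */
static int model_wait_result_to_errno(long r)
{
	if (r == 0)
		return -ETIMEDOUT;
	if (r < 0)
		return (int)r;
	return 0;
}
```

With this mapping, a fence that belongs to a hung job produces -ETIMEDOUT
after the scheduler's timeout instead of blocking the reset path forever.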
drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
index dcd8c066bc1f..9d4f122a7bf0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
@@ -406,8 +406,15 @@ int amdgpu_sync_wait(struct amdgpu_sync *sync, bool intr)
 	int i, r;
 
 	hash_for_each_safe(sync->fences, i, tmp, e, node) {
-		r = dma_fence_wait(e->fence, intr);
-		if (r)
+		struct drm_sched_fence *s_fence = to_drm_sched_fence(e->fence);
+		long timeout = msecs_to_jiffies(10000);
+
+		if (s_fence)
+			timeout = s_fence->sched->timeout;
+		r = dma_fence_wait_timeout(e->fence, intr, timeout);
+		if (r == 0)
+			r = -ETIMEDOUT;
+		if (r < 0)
 			return r;
 
 		amdgpu_sync_entry_free(e);
--
2.36.1