[PATCH 1/2] drm/amdgpu: Add timeout for sync wait
Christian König
ckoenig.leichtzumerken at gmail.com
Fri Oct 20 07:29:06 UTC 2023
Am 20.10.23 um 08:13 schrieb Emily Deng:
> Issue: Dead heappen during gpu recover, the call sequence as below:
>
> amdgpu_device_gpu_recover->amdgpu_amdkfd_pre_reset->flush_delayed_work->
> amdgpu_amdkfd_gpuvm_restore_process_bos->amdgpu_sync_wait
>
> It is because the amdgpu_sync_wait is waiting for the bad job's fence, and
> never return, so the recover couldn't continue.
>
>
> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> index dcd8c066bc1f..6253d6aab7f8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> @@ -406,8 +406,15 @@ int amdgpu_sync_wait(struct amdgpu_sync *sync, bool intr)
> int i, r;
>
> hash_for_each_safe(sync->fences, i, tmp, e, node) {
> - r = dma_fence_wait(e->fence, intr);
> - if (r)
> + struct drm_sched_fence *s_fence = to_drm_sched_fence(e->fence);
> + long timeout = msecs_to_jiffies(10000);
That handling doesn't make much sense. If you need a timeout then you
need a timeout for the whole function.
Additional to that timeouts often just hide real problems which needs
fixing.
So this here needs a much better justification otherwise it's a pretty
clear NAK.
Regards,
Christian.
> +
> + if (s_fence)
> + timeout = s_fence->sched->timeout;
> +
> + if (r == 0)
> + r = -ETIMEDOUT;
> + if (r < 0)
> return r;
>
> amdgpu_sync_entry_free(e);
More information about the amd-gfx
mailing list