[PATCH 2/2] drm/amdgpu: Add timeout for sync wait

Tue Oct 24 06:18:46 UTC 2023

On 20.10.23 at 11:59, Emily Deng wrote:
> Issue: A deadlock happens during GPU recovery, with the call sequence below:
>
> amdgpu_device_gpu_recover->amdgpu_amdkfd_pre_reset->flush_delayed_work->
> amdgpu_amdkfd_gpuvm_restore_process_bos->amdgpu_sync_wait

Resolving a deadlock with a timeout is illegal in general. So this patch 
here is an obvious no-go.

In addition to this problem, Xinhu's investigation already showed that 
the delayed work causes issues during suspend, because flushing doesn't 
guarantee that a new one isn't started right afterwards.

After talking with Felix about this, the correct solution is to stop 
flushing the delayed work and instead submit it to the freezable 
work queue.
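
As a rough illustration of why a freezable queue avoids the flush race, 
here is a user-space toy model. It is not kernel code: every toy_* name 
is hypothetical and exists only for this sketch. In the kernel the idea 
would presumably map to queue_delayed_work() on system_freezable_wq, but 
that mapping is an assumption and not something spelled out here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy user-space model of a freezable work queue (hypothetical names). */
#define TOY_WQ_MAX 8

typedef void (*toy_work_fn)(void *data);

struct toy_wq {
	bool frozen;                 /* true while "suspend" is in progress */
	size_t pending;              /* number of queued, not yet run items */
	toy_work_fn fn[TOY_WQ_MAX];
	void *data[TOY_WQ_MAX];
};

/* Queue one item. Queuing is always allowed, even while frozen. */
bool toy_wq_queue(struct toy_wq *wq, toy_work_fn fn, void *data)
{
	if (wq->pending >= TOY_WQ_MAX)
		return false;
	wq->fn[wq->pending] = fn;
	wq->data[wq->pending] = data;
	wq->pending++;
	return true;
}

void toy_wq_freeze(struct toy_wq *wq) { wq->frozen = true; }
void toy_wq_thaw(struct toy_wq *wq)   { wq->frozen = false; }

/*
 * Run all pending items and return how many ran. While frozen nothing
 * runs and the items stay parked, so there is nothing to flush and no
 * window in which freshly queued work can start. Items in this toy
 * must not requeue themselves from inside their callback.
 */
size_t toy_wq_run(struct toy_wq *wq)
{
	size_t i, n;

	if (wq->frozen)
		return 0;
	n = wq->pending;
	for (i = 0; i < n; i++)
		wq->fn[i](wq->data[i]);
	wq->pending = 0;
	return n;
}

/* Example work item: stands in for the restore worker, counts its runs. */
void toy_restore(void *data)
{
	(*(int *)data)++;
}
```

In this model, work queued before or during the freeze is only parked. 
It runs once toy_wq_thaw() has been called, which is the property that 
flushing alone cannot provide.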

Regards,
Christian.

>
> This is because amdgpu_sync_wait is waiting for the bad job's fence and
> never returns, so the recovery cannot continue.
>
> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 11 +++++++++--
>   1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> index dcd8c066bc1f..9d4f122a7bf0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> @@ -406,8 +406,15 @@ int amdgpu_sync_wait(struct amdgpu_sync *sync, bool intr)
>   	int i, r;
>   
>   	hash_for_each_safe(sync->fences, i, tmp, e, node) {
> -		r = dma_fence_wait(e->fence, intr);
> -		if (r)
> +		struct drm_sched_fence *s_fence = to_drm_sched_fence(e->fence);
> +		long timeout = msecs_to_jiffies(10000);
> +
> +		if (s_fence)
> +			timeout = s_fence->sched->timeout;
> +		r = dma_fence_wait_timeout(e->fence, intr, timeout);
> +		if (r == 0)
> +			r = -ETIMEDOUT;
> +		if (r < 0)
>   			return r;
>   
>   		amdgpu_sync_entry_free(e);