[PATCH 1/2] drm/amdgpu: Add timeout for sync wait

Deng, Emily Emily.Deng at amd.com
Fri Oct 20 08:52:44 UTC 2023


[AMD Official Use Only - General]

Hi Christian,
     The issue is running a compute hang with a quark and trigger a compute job timeout. For compute, the timeout setting is 60s, but for gfx and sdma, it is 10s.
So, get the timeout from the sched is reasonable for different sched.
    And if wait timeout, it will print error, so won't hint real issues. And even it has real issue, the wait forever is bad user experience, and driver couldn't work anymore.

Emily Deng
Best Wishes



>-----Original Message-----
>From: Christian König <ckoenig.leichtzumerken at gmail.com>
>Sent: Friday, October 20, 2023 3:29 PM
>To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>Subject: Re: [PATCH 1/2] drm/amdgpu: Add timeout for sync wait
>
>Am 20.10.23 um 08:13 schrieb Emily Deng:
>> Issue: Dead heappen during gpu recover, the call sequence as below:
>>
>> amdgpu_device_gpu_recover->amdgpu_amdkfd_pre_reset-
>>flush_delayed_work
>> -> amdgpu_amdkfd_gpuvm_restore_process_bos->amdgpu_sync_wait
>>
>> It is because the amdgpu_sync_wait is waiting for the bad job's fence,
>> and never return, so the recover couldn't continue.
>>
>>
>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 11 +++++++++--
>>   1 file changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
>> index dcd8c066bc1f..6253d6aab7f8 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
>> @@ -406,8 +406,15 @@ int amdgpu_sync_wait(struct amdgpu_sync *sync,
>bool intr)
>>      int i, r;
>>
>>      hash_for_each_safe(sync->fences, i, tmp, e, node) {
>> -            r = dma_fence_wait(e->fence, intr);
>> -            if (r)
>> +            struct drm_sched_fence *s_fence = to_drm_sched_fence(e-
>>fence);
>> +            long timeout = msecs_to_jiffies(10000);
>
>That handling doesn't make much sense. If you need a timeout then you need
>a timeout for the whole function.
>
>Additional to that timeouts often just hide real problems which needs fixing.
>
>So this here needs a much better justification otherwise it's a pretty clear NAK.
>
>Regards,
>Christian.
>
>> +
>> +            if (s_fence)
>> +                    timeout = s_fence->sched->timeout;
>> +
>> +            if (r == 0)
>> +                    r = -ETIMEDOUT;
>> +            if (r < 0)
>>                      return r;
>>
>>              amdgpu_sync_entry_free(e);



More information about the amd-gfx mailing list