[PATCH 1/2] drm/amdgpu: Add timeout for sync wait

Fri Oct 20 19:47:07 UTC 2023

On 2023-10-20 09:10, Christian König wrote:
> No, the wait forever is what is expected and perfectly valid user 
> experience.
>
> Waiting with a timeout on the other hand sounds like a really bad idea 
> to me.
>
> Every wait with a timeout needs a justification, e.g. for example that 
> userspace explicitly specified it. And I absolutely don't see that here.
In this case the wait is in a kernel worker thread, and the wait is not 
interruptible. Not having a timeout means, you can have a kernel worker 
stuck forever. The restore worker also has retry logic already, so it 
can handle a timeout perfectly well. But maybe this shouldn't be done 
automatically for all callers of amdgpu_sync_wait, but only for this 
particular caller in the restore_process_worker. So we'd need to add a 
timeout parameter to amdgpu_sync_wait.

Regards,
   Felix


>
> Regards,
> Christian.
>
> Am 20.10.23 um 10:52 schrieb Deng, Emily:
>> [AMD Official Use Only - General]
>>
>> Hi Christian,
>>       The issue is running a compute hang with a quark and trigger a 
>> compute job timeout. For compute, the timeout setting is 60s, but for 
>> gfx and sdma, it is 10s.
>> So, get the timeout from the sched is reasonable for different sched.
>>      And if wait timeout, it will print error, so won't hint real 
>> issues. And even it has real issue, the wait forever is bad user 
>> experience, and driver couldn't work anymore.
>>
>> Emily Deng
>> Best Wishes
>>
>>
>>
>>> -----Original Message-----
>>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>>> Sent: Friday, October 20, 2023 3:29 PM
>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>> Subject: Re: [PATCH 1/2] drm/amdgpu: Add timeout for sync wait
>>>
>>> Am 20.10.23 um 08:13 schrieb Emily Deng:
>>>> Issue: Dead heappen during gpu recover, the call sequence as below:
>>>>
>>>> amdgpu_device_gpu_recover->amdgpu_amdkfd_pre_reset-
>>>> flush_delayed_work
>>>> -> amdgpu_amdkfd_gpuvm_restore_process_bos->amdgpu_sync_wait
>>>>
>>>> It is because the amdgpu_sync_wait is waiting for the bad job's fence,
>>>> and never return, so the recover couldn't continue.
>>>>
>>>>
>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 11 +++++++++--
>>>>    1 file changed, 9 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
>>>> index dcd8c066bc1f..6253d6aab7f8 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
>>>> @@ -406,8 +406,15 @@ int amdgpu_sync_wait(struct amdgpu_sync *sync,
>>> bool intr)
>>>>       int i, r;
>>>>
>>>>       hash_for_each_safe(sync->fences, i, tmp, e, node) {
>>>> -            r = dma_fence_wait(e->fence, intr);
>>>> -            if (r)
>>>> +            struct drm_sched_fence *s_fence = to_drm_sched_fence(e-
>>>> fence);
>>>> +            long timeout = msecs_to_jiffies(10000);
>>> That handling doesn't make much sense. If you need a timeout then 
>>> you need
>>> a timeout for the whole function.
>>>
>>> Additional to that timeouts often just hide real problems which 
>>> needs fixing.
>>>
>>> So this here needs a much better justification otherwise it's a 
>>> pretty clear NAK.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> +
>>>> +            if (s_fence)
>>>> +                    timeout = s_fence->sched->timeout;
>>>> +
>>>> +            if (r == 0)
>>>> +                    r = -ETIMEDOUT;
>>>> +            if (r < 0)
>>>>                       return r;
>>>>
>>>>               amdgpu_sync_entry_free(e);
>