[PATCH 1/2] drm/amdgpu: Add timeout for sync wait

Mon Oct 23 11:39:30 UTC 2023

Am 20.10.23 um 21:47 schrieb Felix Kuehling:
>
> On 2023-10-20 09:10, Christian König wrote:
>> No, the wait forever is what is expected and perfectly valid user 
>> experience.
>>
>> Waiting with a timeout on the other hand sounds like a really bad 
>> idea to me.
>>
>> Every wait with a timeout needs a justification, e.g. for example 
>> that userspace explicitly specified it. And I absolutely don't see 
>> that here.
> In this case the wait is in a kernel worker thread, and the wait is 
> not interruptible. Not having a timeout means, you can have a kernel 
> worker stuck forever. The restore worker also has retry logic already, 
> so it can handle a timeout perfectly well. But maybe this shouldn't be 
> done automatically for all callers of amdgpu_sync_wait, but only for 
> this particular caller in the restore_process_worker. So we'd need to 
> add a timeout parameter to amdgpu_sync_wait.

Adding a parameter sounds like a good idea to me, but it's mandatory 
that dma_fence operations finish in a reasonable amount of time in the 
first place.

This is even documented by now and basically means we need timeouts in 
the area of 100ms for each operation and not between 10 and 60 seconds.

If upstream starts to taint the kernel for longer timeouts we will need 
to reduce the current values massively.

Regards,
Christian.

>
> Regards,
>   Felix
>
>
>>
>> Regards,
>> Christian.
>>
>> Am 20.10.23 um 10:52 schrieb Deng, Emily:
>>> [AMD Official Use Only - General]
>>>
>>> Hi Christian,
>>>       The issue is running a compute hang with a quark and trigger a 
>>> compute job timeout. For compute, the timeout setting is 60s, but 
>>> for gfx and sdma, it is 10s.
>>> So, get the timeout from the sched is reasonable for different sched.
>>>      And if wait timeout, it will print error, so won't hint real 
>>> issues. And even it has real issue, the wait forever is bad user 
>>> experience, and driver couldn't work anymore.
>>>
>>> Emily Deng
>>> Best Wishes
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>>>> Sent: Friday, October 20, 2023 3:29 PM
>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: Add timeout for sync wait
>>>>
>>>> Am 20.10.23 um 08:13 schrieb Emily Deng:
>>>>> Issue: Dead heappen during gpu recover, the call sequence as below:
>>>>>
>>>>> amdgpu_device_gpu_recover->amdgpu_amdkfd_pre_reset-
>>>>> flush_delayed_work
>>>>> -> amdgpu_amdkfd_gpuvm_restore_process_bos->amdgpu_sync_wait
>>>>>
>>>>> It is because the amdgpu_sync_wait is waiting for the bad job's 
>>>>> fence,
>>>>> and never return, so the recover couldn't continue.
>>>>>
>>>>>
>>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 11 +++++++++--
>>>>>    1 file changed, 9 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
>>>>> index dcd8c066bc1f..6253d6aab7f8 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
>>>>> @@ -406,8 +406,15 @@ int amdgpu_sync_wait(struct amdgpu_sync *sync,
>>>> bool intr)
>>>>>       int i, r;
>>>>>
>>>>>       hash_for_each_safe(sync->fences, i, tmp, e, node) {
>>>>> -            r = dma_fence_wait(e->fence, intr);
>>>>> -            if (r)
>>>>> +            struct drm_sched_fence *s_fence = to_drm_sched_fence(e-
>>>>> fence);
>>>>> +            long timeout = msecs_to_jiffies(10000);
>>>> That handling doesn't make much sense. If you need a timeout then 
>>>> you need
>>>> a timeout for the whole function.
>>>>
>>>> Additional to that timeouts often just hide real problems which 
>>>> needs fixing.
>>>>
>>>> So this here needs a much better justification otherwise it's a 
>>>> pretty clear NAK.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> +
>>>>> +            if (s_fence)
>>>>> +                    timeout = s_fence->sched->timeout;
>>>>> +
>>>>> +            if (r == 0)
>>>>> +                    r = -ETIMEDOUT;
>>>>> +            if (r < 0)
>>>>>                       return r;
>>>>>
>>>>>               amdgpu_sync_entry_free(e);
>>