[PATCH] drm/amdgpu: change the fence ring wait timeout

Christian König christian.koenig at amd.com
Thu Jan 14 13:49:57 UTC 2021


On 14.01.21 at 03:00, Deng, Emily wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
>> Christian König
>> Sent: Wednesday, January 13, 2021 10:03 PM
>> To: Sun, Roy <Roy.Sun at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: change the fence ring wait timeout
>>
>> On 13.01.21 at 07:36, Roy Sun wrote:
>>> This fixes a bug where, when the engine hangs, the fence ring waits
>>> forever and causes a kernel crash.
>> NAK, this blocking is intentionally unlimited because otherwise we will cause
>> memory corruption.
>>
>> What is the actual bug you are trying to fix here?
> When an engine hangs during initialization, the IB test fails and the fence never
> comes back, so the wait for it never finishes. Why do we need to wait here
> forever? We had better not use an unbounded wait, which causes a kernel crash and
> lockup. We also have amdgpu_fence_driver_force_completion(), which lets the memory
> be freed. We should remove all of these unbounded waits, replace them with a
> timeout, and do the correct error handling when the timeout expires.

The wait here makes sure we never overwrite the software fence
ring buffer. Without it we would not signal all fences in
amdgpu_fence_driver_force_completion() and would cause either a memory
leak or memory corruption.

Waiting here forever is the right thing to do, even when that means that
the submission thread gets stuck forever, because that is still better
than memory corruption.
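To make the overwrite concrete, here is a minimal user-space sketch, not
driver code. It assumes the fences live in a fixed array of slots indexed by
sequence number modulo the array size, and it models a hung engine on which
nothing signals by itself. All toy_* names and the sizes are hypothetical and
exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NUM_SLOTS 4u                    /* power of two, hypothetical size */
#define TOY_SLOT_MASK (TOY_NUM_SLOTS - 1u)
#define TOY_MAX_FENCES 8u

struct toy_fence {
	unsigned int seq;
	bool signaled;
};

struct toy_ring {
	struct toy_fence *slots[TOY_NUM_SLOTS]; /* software fence ring buffer */
	unsigned int last_seq;                  /* last sequence number emitted */
};

/*
 * Emit a fence into the slot selected by its sequence number.
 * If the slot still holds an unsignaled fence:
 *  - wait_forever == true:  report "would block" (real code would sleep here);
 *  - wait_forever == false: model a wait that timed out and carry on,
 *    which overwrites the only reference the ring has to the old fence.
 */
static bool toy_emit(struct toy_ring *ring, struct toy_fence *f,
		     bool wait_forever)
{
	unsigned int seq = ring->last_seq + 1;
	struct toy_fence **slot = &ring->slots[seq & TOY_SLOT_MASK];

	assert(f != NULL);
	if (*slot && !(*slot)->signaled && wait_forever)
		return false;

	f->seq = seq;
	f->signaled = false;
	*slot = f;
	ring->last_seq = seq;
	return true;
}

/* Signal every fence that is still reachable through the ring's slots. */
static void toy_force_completion(struct toy_ring *ring)
{
	for (size_t i = 0; i < TOY_NUM_SLOTS; i++) {
		if (ring->slots[i])
			ring->slots[i]->signaled = true;
		ring->slots[i] = NULL;
	}
}

/*
 * Hung engine: try to emit one fence more than there are slots, force
 * completion, and count the emitted fences that were never signaled.
 */
unsigned int toy_count_lost(bool wait_forever)
{
	struct toy_ring ring = { 0 };
	struct toy_fence fences[TOY_MAX_FENCES] = { 0 };
	unsigned int emitted = 0, lost = 0;

	for (unsigned int i = 0; i < TOY_NUM_SLOTS + 1; i++) {
		if (!toy_emit(&ring, &fences[emitted], wait_forever))
			break;          /* submission thread would be stuck here */
		emitted++;
	}

	toy_force_completion(&ring);

	for (unsigned int i = 0; i < emitted; i++)
		if (!fences[i].signaled)
			lost++;
	return lost;
}
```

Under these assumptions toy_count_lost(true) finds no lost fence, because the
submitter simply stops at the full slot, while toy_count_lost(false) leaves one
fence that force completion can no longer reach. Anything waiting on that fence,
or memory that is freed only once it signals, is then stuck or leaked.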

But this should never happen in practice and is only here as a
precaution. So do you really see this happen in practice?

Regards,
Christian.

>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Roy Sun <Roy.Sun at amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 48 ++++++++++++++++++++---
>>>    1 file changed, 43 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> index 6b0aeee61b8b..738ea65077ea 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> @@ -41,6 +41,8 @@
>>>    #include "amdgpu.h"
>>>    #include "amdgpu_trace.h"
>>>
>>> +#define AMDGPU_FENCE_TIMEOUT  msecs_to_jiffies(1000)
>>> +#define AMDGPU_FENCE_GFX_XGMI_TIMEOUT msecs_to_jiffies(2000)
>>>    /*
>>>     * Fences
>>>     * Fences mark an event in the GPUs pipeline and are used
>>> @@ -104,6 +106,38 @@ static void amdgpu_fence_write(struct amdgpu_ring *ring, u32 seq)
>>>    *drv->cpu_addr = cpu_to_le32(seq);
>>>    }
>>>
>>> +/**
>>> + * amdgpu_fence_wait_timeout - get the fence wait timeout
>>> + *
>>> + * @ring: ring the fence is associated with
>>> + *
>>> + * Returns the value of the fence wait timeout.
>>> + */
>>> +long amdgpu_fence_wait_timeout(struct amdgpu_ring *ring) {
>>> +	long tmo_gfx, tmo_mm, tmo;
>>> +	struct amdgpu_device *adev = ring->adev;
>>> +	tmo_mm = tmo_gfx = AMDGPU_FENCE_TIMEOUT;
>>> +	if (amdgpu_sriov_vf(adev)) {
>>> +		tmo_mm = 8 * AMDGPU_FENCE_TIMEOUT;
>>> +	}
>>> +	if (amdgpu_sriov_runtime(adev)) {
>>> +		tmo_gfx = 8 * AMDGPU_FENCE_TIMEOUT;
>>> +	} else if (adev->gmc.xgmi.hive_id) {
>>> +		tmo_gfx = AMDGPU_FENCE_GFX_XGMI_TIMEOUT;
>>> +	}
>>> +	if (ring->funcs->type == AMDGPU_RING_TYPE_UVD ||
>>> +	    ring->funcs->type == AMDGPU_RING_TYPE_VCE ||
>>> +	    ring->funcs->type == AMDGPU_RING_TYPE_UVD_ENC ||
>>> +	    ring->funcs->type == AMDGPU_RING_TYPE_VCN_DEC ||
>>> +	    ring->funcs->type == AMDGPU_RING_TYPE_VCN_ENC ||
>>> +	    ring->funcs->type == AMDGPU_RING_TYPE_VCN_JPEG)
>>> +		tmo = tmo_mm;
>>> +	else
>>> +		tmo = tmo_gfx;
>>> +	return tmo;
>>> +}
>>> +
>>>    /**
>>>     * amdgpu_fence_read - read a fence value
>>>     *
>>> @@ -166,10 +200,12 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f,
>>>    rcu_read_unlock();
>>>
>>>    if (old) {
>>> -r = dma_fence_wait(old, false);
>>> +long timeout;
>>> +timeout = amdgpu_fence_wait_timeout(ring);
>>> +r = dma_fence_wait_timeout(old, false, timeout);
>>>    dma_fence_put(old);
>>>    if (r)
>>> -return r;
>>> +return r < 0 ? r : 0;
>>>    }
>>>    }
>>>
>>> @@ -343,10 +379,12 @@ int amdgpu_fence_wait_empty(struct amdgpu_ring *ring)
>>>    return 0;
>>>    }
>>>    rcu_read_unlock();
>>> -
>>> -r = dma_fence_wait(fence, false);
>>> +
>>> +long timeout;
>>> +timeout = amdgpu_fence_wait_timeout(ring);
>>> +r = dma_fence_wait_timeout(fence, false, timeout);
>>>    dma_fence_put(fence);
>>> -return r;
>>> +return r < 0 ? r : 0;
>>>    }
>>>
>>>    /**


