[PATCH] drm/amdgpu: change the fence ring wait timeout

Mon Jan 18 16:03:43 UTC 2021

Am 18.01.21 um 12:56 schrieb Deng, Emily:
> [AMD Official Use Only - Internal Distribution Only]
>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig at amd.com>
>> Sent: Monday, January 18, 2021 3:49 PM
>> To: Deng, Emily <Emily.Deng at amd.com>; Sun, Roy <Roy.Sun at amd.com>;
>> amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: change the fence ring wait timeout
>>
>> Mhm, we could change amdgpu_fence_wait_empty() to timeout. But I think
>> that waiting forever here is intentional and the right thing to do.
>>
>> What happens is that we wait for the hardware to make sure that nothing is
>> writing to any memory before we unload the driver.
>>
>> Now the VCN block has crashed and doesn't respond, but we can't guarantee
>> that it is not accidentally writing anywhere.
>>
>> The only alternative we have is to time out and proceed with the driver unload,
>> risking corrupting the memory we free during that should the hardware
>> continue to do something.
> Hi Christian,
>      Thanks your suggestion, but still not quite clearly, could you detail the solution to avoid kernel not lockup?

Well as I said that the kernel locks up is intentional here.

Regards,
Christian.

>> Regards,
>> Christian.
>>
>> Am 18.01.21 um 03:01 schrieb Deng, Emily:
>>> [AMD Official Use Only - Internal Distribution Only]
>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>> Sent: Thursday, January 14, 2021 9:50 PM
>>>> To: Deng, Emily <Emily.Deng at amd.com>; Sun, Roy <Roy.Sun at amd.com>;
>>>> amd-gfx at lists.freedesktop.org
>>>> Subject: Re: [PATCH] drm/amdgpu: change the fence ring wait timeout
>>>>
>>>> Am 14.01.21 um 03:00 schrieb Deng, Emily:
>>>>> [AMD Official Use Only - Internal Distribution Only]
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
>>>>>> Christian König
>>>>>> Sent: Wednesday, January 13, 2021 10:03 PM
>>>>>> To: Sun, Roy <Roy.Sun at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>> Subject: Re: [PATCH] drm/amdgpu: change the fence ring wait timeout
>>>>>>
>>>>>> Am 13.01.21 um 07:36 schrieb Roy Sun:
>>>>>>> This fix bug where when the engine hang, the fence ring will wait
>>>>>>> without quit and cause kernel crash
>>>>>> NAK, this blocking is intentional unlimited because otherwise we
>>>>>> will cause a memory corruption.
>>>>>>
>>>>>> What is the actual bug you are trying to fix here?
>>>>> When some engine hang during initialization, the IB test will fail,
>>>>> and fence will never come back, then never could wait the fence back.
>>>>> Why we need to wait here forever? We'd better not use forever wait
>>>>> which
>>>> will cause kernel crash and lockup. And we have
>>>> amdgpu_fence_driver_force_completion will let memory free. We should
>>>> remove all those forever wait, and replace them with one timeout,
>>>> and do the correct error process if timeout.
>>>>
>>>> This wait here is to make sure we never overwrite the software fence
>>>> ring buffer. Without it we would not signal all fences in
>>>> amdgpu_fence_driver_force_completion() and cause either memory leak
>>>> or corruption.
>>>>
>>>> Waiting here forever is the right thing to do even when that means
>>>> that the submission thread gets stuck forever, cause that is still
>>>> better than memory corruption.
>>>>
>>>> But this should never happen in practice and is only here for
>>>> precaution. So do you really see that in practice?
>>> Yes, we hit the issue when vcn ib test fail. Could you give some suggestions
>> about how to fix this?
>>> [  958.301685] failed to read reg:1a6c0
>>>
>>> [  959.036645] gmc_v10_0_process_interrupt: 42 callbacks suppressed
>>>
>>> [  959.036653] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0
>>> ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
>>>
>>> [  959.038043] amdgpu 0000:00:07.0:   in page starting at address
>> 0x0000000000567000 from client 18
>>> [  959.039014] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0
>>> ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
>>>
>>> [  959.040202] amdgpu 0000:00:07.0:   in page starting at address
>> 0x0000000000567000 from client 18
>>> [  959.041174] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0
>>> ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
>>>
>>> [  959.042353] amdgpu 0000:00:07.0:   in page starting at address
>> 0x0000000000567000 from client 18
>>> [  959.043325] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0
>>> ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
>>>
>>> [  959.044508] amdgpu 0000:00:07.0:   in page starting at address
>> 0x0000000000567000 from client 18
>>> [  959.045480] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0
>>> ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
>>>
>>> [  959.046659] amdgpu 0000:00:07.0:   in page starting at address
>> 0x0000000000567000 from client 18
>>> [  959.047631] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0
>>> ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
>>>
>>> [  959.048815] amdgpu 0000:00:07.0:   in page starting at address
>> 0x0000000000567000 from client 18
>>> [  959.049787] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0
>>> ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
>>>
>>> [  959.050973] amdgpu 0000:00:07.0:   in page starting at address
>> 0x0000000000567000 from client 18
>>> [  959.051950] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0
>>> ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
>>>
>>> [  959.053123] amdgpu 0000:00:07.0:   in page starting at address
>> 0x0000000000567000 from client 18
>>> [  967.208705] amdgpu 0000:00:07.0: [drm:amdgpu_ib_ring_tests [amdgpu]]
>> *ERROR* IB test failed on vcn_enc0 (-110).
>>> [  967.209879] [drm:amdgpu_device_init [amdgpu]] *ERROR* ib ring test
>> failed (-110).
>>>
>>>
>>> [ 1209.384668] INFO: task modprobe:23957 blocked for more than 120
>> seconds.
>>> [ 1209.385605]       Tainted: G           OE     5.4.0-58-generic #64~18.04.1-
>> Ubuntu
>>> [ 1209.386451] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>>> [ 1209.387342] modprobe        D    0 23957   1221 0x80004006
>>>
>>> [ 1209.387344] Call Trace:
>>>
>>> [ 1209.387354]  __schedule+0x293/0x720
>>>
>>> [ 1209.387356]  schedule+0x33/0xa0
>>>
>>> [ 1209.387357]  schedule_timeout+0x1d3/0x320
>>>
>>> [ 1209.387359]  ? schedule+0x33/0xa0
>>>
>>> [ 1209.387360]  ? schedule_timeout+0x1d3/0x320
>>>
>>> [ 1209.387363]  dma_fence_default_wait+0x1b2/0x1e0
>>>
>>> [ 1209.387364]  ? dma_fence_release+0x130/0x130
>>>
>>> [ 1209.387366]  dma_fence_wait_timeout+0xfd/0x110
>>>
>>> [ 1209.387773]  amdgpu_fence_wait_empty+0x90/0xc0 [amdgpu]
>>>
>>> [ 1209.387839]  amdgpu_fence_driver_fini+0xd6/0x110 [amdgpu]
>>>> Regards,
>>>> Christian.
>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> Signed-off-by: Roy Sun <Roy.Sun at amd.com>
>>>>>>> ---
>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 48
>>>>>> ++++++++++++++++++++---
>>>>>>>      1 file changed, 43 insertions(+), 5 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> index 6b0aeee61b8b..738ea65077ea 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> @@ -41,6 +41,8 @@
>>>>>>>      #include "amdgpu.h"
>>>>>>>      #include "amdgpu_trace.h"
>>>>>>>
>>>>>>> +#define AMDGPU_FENCE_TIMEOUT  msecs_to_jiffies(1000) #define
>>>>>>> +AMDGPU_FENCE_GFX_XGMI_TIMEOUT msecs_to_jiffies(2000)
>>>>>>>      /*
>>>>>>>       * Fences
>>>>>>>       * Fences mark an event in the GPUs pipeline and are used @@
>>>>>>> -104,6
>>>>>>> +106,38 @@ static void amdgpu_fence_write(struct amdgpu_ring
>>>>>>> +*ring,
>>>>>>> +u32
>>>>>> seq)
>>>>>>>      *drv->cpu_addr = cpu_to_le32(seq);
>>>>>>>      }
>>>>>>>
>>>>>>> +/**
>>>>>>> + * amdgpu_fence_wait_timeout - get the fence wait timeout
>>>>>>> + *
>>>>>>> + * @ring: ring the fence is associated with
>>>>>>> + *
>>>>>>> + * Returns the value of the fence wait timeout.
>>>>>>> + */
>>>>>>> +long amdgpu_fence_wait_timeout(struct amdgpu_ring *ring) { long
>>>>>>> +tmo_gfx, tmo_mm, tmo; struct amdgpu_device *adev = ring->adev;
>>>>>>> +tmo_mm = tmo_gfx = AMDGPU_FENCE_TIMEOUT; if
>>>> (amdgpu_sriov_vf(adev))
>>>>>>> +{ tmo_mm = 8 * AMDGPU_FENCE_TIMEOUT; } if
>>>>>>> +(amdgpu_sriov_runtime(adev)) { tmo_gfx = 8 *
>>>> AMDGPU_FENCE_TIMEOUT;
>>>>>>> +} else if (adev->gmc.xgmi.hive_id) { tmo_gfx =
>>>>>>> +AMDGPU_FENCE_GFX_XGMI_TIMEOUT; } if (ring->funcs->type ==
>>>>>>> +AMDGPU_RING_TYPE_UVD ||
>>>>>>> +ring->funcs->type == AMDGPU_RING_TYPE_VCE || type ==
>>>>>>> +ring->funcs->AMDGPU_RING_TYPE_UVD_ENC || type ==
>>>>>>> +ring->funcs->AMDGPU_RING_TYPE_VCN_DEC || type ==
>>>>>>> +ring->funcs->AMDGPU_RING_TYPE_VCN_ENC || type ==
>>>>>>> +ring->funcs->AMDGPU_RING_TYPE_VCN_JPEG)
>>>>>>> +tmo = tmo_mm;
>>>>>>> +else
>>>>>>> +tmo = tmo_gfx;
>>>>>>> +return tmo;
>>>>>>> +}
>>>>>>> +
>>>>>>>      /**
>>>>>>>       * amdgpu_fence_read - read a fence value
>>>>>>>       *
>>>>>>> @@ -166,10 +200,12 @@ int amdgpu_fence_emit(struct amdgpu_ring
>>>>>>> *ring,
>>>>>> struct dma_fence **f,
>>>>>>>      rcu_read_unlock();
>>>>>>>
>>>>>>>      if (old) {
>>>>>>> -r = dma_fence_wait(old, false);
>>>>>>> +long timeout;
>>>>>>> +timeout = amdgpu_fence_wait_timeout(ring); r =
>>>>>>> +dma_fence_wait_timeout(old, false, timeout);
>>>>>>>      dma_fence_put(old);
>>>>>>>      if (r)
>>>>>>> -return r;
>>>>>>> +return r < 0 ? r : 0;
>>>>>>>      }
>>>>>>>      }
>>>>>>>
>>>>>>> @@ -343,10 +379,12 @@ int amdgpu_fence_wait_empty(struct
>>>>>> amdgpu_ring *ring)
>>>>>>>      return 0;
>>>>>>>      }
>>>>>>>      rcu_read_unlock();
>>>>>>> -
>>>>>>> -r = dma_fence_wait(fence, false);
>>>>>>> +
>>>>>>> +long timeout;
>>>>>>> +timeout = amdgpu_fence_wait_timeout(ring); r =
>>>>>>> +dma_fence_wait_timeout(fence, false, timeout);
>>>>>>>      dma_fence_put(fence);
>>>>>>> -return r;
>>>>>>> +return r < 0 ? r : 0;
>>>>>>>      }
>>>>>>>
>>>>>>>      /**
>>>>>> _______________________________________________
>>>>>> amd-gfx mailing list
>>>>>> amd-gfx at lists.freedesktop.org
>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fl
>>>>>> is
>>>>>> ts.f
>>>>>> reedesktop.org%2Fmailman%2Flistinfo%2Famd-
>>>>>>
>> gfx&data=04%7C01%7CEmily.Deng%40amd.com%7C8b116229938b463
>> df87f08d8b7cbf607%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7
>> C637461433936049544%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAw
>> MDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sda
>> ta=HOcLHmmblOUHXATFBl5HC6LOmFq0oXglAh2GFwd6sus%3D&reserve
>>>>>> d=0
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx