[PATCH] drm/amdgpu/sriov: give 8s for recover vram under RUNTIME

Thu Aug 9 07:01:44 UTC 2018

Hi Menzel,
     Thanks for your advice, will have a try.

Best wishes
Emily Deng

-----Original Message-----
From: Paul Menzel <pmenzel+amd-gfx at molgen.mpg.de> 
Sent: Thursday, August 9, 2018 2:47 PM
To: Deng, Emily <Emily.Deng at amd.com>
Cc: amd-gfx at lists.freedesktop.org; Liu, Monk <Monk.Liu at amd.com>
Subject: Re: [PATCH] drm/amdgpu/sriov: give 8s for recover vram under RUNTIME

Dear Emily,

Thank you for your reply.

Am 09.08.2018 um 04:11 schrieb Deng, Emily:
> Hi Paul,
>       Answers as below:

It’d be great if you had an mail user agent (email program) that supports quoting. Maybe you could configure Outlook(?) that way? That would tremendously useful for the mailing list archives too.

> -----Original Message-----
> From: Paul Menzel <pmenzel+amd-gfx at molgen.mpg.de>
> Sent: Wednesday, August 8, 2018 10:29 PM
> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Liu, Monk <Monk.Liu at amd.com>
> Subject: Re: [PATCH] drm/amdgpu/sriov: give 8s for recover vram under 
> RUNTIME

> On 08/08/18 04:13, Emily Deng wrote:

[…]

>> Extend the timeout for recovering vram bos from shadows on sr-iov to 
>> cover the worst case scenario for timeslices and VFs
>>
>> Under runtime, the wait fence time could be quite long when other VFs 
>> are in exclusive mode. For example, for 4 VF, every VF's exclusive 
>> timeout time is set to 3s, then the worst case is 9s. If the VF 
>> number is more than 4,then the worst case time will be longer.
> 
> Nit: Missing space after the comma.
> 
> How did you get to nine seconds? Isn’t it four times three, which is 
> twelve?
> [Emily] It is 3 times three, not 4, as the VF only need to wait the 
> other three VF's exclusive timeout.

Thank you, I didn’t know that.

(It looks like you missed my inline comments for the change itself
below.)

>> The 8s is the test data, with setting to 8s, it will pass the TDR 
>> test for 1000 times.
>>
>> SWDEV-161490
>>
>> Change-Id: Ifc32d56ca7fde01b1f4fe2b0db6959b51909008a
>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 1d933db..ef82ad1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3124,7 +3124,7 @@ static int amdgpu_device_handle_vram_lost(struct amdgpu_device *adev)
>>   	long tmo;
>>   
>>   	if (amdgpu_sriov_runtime(adev))
>> -		tmo = msecs_to_jiffies(amdgpu_lockup_timeout);
>> +		tmo = msecs_to_jiffies(8000);
> 
> Actually, I do not understand the change at all. Isn’t that a module 
> parameter?
> 
> ```
> $ git grep amdgpu_lockup_timeout
> drivers/gpu/drm/amd/amdgpu/amdgpu.h:extern int amdgpu_lockup_timeout;
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:     if (amdgpu_lockup_timeout == 0) {
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:             amdgpu_lockup_timeout = 10000;
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:             tmo = msecs_to_jiffies(amdgpu_lockup_timeout);
> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:int amdgpu_lockup_timeout = 
> 10000; drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:module_param_named(lockup_timeout, amdgpu_lockup_timeout, int, 0444);
> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:                      timeout = msecs_to_jiffies(amdgpu_lockup_timeout);
> drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c:  if (amdgpu_lockup_timeout == 
> 0) ```
> 
> So, if it’s not set, the time-out is set to 10(!) seconds anyway, 
> isn’t it? What am I missing?
> 
>>   	else
>>   		tmo = msecs_to_jiffies(100);
>>   
>>
> 
> In my opinion, such time-outs need to have a big FIXME added to it, 
> and VF’s exclusive time-outs should be drastically decreased.

Kind regards,

Paul