[PATCH] drm/amdgpu: Add gpu_recovery parameter

Liu, Monk Monk.Liu at amd.com
Thu Dec 14 07:19:12 UTC 2017


> Problem with this is that amdgpu_check_soft_reset will not be called, this function which prints which IP block was hung even when later we opt not to recover.
I suggest instead to add a bool force_reset parameter to amdgpu_gpu_recover which will override amdgpu_gpu_recovery and we can set it to true from amdgpu_debugfs_gpu_recover only.

[ML] why you need "check_soft_reset" be called ? I think soft reset checking is useless totally ... because with TDR feature, the only thing can 
Tell  you if GPU hang is time out warning.

For soft checking, it only shows you if some IP is busy or not, but busy may not prove the engine is hang , it may just busy .... 


BR Monk

-----Original Message-----
From: Grodzovsky, Andrey 
Sent: 2017年12月13日 20:53
To: Koenig, Christian <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Liu, Monk <Monk.Liu at amd.com>; maraeo at gmail.com
Subject: Re: [PATCH] drm/amdgpu: Add gpu_recovery parameter



On 12/13/2017 07:20 AM, Christian König wrote:
> Am 12.12.2017 um 20:16 schrieb Andrey Grodzovsky:
>> Add new parameter to control GPU recovery procedure.
>> Retire old way of disabling GPU recovery by setting lockup_timeout ==
>> 0 and
>> set default for lockup_timeout to 10s.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 8 ++++++--
>>   3 files changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 3735500..26abe03 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -126,6 +126,7 @@ extern int amdgpu_param_buf_per_se;
>>   extern int amdgpu_job_hang_limit;
>>   extern int amdgpu_lbpw;
>>   extern int amdgpu_compute_multipipe;
>> +extern int amdgpu_gpu_recovery;
>>     #ifdef CONFIG_DRM_AMDGPU_SI
>>   extern int amdgpu_si_support;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 8d03baa..d84b57a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3030,6 +3030,11 @@ int amdgpu_gpu_recover(struct amdgpu_device 
>> *adev, struct amdgpu_job *job)
>>           return 0;
>>       }
>>   +    if (!amdgpu_gpu_recovery) {
>> +        DRM_INFO("GPU recovery disabled.\n");
>> +        return 0;
>> +    }
>> +
>
> Please move this check into the caller of amdgpu_gpu_recover().
>
> This way we can still trigger a GPU recovery manually or from the 
> hypervisor under SRIOV.
>
> Christian.

Problem with this is that amdgpu_check_soft_reset will not be called, this function which prints which IP block was hung even when later we opt not to recover.
I suggest instead to add a bool force_reset parameter to amdgpu_gpu_recover which will override amdgpu_gpu_recovery and we can set it to true from amdgpu_debugfs_gpu_recover only.

Thanks,
Andrey

>
>>       dev_info(adev->dev, "GPU reset begin!\n");
>>         mutex_lock(&adev->lock_reset); diff --git 
>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index 0b039bd..5c612e9 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -90,7 +90,7 @@ int amdgpu_disp_priority = 0;
>>   int amdgpu_hw_i2c = 0;
>>   int amdgpu_pcie_gen2 = -1;
>>   int amdgpu_msi = -1;
>> -int amdgpu_lockup_timeout = 0;
>> +int amdgpu_lockup_timeout = 10000;
>>   int amdgpu_dpm = -1;
>>   int amdgpu_fw_load_type = -1;
>>   int amdgpu_aspm = -1;
>> @@ -128,6 +128,7 @@ int amdgpu_param_buf_per_se = 0;
>>   int amdgpu_job_hang_limit = 0;
>>   int amdgpu_lbpw = -1;
>>   int amdgpu_compute_multipipe = -1;
>> +int amdgpu_gpu_recovery = 1;
>>     MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in 
>> megabytes");
>>   module_param_named(vramlimit, amdgpu_vram_limit, int, 0600); @@ 
>> -165,7 +166,7 @@ module_param_named(pcie_gen2, amdgpu_pcie_gen2, int, 
>> 0444);
>>   MODULE_PARM_DESC(msi, "MSI support (1 = enable, 0 = disable, -1 = 
>> auto)");
>>   module_param_named(msi, amdgpu_msi, int, 0444);
>>   -MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms 
>> (default 0 = disable)");
>> +MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default
>> 10000)");
>>   module_param_named(lockup_timeout, amdgpu_lockup_timeout, int, 
>> 0444);
>>     MODULE_PARM_DESC(dpm, "DPM support (1 = enable, 0 = disable, -1 = 
>> auto)"); @@ -280,6 +281,9 @@ module_param_named(lbpw, amdgpu_lbpw, 
>> int, 0444);
>>   MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be 
>> spread across pipes (1 = enable, 0 = disable, -1 = auto)");
>>   module_param_named(compute_multipipe, amdgpu_compute_multipipe, 
>> int, 0444);
>>   +MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 
>> = enable (default) , 0 = disable");
>> +module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
>> +
>>   #ifdef CONFIG_DRM_AMDGPU_SI
>>     #if defined(CONFIG_DRM_RADEON) || 
>> defined(CONFIG_DRM_RADEON_MODULE)
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx



More information about the amd-gfx mailing list