[PATCH] drm/amd: Make SW CTF handler cope with different read_sensor() results

Fri Dec 15 12:13:01 UTC 2023

On 12/14/2023 10:15 PM, Mario Limonciello wrote:
> The SW CTF handler assumes that the read_sensor() call always succeeds
> and has updated `hotspot_tmp`, but this may not be guaranteed.
> 
> For example, some of the read_sensor() callbacks return 0 when a RAS
> interrupt is triggered, in which case `hotspot_tmp` won't be updated.
> 

The fix belongs in read_sensor() itself: it should return -EBUSY or a
similar error code when the value cannot be read. That would also help
the other callers of read_sensor().
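
As a rough illustration of that idea, here is a minimal userspace sketch,
not the actual amdgpu code. The names fake_sensor, fake_read_hotspot() and
swctf_should_shutdown() are hypothetical and exist only for this example.
It assumes the callback reports -EBUSY instead of 0 when no reading is
available, and that the caller keeps the behavior of proceeding to
shutdown when the temperature cannot be confirmed as below the limit.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a sensor backend; not the amdgpu API. */
struct fake_sensor {
	bool ras_interrupt_pending;	/* models the RAS interrupt case */
	uint32_t hotspot_mc;		/* hotspot temperature in millidegrees C */
};

/*
 * Hypothetical read callback: returns 0 and fills *value on success,
 * or -EBUSY (leaving *value untouched) when no reading is available.
 */
static int fake_read_hotspot(const struct fake_sensor *s, uint32_t *value)
{
	if (s->ras_interrupt_pending)
		return -EBUSY;

	*value = s->hotspot_mc;
	return 0;
}

/*
 * Returns true when the shutdown path should be taken. *read_err is set
 * to the read error (or 0) so the caller can warn only on a real failure.
 */
static bool swctf_should_shutdown(const struct fake_sensor *s,
				  uint32_t shutdown_temp_c, int *read_err)
{
	uint32_t hotspot_mc = 0;
	int r;

	*read_err = 0;

	/* No threshold configured: nothing can be confirmed, so proceed. */
	if (!shutdown_temp_c)
		return true;

	r = fake_read_hotspot(s, &hotspot_mc);
	if (r) {
		/* Reading failed: report why and stay on the shutdown path. */
		*read_err = r;
		return true;
	}

	/* Valid reading: shut down only if it is at or above the limit. */
	return hotspot_mc / 1000 >= shutdown_temp_c;
}
```

With the error carried in the return value, the caller no longer needs to
infer a failed read from `hotspot_tmp` being 0, and a "failed to read"
warning can be limited to the case where *read_err is non-zero.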

Regards,
Lijo

> Adjust the logic to catch this circumstance and output a warning.
> 
> Signed-off-by: Mario Limonciello <mario.limonciello at amd.com>
> ---
>   drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 20 +++++++++++---------
>   1 file changed, 11 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> index e1a5ee911dbb..5473fda5c6aa 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> @@ -1163,21 +1163,23 @@ static void smu_swctf_delayed_work_handler(struct work_struct *work)
>   	struct smu_temperature_range *range =
>   				&smu->thermal_range;
>   	struct amdgpu_device *adev = smu->adev;
> -	uint32_t hotspot_tmp, size;
> +	uint32_t hotspot_tmp = 0, size;
>   
>   	/*
>   	 * If the hotspot temperature is confirmed as below SW CTF setting point
>   	 * after the delay enforced, nothing will be done.
>   	 * Otherwise, a graceful shutdown will be performed to prevent further damage.
>   	 */
> -	if (range->software_shutdown_temp &&
> -	    smu->ppt_funcs->read_sensor &&
> -	    !smu->ppt_funcs->read_sensor(smu,
> -					 AMDGPU_PP_SENSOR_HOTSPOT_TEMP,
> -					 &hotspot_tmp,
> -					 &size) &&
> -	    hotspot_tmp / 1000 < range->software_shutdown_temp)
> -		return;
> +	if (range->software_shutdown_temp && smu->ppt_funcs->read_sensor) {
> +		int r = smu->ppt_funcs->read_sensor(smu,
> +						    AMDGPU_PP_SENSOR_HOTSPOT_TEMP,
> +						    &hotspot_tmp,
> +						    &size);
> +		if (!r && hotspot_tmp &&
> +		    (hotspot_tmp / 1000 < range->software_shutdown_temp))
> +			return;
> +		dev_warn(adev->dev, "Failed to read hotspot temperature: %d\n", r);
> +	}
>   
>   	dev_emerg(adev->dev, "ERROR: GPU over temperature range(SW CTF) detected!\n");
>   	dev_emerg(adev->dev, "ERROR: System is going to shutdown due to GPU SW CTF!\n");