[PATCH 2/2] drm/amdgpu: add support for SMU debug option

Quan, Evan Evan.Quan at amd.com
Fri Dec 10 02:07:02 UTC 2021


[AMD Official Use Only]



> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Lang
> Yu
> Sent: Thursday, December 9, 2021 4:49 PM
> To: amd-gfx at lists.freedesktop.org
> Cc: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Lazar, Lijo
> <Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>; Deucher,
> Alexander <Alexander.Deucher at amd.com>; Yu, Lang <Lang.Yu at amd.com>;
> Koenig, Christian <Christian.Koenig at amd.com>
> Subject: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
> 
> SMU firmware guys expect the driver maintains error context
> and doesn't interact with SMU any more when SMU errors occurred.
> That will aid in debugging SMU firmware issues.
> 
> Add SMU debug option support for this request, it can be
> enabled or disabled via amdgpu_smu_debug debugfs file.
> When enabled, it brings hardware to a kind of halt state
> so that no one can touch it any more in the envent of SMU
> errors.
> 
> Currently, dirver interacts with SMU via sending messages.
> And threre are three ways to sending messages to SMU.
> Handle them respectively as following:
> 
> 1, smu_cmn_send_smc_msg_with_param() for normal timeout cases
> 
>   Halt on any error.
> 
> 2, smu_cmn_send_msg_without_waiting()/smu_cmn_wait_for_response()
> for longer timeout cases
> 
>   Halt on errors apart from ETIME. Otherwise this way won't work.
> 
> 3, smu_cmn_send_msg_without_waiting() for no waiting cases
> 
>   Halt on errors apart from ETIME. Otherwise second way won't work.
> 
> After halting, use BUG() to explicitly notify users.
> 
> == Command Guide ==
> 
> 1, enable SMU debug option
> 
>  # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> 
> 2, disable SMU debug option
> 
>  # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> 
> v4:
>  - Set to halt state instead of a simple hang.(Christian)
> 
> v3:
>  - Use debugfs_create_bool().(Christian)
>  - Put variable into smu_context struct.
>  - Don't resend command when timeout.
> 
> v2:
>  - Resend command when timeout.(Lijo)
>  - Use debugfs file instead of module parameter.
> 
> Signed-off-by: Lang Yu <lang.yu at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  3 +++
>  drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h     |  5 +++++
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c      | 20
> +++++++++++++++++++-
>  3 files changed, 27 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index 164d6a9e9fbb..86cd888c7822 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device
> *adev)
>  	if (!debugfs_initialized())
>  		return 0;
> 
> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
> +				  &adev->smu.smu_debug_mode);
> +
>  	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
>  				  &fops_ib_preempt);
>  	if (IS_ERR(ent)) {
> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> index f738f7dc20c9..50dbf5594a9d 100644
> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> @@ -569,6 +569,11 @@ struct smu_context
>  	struct smu_user_dpm_profile user_dpm_profile;
> 
>  	struct stb_context stb_context;
> +	/*
> +	 * When enabled, it makes SMU errors fatal.
> +	 * (0 = disabled (default), 1 = enabled)
> +	 */
> +	bool smu_debug_mode;
[Quan, Evan] Can you expand this to bit mask(as ppfeaturemask)? So that in future we can add support for other debug features.
>  };
> 
>  struct i2c_adapter;
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> index 048ca1673863..84016d22c075 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> @@ -272,6 +272,11 @@ int smu_cmn_send_msg_without_waiting(struct
> smu_context *smu,
>  	__smu_cmn_send_msg(smu, msg_index, param);
>  	res = 0;
>  Out:
> +	if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
> +		amdgpu_device_halt(smu->adev);
> +		BUG();
[Quan, Evan] I agree amdgpu_device_halt() is a good idea. Christian and Andrey can share you more insights about that.
Do we still need the "BUG()" then? 

BR
Evan
> +	}
> +
>  	return res;
>  }
> 
> @@ -288,9 +293,17 @@ int smu_cmn_send_msg_without_waiting(struct
> smu_context *smu,
>  int smu_cmn_wait_for_response(struct smu_context *smu)
>  {
>  	u32 reg;
> +	int res;
> 
>  	reg = __smu_cmn_poll_stat(smu);
> -	return __smu_cmn_reg2errno(smu, reg);
> +	res = __smu_cmn_reg2errno(smu, reg);
> +
> +	if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
> +		amdgpu_device_halt(smu->adev);
> +		BUG();
> +	}
> +
> +	return res;
>  }
> 
>  /**
> @@ -357,6 +370,11 @@ int smu_cmn_send_smc_msg_with_param(struct
> smu_context *smu,
>  	if (read_arg)
>  		smu_cmn_read_arg(smu, read_arg);
>  Out:
> +	if (unlikely(smu->smu_debug_mode) && res) {
> +		amdgpu_device_halt(smu->adev);
> +		BUG();
> +	}
> +
>  	mutex_unlock(&smu->message_lock);
>  	return res;
>  }
> --
> 2.25.1


More information about the amd-gfx mailing list