[PATCH 2/2] drm/amdgpu: add support for SMU debug option

Lang Yu Lang.Yu at amd.com
Fri Dec 10 07:49:24 UTC 2021


On 12/10/ , Christian KKKnig wrote:
> Am 10.12.21 um 04:21 schrieb Lang Yu:
> > On 12/10/ , Quan, Evan wrote:
> > > [AMD Official Use Only]
> > > 
> > > 
> > > 
> > > > -----Original Message-----
> > > > From: Yu, Lang <Lang.Yu at amd.com>
> > > > Sent: Friday, December 10, 2021 10:34 AM
> > > > To: Quan, Evan <Evan.Quan at amd.com>
> > > > Cc: amd-gfx at lists.freedesktop.org; Grodzovsky, Andrey
> > > > <Andrey.Grodzovsky at amd.com>; Lazar, Lijo <Lijo.Lazar at amd.com>; Huang,
> > > > Ray <Ray.Huang at amd.com>; Deucher, Alexander
> > > > <Alexander.Deucher at amd.com>; Koenig, Christian
> > > > <Christian.Koenig at amd.com>
> > > > Subject: Re: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
> > > > 
> > > > On 12/10/ , Quan, Evan wrote:
> > > > > [AMD Official Use Only]
> > > > > 
> > > > > 
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
> > > > > > Lang Yu
> > > > > > Sent: Thursday, December 9, 2021 4:49 PM
> > > > > > To: amd-gfx at lists.freedesktop.org
> > > > > > Cc: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Lazar, Lijo
> > > > > > <Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>; Deucher,
> > > > > > Alexander <Alexander.Deucher at amd.com>; Yu, Lang
> > > > <Lang.Yu at amd.com>;
> > > > > > Koenig, Christian <Christian.Koenig at amd.com>
> > > > > > Subject: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
> > > > > > 
> > > > > > SMU firmware guys expect the driver maintains error context and
> > > > > > doesn't interact with SMU any more when SMU errors occurred.
> > > > > > That will aid in debugging SMU firmware issues.
> > > > > > 
> > > > > > Add SMU debug option support for this request, it can be enabled or
> > > > > > disabled via amdgpu_smu_debug debugfs file.
> > > > > > When enabled, it brings hardware to a kind of halt state so that no
> > > > > > one can touch it any more in the envent of SMU errors.
> > > > > > 
> > > > > > Currently, dirver interacts with SMU via sending messages.
> > > > > > And threre are three ways to sending messages to SMU.
> > > > > > Handle them respectively as following:
> > > > > > 
> > > > > > 1, smu_cmn_send_smc_msg_with_param() for normal timeout cases
> > > > > > 
> > > > > >    Halt on any error.
> > > > > > 
> > > > > > 2,
> > > > smu_cmn_send_msg_without_waiting()/smu_cmn_wait_for_response()
> > > > > > for longer timeout cases
> > > > > > 
> > > > > >    Halt on errors apart from ETIME. Otherwise this way won't work.
> > > > > > 
> > > > > > 3, smu_cmn_send_msg_without_waiting() for no waiting cases
> > > > > > 
> > > > > >    Halt on errors apart from ETIME. Otherwise second way won't work.
> > > > > > 
> > > > > > After halting, use BUG() to explicitly notify users.
> > > > > > 
> > > > > > == Command Guide ==
> > > > > > 
> > > > > > 1, enable SMU debug option
> > > > > > 
> > > > > >   # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> > > > > > 
> > > > > > 2, disable SMU debug option
> > > > > > 
> > > > > >   # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> > > > > > 
> > > > > > v4:
> > > > > >   - Set to halt state instead of a simple hang.(Christian)
> > > > > > 
> > > > > > v3:
> > > > > >   - Use debugfs_create_bool().(Christian)
> > > > > >   - Put variable into smu_context struct.
> > > > > >   - Don't resend command when timeout.
> > > > > > 
> > > > > > v2:
> > > > > >   - Resend command when timeout.(Lijo)
> > > > > >   - Use debugfs file instead of module parameter.
> > > > > > 
> > > > > > Signed-off-by: Lang Yu <lang.yu at amd.com>
> > > > > > ---
> > > > > >   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  3 +++
> > > > > >   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h     |  5 +++++
> > > > > >   drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c      | 20
> > > > > > +++++++++++++++++++-
> > > > > >   3 files changed, 27 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > > > > index 164d6a9e9fbb..86cd888c7822 100644
> > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > > > > @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct
> > > > amdgpu_device
> > > > > > *adev)
> > > > > >   	if (!debugfs_initialized())
> > > > > >   		return 0;
> > > > > > 
> > > > > > +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
> > > > > > +				  &adev->smu.smu_debug_mode);
> > > > > > +
> > > > > >   	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
> > > > > >   				  &fops_ib_preempt);
> > > > > >   	if (IS_ERR(ent)) {
> > > > > > diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > > > > b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > > > > index f738f7dc20c9..50dbf5594a9d 100644
> > > > > > --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > > > > +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > > > > @@ -569,6 +569,11 @@ struct smu_context
> > > > > >   	struct smu_user_dpm_profile user_dpm_profile;
> > > > > > 
> > > > > >   	struct stb_context stb_context;
> > > > > > +	/*
> > > > > > +	 * When enabled, it makes SMU errors fatal.
> > > > > > +	 * (0 = disabled (default), 1 = enabled)
> > > > > > +	 */
> > > > > > +	bool smu_debug_mode;
> > > > > [Quan, Evan] Can you expand this to bit mask(as ppfeaturemask)? So that
> > > > in future we can add support for other debug features.
> > > > > >   };
> > > > OK.
> > > > 
> > > > > >   struct i2c_adapter;
> > > > > > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > > > > b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > > > > index 048ca1673863..84016d22c075 100644
> > > > > > --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > > > > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > > > > @@ -272,6 +272,11 @@ int smu_cmn_send_msg_without_waiting(struct
> > > > > > smu_context *smu,
> > > > > >   	__smu_cmn_send_msg(smu, msg_index, param);
> > > > > >   	res = 0;
> > > > > >   Out:
> > > > > > +	if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
> > > > > > +		amdgpu_device_halt(smu->adev);
> > > > > > +		BUG();
> > > > > [Quan, Evan] I agree amdgpu_device_halt() is a good idea. Christian and
> > > > Andrey can share you more insights about that.
> > > > > Do we still need the "BUG()" then?
> > > > The BUG() is used to explicitly notify users something went wrong.
> > > > Otherwise userspace may not know immediately.
> > > > FW guys request this in ticket.
> > > [Quan, Evan] Won't drm_dev_unplug() and pci_disable_device() used in amdgpu_device_halt throw some errors(on user's further attempt to communicate with our driver)?
> > > Also if the purpose is to raise user's concern, WARN() may be a more gentle way?
> >  From my testing and observation, it depends on what the driver will do next.
> > Probably trigger a page fault. If we don't connect a monitor but SSH access.
> > We don't know what happended until something like a page fault is triggered.
> > 
> > But here what I want to do is throwing the error immediately to userspace.
> > If using WARN(), the user need to poll dmesg to see if something went wrong.
> 
> I agree with Evan as well. Please use WARN() or WARN_ON() here instead.
> 
> BUG() and BUG_ON() is only allowed when you prevent further data corruption
> by intentionally crashing the kernel thread. But that is really invasive
> because for example locks are not necessarily dropped and so can crash the
> whole system.

Ok, will use WARN().

Regards,
Lang

> Regards,
> Christian.
> 
> > 
> > Regards,
> > Lang
> > 
> > > BR
> > > Evan
> > > > Regards,
> > > > Lang
> > > > 
> > > > > BR
> > > > > Evan
> > > > > > +	}
> > > > > > +
> > > > > >   	return res;
> > > > > >   }
> > > > > > 
> > > > > > @@ -288,9 +293,17 @@ int smu_cmn_send_msg_without_waiting(struct
> > > > > > smu_context *smu,
> > > > > >   int smu_cmn_wait_for_response(struct smu_context *smu)  {
> > > > > >   	u32 reg;
> > > > > > +	int res;
> > > > > > 
> > > > > >   	reg = __smu_cmn_poll_stat(smu);
> > > > > > -	return __smu_cmn_reg2errno(smu, reg);
> > > > > > +	res = __smu_cmn_reg2errno(smu, reg);
> > > > > > +
> > > > > > +	if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
> > > > > > +		amdgpu_device_halt(smu->adev);
> > > > > > +		BUG();
> > > > > > +	}
> > > > > > +
> > > > > > +	return res;
> > > > > >   }
> > > > > > 
> > > > > >   /**
> > > > > > @@ -357,6 +370,11 @@ int
> > > > smu_cmn_send_smc_msg_with_param(struct
> > > > > > smu_context *smu,
> > > > > >   	if (read_arg)
> > > > > >   		smu_cmn_read_arg(smu, read_arg);
> > > > > >   Out:
> > > > > > +	if (unlikely(smu->smu_debug_mode) && res) {
> > > > > > +		amdgpu_device_halt(smu->adev);
> > > > > > +		BUG();
> > > > > > +	}
> > > > > > +
> > > > > >   	mutex_unlock(&smu->message_lock);
> > > > > >   	return res;
> > > > > >   }
> > > > > > --
> > > > > > 2.25.1
> 


More information about the amd-gfx mailing list