[PATCH] drm/amdgpu: resove reboot exception for si oland

Chen, Guchun Guchun.Chen at amd.com
Mon Mar 13 02:04:25 UTC 2023


Thanks for sharing the previous context on this.

Checking the code further, I found something interesting about the early-return check on dpm_enabled in si_dpm_late_init.
The sequence that enables the temperature range during boot is:
1. In the IP hw_init (si_dpm_hw_init), which runs ahead of late_init, the temperature range is set as part of si_thermal_start_thermal_controller.
2. si_dpm_hw_init sets adev->pm.dpm_enabled to true unconditionally.
3. In si_dpm_late_init, the temperature range setting is still executed, because the check there is "if (!adev->pm.dpm_enabled) return 0". It looks like we should skip it when dpm, including the temperature range, has already been set up.

So my guess is that the random failure in enabling/disabling the thermal alert happens because the amdgpu driver does not check the return value when setting the temperature range in the hw_init phase. The FW may occasionally not have finished that process yet when the driver immediately issues the same setting cycle again, so the FW complains and returns an error code to the driver. This may explain why a delay works in this case. Or am I misunderstanding this because of my limited knowledge here?

Hi Zhenneng,

Additionally, can you please try to modify the check to return early in si_dpm_late_init when adev->pm.dpm_enabled is true?

[Also, I dropped some public mailing lists, as this issue looks amdgpu driver specific] :)

> -----Original Message-----
> From: 李真能 <lizhenneng at kylinos.cn>
> Sent: Monday, March 13, 2023 9:05 AM
> To: Chen, Guchun <Guchun.Chen at amd.com>; Deucher, Alexander
> <Alexander.Deucher at amd.com>
> Cc: David Airlie <airlied at linux.ie>; Pan, Xinhui <Xinhui.Pan at amd.com>;
> linux-kernel at vger.kernel.org; dri-devel at lists.freedesktop.org; amd-
> gfx at lists.freedesktop.org; Daniel Vetter <daniel at ffwll.ch>; Koenig, Christian
> <Christian.Koenig at amd.com>
> Subject: Re: [PATCH] drm/amdgpu: resove reboot exception for si oland
> 
> This bug is first reported here:
> 
> https://lore.kernel.org/lkml/1a620e7c-5b71-3d16-001a-0d79b292aca7@amd.com/
> 
> I modified the patch according to the mailing list discussion, and I ran
> the reboot test tens of thousands of times on about 10 arm64 machines; no
> bug was reported.
> 
> On 2023/3/10 16:18, Chen, Guchun wrote:
> >> -----Original Message-----
> >> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
> >> Zhenneng Li
> >> Sent: Friday, March 10, 2023 3:40 PM
> >> To: Deucher, Alexander <Alexander.Deucher at amd.com>
> >> Cc: David Airlie <airlied at linux.ie>; Pan, Xinhui
> >> <Xinhui.Pan at amd.com>; linux-kernel at vger.kernel.org;
> >> dri-devel at lists.freedesktop.org; Zhenneng Li <lizhenneng at kylinos.cn>;
> >> amd-gfx at lists.freedesktop.org; Daniel Vetter <daniel at ffwll.ch>;
> >> Koenig, Christian <Christian.Koenig at amd.com>
> >> Subject: [PATCH] drm/amdgpu: resove reboot exception for si oland
> >>
> >> During reboot tests on an arm64 platform, boot may fail.
> >>
> >> The error messages are as follows:
> >> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block <si_dpm> failed -22
> >> [    7.006919][ 7] [  T295] amdgpu 0000:04:00.0: amdgpu_device_ip_late_init failed
> >> [    7.014224][ 7] [  T295] amdgpu 0000:04:00.0: Fatal error during GPU init
> >> ---
> >>   drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c | 3 ---
> >>   1 file changed, 3 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> >> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> >> index d6d9e3b1b2c0..dee51c757ac0 100644
> >> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> >> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> >> @@ -7632,9 +7632,6 @@ static int si_dpm_late_init(void *handle)
> >>   	if (!adev->pm.dpm_enabled)
> >>   		return 0;
> >>
> >> -	ret = si_set_temperature_range(adev);
> >> -	if (ret)
> >> -		return ret;
> > si_set_temperature_range should be platform agnostic. Can you please
> > elaborate more?
> >
> > Regards,
> > Guchun
> >
> >>   #if 0 //TODO ?
> >>   	si_dpm_powergate_uvd(adev, true);
> >>   #endif
> >> --
> >> 2.25.1

