[PATCH] drm/amdgpu: add mb for si

Fri Nov 25 02:13:46 UTC 2022

[AMD Official Use Only - General]


> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar at amd.com>
> Sent: Thursday, November 24, 2022 6:49 PM
> To: Quan, Evan <Evan.Quan at amd.com>; 李真能 <lizhenneng at kylinos.cn>;
> Michel Dänzer <michel.daenzer at mailbox.org>; Koenig, Christian
> <Christian.Koenig at amd.com>; Deucher, Alexander
> <Alexander.Deucher at amd.com>
> Cc: amd-gfx at lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan at amd.com>;
> linux-kernel at vger.kernel.org; dri-devel at lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: add mb for si
> 
> 
> 
> On 11/24/2022 4:11 PM, Lazar, Lijo wrote:
> >
> >
> > On 11/24/2022 3:34 PM, Quan, Evan wrote:
> >> [AMD Official Use Only - General]
> >>
> >> Could the attached patch help?
> >>
> >> Evan
> >>> -----Original Message-----
> >>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf
> Of ???
> >>> Sent: Friday, November 18, 2022 5:25 PM
> >>> To: Michel Dänzer <michel.daenzer at mailbox.org>; Koenig, Christian
> >>> <Christian.Koenig at amd.com>; Deucher, Alexander
> >>> <Alexander.Deucher at amd.com>
> >>> Cc: amd-gfx at lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan at amd.com>;
> >>> linux-kernel at vger.kernel.org; dri-devel at lists.freedesktop.org
> >>> Subject: Re: [PATCH] drm/amdgpu: add mb for si
> >>>
> >>>
> >>> 在 2022/11/18 17:18, Michel Dänzer 写道:
> >>>> On 11/18/22 09:01, Christian König wrote:
> >>>>> Am 18.11.22 um 08:48 schrieb Zhenneng Li:
> >>>>>> During reboot test on arm64 platform, it may failure on boot, so
> >>>>>> add this mb in smc.
> >>>>>>
> >>>>>> The error message are as follows:
> >>>>>> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init
> >>>>>> [amdgpu]] *ERROR*
> >>>>>>                   late_init of IP block <si_dpm> failed -22 [
> >>>>>> 7.006919][ 7] [  T295] amdgpu 0000:04:00.0:
> >
> > The issue is happening in late_init() which eventually does
> >
> >      ret = si_thermal_enable_alert(adev, false);
> >
> > Just before this, si_thermal_start_thermal_controller is called in
> > hw_init and that enables thermal alert.
> >
> > Maybe the issue is with enable/disable of thermal alerts in quick
> > succession. Adding a delay inside si_thermal_start_thermal_controller
> > might help.
> >
> 
> On a second look, temperature range is already set as part of
> si_thermal_start_thermal_controller in hw_init
> https://elixir.bootlin.com/linux/v6.1-
> rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L6780
> 
> There is no need to set it again here -
> 
> https://elixir.bootlin.com/linux/v6.1-
> rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L7635
> 
> I think it is safe to remove the call from late_init altogether. Alex/Evan?
> 
[Quan, Evan] Yes, it makes sense to me. But I'm not sure whether that’s related with the issue here.
Since per my understandings, if the issue is caused by double calling of thermal_alert enablement, it will fail every time.
That cannot explain why adding some delays or a mb() calling can help.

BR
Evan
> Thanks,
> Lijo
> 
> > Thanks,
> > Lijo
> >
> >>>>>> amdgpu_device_ip_late_init failed [    7.014224][ 7] [  T295] amdgpu
> >>>>>> 0000:04:00.0: Fatal error during GPU init
> >>>>> Memory barries are not supposed to be sprinkled around like this,
> you
> >>> need to give a detailed explanation why this is necessary.
> >>>>>
> >>>>> Regards,
> >>>>> Christian.
> >>>>>
> >>>>>> Signed-off-by: Zhenneng Li <lizhenneng at kylinos.cn>
> >>>>>> ---
> >>>>>>     drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
> >>>>>>     1 file changed, 2 insertions(+)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>>> index 8f994ffa9cd1..c7656f22278d 100644
> >>>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct
> >>>>>> amdgpu_device *adev)
> >>>>>>         u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
> >>>>>>         u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
> >>>>>>     +    mb();
> >>>>>> +
> >>>>>>         if (!(rst & RST_REG) && !(clk & CK_DISABLE))
> >>>>>>             return true;
> >>>> In particular, it makes no sense in this specific place, since it
> >>>> cannot directly
> >>> affect the values of rst & clk.
> >>>
> >>> I thinks so too.
> >>>
> >>> But when I do reboot test using nine desktop machines,  there maybe
> >>> report
> >>> this error on one or two machines after Hundreds of times or
> >>> Thousands of
> >>> times reboot test, at the beginning, I use msleep() instead of mb(),
> >>> these
> >>> two methods are all works, but I don't know what is the root case.
> >>>
> >>> I use this method on other verdor's oland card, this error message are
> >>> reported again.
> >>>
> >>> What could be the root reason?
> >>>
> >>> test environmen:
> >>>
> >>> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87
> >>>
> >>> driver: amdgpu
> >>>
> >>> os: ubuntu 2004
> >>>
> >>> platform: arm64
> >>>
> >>> kernel: 5.4.18
> >>>
> >>>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: winmail.dat
Type: application/ms-tnef
Size: 18422 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20221125/7eb06a1a/attachment-0001.bin>