[PATCH] drm/amdgpu: add mb for si

Fri Nov 25 06:12:44 UTC 2022

On 11/25/2022 7:43 AM, Quan, Evan wrote:
> [AMD Official Use Only - General]
>
>
>
>> -----Original Message-----
>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>> Sent: Thursday, November 24, 2022 6:49 PM
>> To: Quan, Evan <Evan.Quan at amd.com>; 李真能 <lizhenneng at kylinos.cn>;
>> Michel Dänzer <michel.daenzer at mailbox.org>; Koenig, Christian
>> <Christian.Koenig at amd.com>; Deucher, Alexander
>> <Alexander.Deucher at amd.com>
>> Cc: amd-gfx at lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan at amd.com>;
>> linux-kernel at vger.kernel.org; dri-devel at lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: add mb for si
>>
>>
>>
>> On 11/24/2022 4:11 PM, Lazar, Lijo wrote:
>>>
>>> On 11/24/2022 3:34 PM, Quan, Evan wrote:
>>>> [AMD Official Use Only - General]
>>>>
>>>> Could the attached patch help?
>>>>
>>>> Evan
>>>>> -----Original Message-----
>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of 李真能
>>>>> Sent: Friday, November 18, 2022 5:25 PM
>>>>> To: Michel Dänzer <michel.daenzer at mailbox.org>; Koenig, Christian
>>>>> <Christian.Koenig at amd.com>; Deucher, Alexander
>>>>> <Alexander.Deucher at amd.com>
>>>>> Cc: amd-gfx at lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan at amd.com>;
>>>>> linux-kernel at vger.kernel.org; dri-devel at lists.freedesktop.org
>>>>> Subject: Re: [PATCH] drm/amdgpu: add mb for si
>>>>>
>>>>>
>>>>> On 2022/11/18 17:18, Michel Dänzer wrote:
>>>>>> On 11/18/22 09:01, Christian König wrote:
>>>>>>> On 18.11.22 at 08:48, Zhenneng Li wrote:
>>>>>>>> During reboot tests on an arm64 platform, boot may fail, so
>>>>>>>> add this mb() in the SMC code.
>>>>>>>>
>>>>>>>> The error messages are as follows:
>>>>>>>> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block <si_dpm> failed -22
>>>>>>>> [    7.006919][ 7] [  T295] amdgpu 0000:04:00.0:
>>> The issue is happening in late_init() which eventually does
>>>
>>>       ret = si_thermal_enable_alert(adev, false);
>>>
>>> Just before this, si_thermal_start_thermal_controller is called in
>>> hw_init and that enables thermal alert.
>>>
>>> Maybe the issue is with enable/disable of thermal alerts in quick
>>> succession. Adding a delay inside si_thermal_start_thermal_controller
>>> might help.
>>>
>> On a second look, temperature range is already set as part of
>> si_thermal_start_thermal_controller in hw_init
>> https://elixir.bootlin.com/linux/v6.1-rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L6780
>>
>> There is no need to set it again here -
>>
>> https://elixir.bootlin.com/linux/v6.1-rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L7635
>>
>> I think it is safe to remove the call from late_init altogether. Alex/Evan?
>>
> [Quan, Evan] Yes, it makes sense to me. But I'm not sure whether that's related to the issue here.
> As I understand it, if the issue were caused by enabling thermal_alert twice, it would fail every time.
> That cannot explain why adding some delay or an mb() call helps.

The only side effect of the patch is some random delay introduced for
every SMC message.
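
To make that concrete, here is a minimal user-space sketch (not driver
code). It assumes hypothetical bit positions for RST_REG and CK_DISABLE and
uses C11 atomic_thread_fence() as a stand-in for the kernel's mb();
smc_running_from_values() is an illustrative name only. Once rst and clk
have been read into locals, a barrier placed after the reads cannot change
the result of the check, so all it can add is time:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen for illustration only. */
#define RST_REG    (1u << 0)
#define CK_DISABLE (1u << 24)

/*
 * Mirrors the shape of the check in the patch: rst and clk are values
 * that have ALREADY been read. The optional fence is a user-space
 * analogue of mb(); it orders memory accesses but cannot alter the
 * contents of the two locals.
 */
static bool smc_running_from_values(uint32_t rst, uint32_t clk, bool with_fence)
{
	if (with_fence)
		atomic_thread_fence(memory_order_seq_cst);

	return !(rst & RST_REG) && !(clk & CK_DISABLE);
}
```

For any pair of inputs the function returns the same answer with and
without the fence, which is why the barrier looks like a timing change
rather than a fix.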

The issue happens in late_init(). Between dpm enablement and
late_init(), many SMC messages are sent without hitting this issue,
so I don't think the problem is that the FW is not running.

Thus the only case I see is enable/disable of the thermal alert in quick
succession.
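
As a rough illustration of that sequence, the sketch below models only the
alert state. It assumes late_init disables the alert, programs the range
again and re-enables it; struct thermal_model and the model_* helpers are
made up for this example and are not amdgpu functions.

```c
#include <stdbool.h>

/* Simplified, hypothetical model of the thermal alert state. */
struct thermal_model {
	bool alert_enabled;	/* current alert state */
	int transitions;	/* enable/disable changes seen so far */
};

static void model_enable_alert(struct thermal_model *t, bool enable)
{
	if (t->alert_enabled != enable) {
		t->alert_enabled = enable;
		t->transitions++;
	}
}

/* hw_init path: starting the thermal controller leaves the alert on. */
static void model_hw_init(struct thermal_model *t)
{
	model_enable_alert(t, true);
}

/* late_init as assumed today: disable, set the range again, enable. */
static void model_late_init_current(struct thermal_model *t)
{
	model_enable_alert(t, false);
	/* the temperature range would be programmed a second time here */
	model_enable_alert(t, true);
}

/* late_init with the redundant call removed: no thermal work at all. */
static void model_late_init_proposed(struct thermal_model *t)
{
	(void)t;
}
```

In this model the current flow toggles the alert three times between
hw_init and the end of late_init, while the proposed flow toggles it once
and ends in the same enabled state.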

Thanks,

Lijo

> BR
> Evan
>> Thanks,
>> Lijo
>>
>>> Thanks,
>>> Lijo
>>>
>>>>>>>> amdgpu_device_ip_late_init failed
>>>>>>>> [    7.014224][ 7] [  T295] amdgpu 0000:04:00.0: Fatal error during GPU init
>>>>>>> Memory barriers are not supposed to be sprinkled around like this;
>>>>>>> you need to give a detailed explanation of why this is necessary.
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Signed-off-by: Zhenneng Li <lizhenneng at kylinos.cn>
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
>>>>>>>>      1 file changed, 2 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>>>>> index 8f994ffa9cd1..c7656f22278d 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct amdgpu_device *adev)
>>>>>>>>          u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
>>>>>>>>          u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
>>>>>>>> +        mb();
>>>>>>>> +
>>>>>>>>          if (!(rst & RST_REG) && !(clk & CK_DISABLE))
>>>>>>>>              return true;
>>>>>> In particular, it makes no sense in this specific place, since it
>>>>>> cannot directly affect the values of rst & clk.
>>>>>
>>>>> I think so too.
>>>>>
>>>>> But when I run the reboot test on nine desktop machines, one or two of
>>>>> them may report this error after hundreds or thousands of reboots. At
>>>>> first I used msleep() instead of mb(); both methods work, but I don't
>>>>> know what the root cause is.
>>>>>
>>>>> When I use this method on another vendor's Oland card, the error message
>>>>> is reported again.
>>>>>
>>>>> What could be the root cause?
>>>>>
>>>>> Test environment:
>>>>>
>>>>> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87
>>>>>
>>>>> driver: amdgpu
>>>>>
>>>>> os: Ubuntu 20.04
>>>>>
>>>>> platform: arm64
>>>>>
>>>>> kernel: 5.4.18
>>>>>