[PATCH] drm/amdgpu: add mb for si

Thu Nov 24 10:41:22 UTC 2022

On 11/24/2022 3:34 PM, Quan, Evan wrote:
> Could the attached patch help?
> 
> Evan
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of ???
>> Sent: Friday, November 18, 2022 5:25 PM
>> To: Michel Dänzer <michel.daenzer at mailbox.org>; Koenig, Christian
>> <Christian.Koenig at amd.com>; Deucher, Alexander
>> <Alexander.Deucher at amd.com>
>> Cc: amd-gfx at lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan at amd.com>;
>> linux-kernel at vger.kernel.org; dri-devel at lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: add mb for si
>>
>>
>> On 2022/11/18 17:18, Michel Dänzer wrote:
>>> On 11/18/22 09:01, Christian König wrote:
>>>> On 18.11.22 at 08:48, Zhenneng Li wrote:
>>>>> During reboot tests on an arm64 platform, boot may fail, so add
>>>>> this mb in smc.
>>>>>
>>>>> The error messages are as follows:
>>>>> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block <si_dpm> failed -22
>>>>> [    7.006919][ 7] [  T295] amdgpu 0000:04:00.0:

The issue is happening in late_init() which eventually does

	ret = si_thermal_enable_alert(adev, false);

Just before this, si_thermal_start_thermal_controller() is called in
hw_init, and that enables the thermal alert.

Maybe the issue is with enabling and disabling thermal alerts in quick
succession. Adding a delay inside si_thermal_start_thermal_controller()
might help.
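
To make the hypothesis concrete, here is a small userspace toy model. It is
only a sketch: the toy_* names, the simulated clock, and the 1000 us settle
time are all made up for illustration, and it is purely an assumption that
the hardware rejects an alert state change that follows the previous one too
quickly. It is not the driver code. In the kernel, the delay itself would be
something like msleep() or usleep_range().

```c
#include <stdbool.h>

/* Everything below is hypothetical; it only models the suspected ordering. */
#define TOY_EINVAL 22      /* mirrors the -22 seen in the log */
#define SETTLE_US  1000UL  /* arbitrary settle time, not a hardware value */

struct toy_dev {
	unsigned long now_us;         /* simulated clock */
	unsigned long last_change_us; /* time of the last alert state change */
	bool changed_once;            /* has the alert state been changed yet? */
	bool alert_enabled;
};

/* Advance the simulated clock; stands in for a real delay. */
static void toy_delay_us(struct toy_dev *dev, unsigned long us)
{
	dev->now_us += us;
}

/* ASSUMPTION: a change arriving before the settle time has passed fails. */
static int toy_thermal_enable_alert(struct toy_dev *dev, bool enable)
{
	if (dev->changed_once &&
	    dev->now_us - dev->last_change_us < SETTLE_US)
		return -TOY_EINVAL;

	dev->alert_enabled = enable;
	dev->last_change_us = dev->now_us;
	dev->changed_once = true;
	return 0;
}

/* Models the hw_init step: enable the alert, optionally wait afterwards. */
static int toy_start_thermal_controller(struct toy_dev *dev, bool settle)
{
	int ret = toy_thermal_enable_alert(dev, true);

	if (ret)
		return ret;
	if (settle)
		toy_delay_us(dev, SETTLE_US);
	return 0;
}

/* Models the late_init step: disable the alert again. */
static int toy_late_init(struct toy_dev *dev)
{
	return toy_thermal_enable_alert(dev, false);
}
```

In this model, calling toy_start_thermal_controller(&dev, false) followed
directly by toy_late_init(&dev) returns -22, while passing settle = true lets
late_init succeed. Whether the real hardware behaves this way is exactly what
the delay experiment would tell us.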

Thanks,
Lijo

>>>>> amdgpu_device_ip_late_init failed
>>>>> [    7.014224][ 7] [  T295] amdgpu 0000:04:00.0: Fatal error during GPU init
>>>> Memory barriers are not supposed to be sprinkled around like this; you
>>>> need to give a detailed explanation of why this is necessary.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Signed-off-by: Zhenneng Li <lizhenneng at kylinos.cn>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
>>>>>     1 file changed, 2 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>> index 8f994ffa9cd1..c7656f22278d 100644
>>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct
>>>>> amdgpu_device *adev)
>>>>>         u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
>>>>>         u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
>>>>>     +    mb();
>>>>> +
>>>>>         if (!(rst & RST_REG) && !(clk & CK_DISABLE))
>>>>>             return true;
>>> In particular, it makes no sense in this specific place, since it cannot
>>> directly affect the values of rst & clk.
>>
>> I think so too.
>>
>> But when I run the reboot test on nine desktop machines, one or two of
>> them may report this error after hundreds or thousands of reboots. At
>> first I used msleep() instead of mb(); both methods work, but I don't
>> know what the root cause is.
>>
>> When I use this method on another vendor's Oland card, this error message
>> is reported again.
>>
>> What could be the root cause?
>>
>> test environment:
>>
>> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87
>>
>> driver: amdgpu
>>
>> os: Ubuntu 20.04
>>
>> platform: arm64
>>
>> kernel: 5.4.18
>>
>>>