[PATCH] drm/amdgpu: disable cstate properly for driver reloading scenario

Mon Mar 6 05:42:27 UTC 2023

On 3/2/2023 9:05 PM, Lazar, Lijo wrote:
> 
> 
> On 3/2/2023 8:56 PM, Deucher, Alexander wrote:
>> [AMD Official Use Only - General]
>>
>>> -----Original Message-----
>>> From: Quan, Evan <Evan.Quan at amd.com>
>>> Sent: Thursday, March 2, 2023 4:31 AM
>>> To: Lazar, Lijo <Lijo.Lazar at amd.com>; amd-gfx at lists.freedesktop.org;
>>> Deucher, Alexander <Alexander.Deucher at amd.com>
>>> Subject: RE: [PATCH] drm/amdgpu: disable cstate properly for driver
>>> reloading scenario
>>>
>>> [AMD Official Use Only - General]
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>>>> Sent: Thursday, March 2, 2023 5:21 PM
>>>> To: Quan, Evan <Evan.Quan at amd.com>; amd-gfx at lists.freedesktop.org
>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>
>>>> Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver
>>>> reloading scenario
>>>>
>>>>
>>>>
>>>> On 3/2/2023 2:43 PM, Quan, Evan wrote:
>>>>> [AMD Official Use Only - General]
>>>>>
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>>>>>> Sent: Thursday, March 2, 2023 4:28 PM
>>>>>> To: Quan, Evan <Evan.Quan at amd.com>; amd-
>>> gfx at lists.freedesktop.org
>>>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>
>>>>>> Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver
>>>>>> reloading scenario
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 3/2/2023 12:28 PM, Evan Quan wrote:
>>>>>>> Gpu reset might be needed during driver reloading. To guard
>>>>>>> that(gpu
>>>>>>> reset) work, df cstate needs to be disabled properly.
>>>>>>>
>>>>>>> Signed-off-by: Evan Quan <evan.quan at amd.com>
>>>>>>> Change-Id: I5c074c265c0b08a67b6934ae1ad9aa3fed245461
>>>>>>> ---
>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++++
>>>>>>>     1 file changed, 9 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index 51bbeaa1f311..3c854461ef32 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -2816,6 +2816,15 @@ static int
>>>>>>> amdgpu_device_ip_fini_early(struct
>>>>>> amdgpu_device *adev)
>>>>>>>         amdgpu_device_set_pg_state(adev,
>>> AMD_PG_STATE_UNGATE);
>>>>>>>         amdgpu_device_set_cg_state(adev,
>>> AMD_CG_STATE_UNGATE);
>>>>>>>
>>>>>>> +    /*
>>>>>>> +     * Get df cstate disabled properly on driver unloading.
>>>>>>> +     * Since on the succeeding driver reloading, gpu reset might
>>>>>>> +     * be required. And cstate disabled is a prerequisite for
>>>>>>> +     * that(gpu reset).
>>>>>>> +     */
>>>>>>> +    if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_DISALLOW))
>>>>>>> +        dev_warn(adev->dev, "Failed to disallow df cstate");
>>>>>>> +
>>>>>>
>>>>>> This looks more like a firmware bug. Driver sends the Unload
>>>>>> message to
>>>> FW.
>>>>>> In that case FW should disable all features including C-state.
>>>>> Driver does not send the Unload message. We want PMFM alive and
>>>>> ready
>>>> for handling possible gpu reset on reloading.
>>>>>
>>>>
>>>> Actually, soc21_need_reset_on_init code itself has a bug. PSP won't
>>>> get unloaded by default on ring destruction. Even if PSP stops, it
>>>> could just keep the heartbeat value as non-zero (just that it won't
>>> increment).
>>>>
>>>> Probably, that needs to be fixed first rather than keeping PMFW alive
>>>> for a reset.
>>> As I remembered, the change(asic reset during reloading) seemed
>>> introduced to address some sriov issues.
>>> @Deucher, Alexander might share more backgrounds about this.
>>> To be honest, I'm not a fan of this(perform asic reset during 
>>> reloading).
>>
>> I'm open to doing it a better way.  We did it for two reasons:
>> 1. often times the device was left in a weird state after the driver 
>> unload/VM killed. Etc.  We needed a way to put the device into a known 
>> good state so the driver could re-initialize it.  Plus, IIRC, on some 
>> of the older ASICS, once the SMU or PSP firmware was loaded, there was 
>> no way to reload it without a reset so you needed one anyway.  This is 
>> largely why we have to reset for S4 as well.
>> 2. Some large servers didn't power off PCI devices on reboots to save 
>> time.  This left the devices with whatever state they had before the 
>> system was rebooted which led to driver initialization problems on 
>> subsequent boots because the device was in an unknown state.
>>
>> If there is a better way to handle these situations, I'm all for it.

Hi Alex,

There is a part of FW running to handle generic reset requests on SOC21 
SOCs. Can we remove the reset-on-reload for soc21 SOCs, unload PMFW as 
usual and see how it goes?

Thanks,
Lijo

>>
> 
> Are those cases valid still? We have this for GFX9 - reset done only for 
> pass through. And some of the GFX9 ASICs are used in large servers.
> 
>          /* Just return false for soc15 GPUs.  Reset does not seem to
>           * be necessary.
>           */
>          if (!amdgpu_passthrough(adev))
>                  return false;
> 
> Thanks,
> Lijo
> 
>> Alex
>>
>>
>>>
>>> Evan
>>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> BR
>>>>> Evan
>>>>>>
>>>>>> Thanks,
>>>>>> Lijo
>>>>>>
>>>>>>>         amdgpu_amdkfd_suspend(adev, false);
>>>>>>>
>>>>>>>         /* Workaroud for ASICs need to disable SMC first */