[PATCH] drm/amdgpu: disable cstate properly for driver reloading scenario

Lazar, Lijo lijo.lazar at amd.com
Thu Mar 2 15:35:54 UTC 2023



On 3/2/2023 8:56 PM, Deucher, Alexander wrote:
> [AMD Official Use Only - General]
> 
>> -----Original Message-----
>> From: Quan, Evan <Evan.Quan at amd.com>
>> Sent: Thursday, March 2, 2023 4:31 AM
>> To: Lazar, Lijo <Lijo.Lazar at amd.com>; amd-gfx at lists.freedesktop.org;
>> Deucher, Alexander <Alexander.Deucher at amd.com>
>> Subject: RE: [PATCH] drm/amdgpu: disable cstate properly for driver
>> reloading scenario
>>
>> [AMD Official Use Only - General]
>>
>>
>>
>>> -----Original Message-----
>>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>>> Sent: Thursday, March 2, 2023 5:21 PM
>>> To: Quan, Evan <Evan.Quan at amd.com>; amd-gfx at lists.freedesktop.org
>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver
>>> reloading scenario
>>>
>>>
>>>
>>> On 3/2/2023 2:43 PM, Quan, Evan wrote:
>>>> [AMD Official Use Only - General]
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>>>>> Sent: Thursday, March 2, 2023 4:28 PM
>>>>> To: Quan, Evan <Evan.Quan at amd.com>; amd-
>> gfx at lists.freedesktop.org
>>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>
>>>>> Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver
>>>>> reloading scenario
>>>>>
>>>>>
>>>>>
>>>>> On 3/2/2023 12:28 PM, Evan Quan wrote:
>>>>>> Gpu reset might be needed during driver reloading. To guard
>>>>>> that(gpu
>>>>>> reset) work, df cstate needs to be disabled properly.
>>>>>>
>>>>>> Signed-off-by: Evan Quan <evan.quan at amd.com>
>>>>>> Change-Id: I5c074c265c0b08a67b6934ae1ad9aa3fed245461
>>>>>> ---
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++++
>>>>>>     1 file changed, 9 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> index 51bbeaa1f311..3c854461ef32 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -2816,6 +2816,15 @@ static int
>>>>>> amdgpu_device_ip_fini_early(struct
>>>>> amdgpu_device *adev)
>>>>>>     	amdgpu_device_set_pg_state(adev,
>> AMD_PG_STATE_UNGATE);
>>>>>>     	amdgpu_device_set_cg_state(adev,
>> AMD_CG_STATE_UNGATE);
>>>>>>
>>>>>> +	/*
>>>>>> +	 * Get df cstate disabled properly on driver unloading.
>>>>>> +	 * Since on the succeeding driver reloading, gpu reset might
>>>>>> +	 * be required. And cstate disabled is a prerequisite for
>>>>>> +	 * that(gpu reset).
>>>>>> +	 */
>>>>>> +	if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_DISALLOW))
>>>>>> +		dev_warn(adev->dev, "Failed to disallow df cstate");
>>>>>> +
>>>>>
>>>>> This looks more like a firmware bug. Driver sends the Unload
>>>>> message to
>>> FW.
>>>>> In that case FW should disable all features including C-state.
>>>> Driver does not send the Unload message. We want PMFM alive and
>>>> ready
>>> for handling possible gpu reset on reloading.
>>>>
>>>
>>> Actually, soc21_need_reset_on_init code itself has a bug. PSP won't
>>> get unloaded by default on ring destruction. Even if PSP stops, it
>>> could just keep the heartbeat value as non-zero (just that it won't
>> increment).
>>>
>>> Probably, that needs to be fixed first rather than keeping PMFW alive
>>> for a reset.
>> As I remembered, the change(asic reset during reloading) seemed
>> introduced to address some sriov issues.
>> @Deucher, Alexander might share more backgrounds about this.
>> To be honest, I'm not a fan of this(perform asic reset during reloading).
> 
> I'm open to doing it a better way.  We did it for two reasons:
> 1. often times the device was left in a weird state after the driver unload/VM killed. Etc.  We needed a way to put the device into a known good state so the driver could re-initialize it.  Plus, IIRC, on some of the older ASICS, once the SMU or PSP firmware was loaded, there was no way to reload it without a reset so you needed one anyway.  This is largely why we have to reset for S4 as well.
> 2. Some large servers didn't power off PCI devices on reboots to save time.  This left the devices with whatever state they had before the system was rebooted which led to driver initialization problems on subsequent boots because the device was in an unknown state.
> 
> If there is a better way to handle these situations, I'm all for it.
> 

Are those cases valid still? We have this for GFX9 - reset done only for 
pass through. And some of the GFX9 ASICs are used in large servers.

         /* Just return false for soc15 GPUs.  Reset does not seem to
          * be necessary.
          */
         if (!amdgpu_passthrough(adev))
                 return false;

Thanks,
Lijo

> Alex
> 
> 
>>
>> Evan
>>>
>>> Thanks,
>>> Lijo
>>>
>>>> BR
>>>> Evan
>>>>>
>>>>> Thanks,
>>>>> Lijo
>>>>>
>>>>>>     	amdgpu_amdkfd_suspend(adev, false);
>>>>>>
>>>>>>     	/* Workaroud for ASICs need to disable SMC first */


More information about the amd-gfx mailing list