[PATCH] drm/amdgpu: disable cstate properly for driver reloading scenario

Deucher, Alexander Alexander.Deucher at amd.com
Mon Mar 6 17:57:39 UTC 2023


[AMD Official Use Only - General]

> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar at amd.com>
> Sent: Monday, March 6, 2023 12:42 AM
> To: Deucher, Alexander <Alexander.Deucher at amd.com>; Quan, Evan
> <Evan.Quan at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver
> reloading scenario
> 
> 
> 
> On 3/2/2023 9:05 PM, Lazar, Lijo wrote:
> >
> >
> > On 3/2/2023 8:56 PM, Deucher, Alexander wrote:
> >> [AMD Official Use Only - General]
> >>
> >>> -----Original Message-----
> >>> From: Quan, Evan <Evan.Quan at amd.com>
> >>> Sent: Thursday, March 2, 2023 4:31 AM
> >>> To: Lazar, Lijo <Lijo.Lazar at amd.com>; amd-gfx at lists.freedesktop.org;
> >>> Deucher, Alexander <Alexander.Deucher at amd.com>
> >>> Subject: RE: [PATCH] drm/amdgpu: disable cstate properly for driver
> >>> reloading scenario
> >>>
> >>> [AMD Official Use Only - General]
> >>>
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
> >>>> Sent: Thursday, March 2, 2023 5:21 PM
> >>>> To: Quan, Evan <Evan.Quan at amd.com>; amd-
> gfx at lists.freedesktop.org
> >>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>
> >>>> Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver
> >>>> reloading scenario
> >>>>
> >>>>
> >>>>
> >>>> On 3/2/2023 2:43 PM, Quan, Evan wrote:
> >>>>> [AMD Official Use Only - General]
> >>>>>
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
> >>>>>> Sent: Thursday, March 2, 2023 4:28 PM
> >>>>>> To: Quan, Evan <Evan.Quan at amd.com>; amd-
> >>> gfx at lists.freedesktop.org
> >>>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>
> >>>>>> Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for
> >>>>>> driver reloading scenario
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 3/2/2023 12:28 PM, Evan Quan wrote:
> >>>>>>> Gpu reset might be needed during driver reloading. To guard
> >>>>>>> that(gpu
> >>>>>>> reset) work, df cstate needs to be disabled properly.
> >>>>>>>
> >>>>>>> Signed-off-by: Evan Quan <evan.quan at amd.com>
> >>>>>>> Change-Id: I5c074c265c0b08a67b6934ae1ad9aa3fed245461
> >>>>>>> ---
> >>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9
> +++++++++
> >>>>>>>     1 file changed, 9 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>>>> index 51bbeaa1f311..3c854461ef32 100644
> >>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>>>> @@ -2816,6 +2816,15 @@ static int
> >>>>>>> amdgpu_device_ip_fini_early(struct
> >>>>>> amdgpu_device *adev)
> >>>>>>>         amdgpu_device_set_pg_state(adev,
> >>> AMD_PG_STATE_UNGATE);
> >>>>>>>         amdgpu_device_set_cg_state(adev,
> >>> AMD_CG_STATE_UNGATE);
> >>>>>>>
> >>>>>>> +    /*
> >>>>>>> +     * Get df cstate disabled properly on driver unloading.
> >>>>>>> +     * Since on the succeeding driver reloading, gpu reset
> >>>>>>> +might
> >>>>>>> +     * be required. And cstate disabled is a prerequisite for
> >>>>>>> +     * that(gpu reset).
> >>>>>>> +     */
> >>>>>>> +    if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_DISALLOW))
> >>>>>>> +        dev_warn(adev->dev, "Failed to disallow df cstate");
> >>>>>>> +
> >>>>>>
> >>>>>> This looks more like a firmware bug. Driver sends the Unload
> >>>>>> message to
> >>>> FW.
> >>>>>> In that case FW should disable all features including C-state.
> >>>>> Driver does not send the Unload message. We want PMFM alive and
> >>>>> ready
> >>>> for handling possible gpu reset on reloading.
> >>>>>
> >>>>
> >>>> Actually, soc21_need_reset_on_init code itself has a bug. PSP won't
> >>>> get unloaded by default on ring destruction. Even if PSP stops, it
> >>>> could just keep the heartbeat value as non-zero (just that it won't
> >>> increment).
> >>>>
> >>>> Probably, that needs to be fixed first rather than keeping PMFW
> >>>> alive for a reset.
> >>> As I remembered, the change(asic reset during reloading) seemed
> >>> introduced to address some sriov issues.
> >>> @Deucher, Alexander might share more backgrounds about this.
> >>> To be honest, I'm not a fan of this(perform asic reset during
> >>> reloading).
> >>
> >> I'm open to doing it a better way.  We did it for two reasons:
> >> 1. often times the device was left in a weird state after the driver
> >> unload/VM killed. Etc.  We needed a way to put the device into a
> >> known good state so the driver could re-initialize it.  Plus, IIRC,
> >> on some of the older ASICS, once the SMU or PSP firmware was loaded,
> >> there was no way to reload it without a reset so you needed one
> >> anyway.  This is largely why we have to reset for S4 as well.
> >> 2. Some large servers didn't power off PCI devices on reboots to save
> >> time.  This left the devices with whatever state they had before the
> >> system was rebooted which led to driver initialization problems on
> >> subsequent boots because the device was in an unknown state.
> >>
> >> If there is a better way to handle these situations, I'm all for it.
> 
> Hi Alex,
> 
> There is a part of FW running to handle generic reset requests on SOC21
> SOCs. Can we remove the reset-on-reload for soc21 SOCs, unload PMFW as
> usual and see how it goes?

Sure.

Alex

> 
> Thanks,
> Lijo
> 
> >>
> >
> > Are those cases valid still? We have this for GFX9 - reset done only
> > for pass through. And some of the GFX9 ASICs are used in large servers.
> >
> >          /* Just return false for soc15 GPUs.  Reset does not seem to
> >           * be necessary.
> >           */
> >          if (!amdgpu_passthrough(adev))
> >                  return false;
> >
> > Thanks,
> > Lijo
> >
> >> Alex
> >>
> >>
> >>>
> >>> Evan
> >>>>
> >>>> Thanks,
> >>>> Lijo
> >>>>
> >>>>> BR
> >>>>> Evan
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Lijo
> >>>>>>
> >>>>>>>         amdgpu_amdkfd_suspend(adev, false);
> >>>>>>>
> >>>>>>>         /* Workaroud for ASICs need to disable SMC first */


More information about the amd-gfx mailing list