[PATCH v3 1/7] drm/amdgpu: update XGMI physical node id and GMC configs on resume
Christian König
christian.koenig at amd.com
Wed May 7 12:56:36 UTC 2025
On 5/7/25 14:49, Sam wrote:
>
> On 2025/5/7 20:21, Christian König wrote:
>> On 5/7/25 13:03, Sam wrote:
>>> On 2025/5/7 18:03, Lazar, Lijo wrote:
>>>> On 5/7/2025 11:52 AM, Zhang, GuoQing (Sam) wrote:
>>>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>>>
>>>>>
>>>>>
>>>>>> Please keep in mind that this is not the only scenario addressed by the
>>>>>> driver - for ex: a resume sequence is executed after a device reset.
>>>>>> This patch itself introduces unwanted sequences for other commonly used
>>>>>> usecases. Please rework on the series without breaking existing usecases.
>>>>>> Thanks,
>>>>>> Lijo
>>>>> Hi @Lazar, Lijo<mailto:Lijo.Lazar at amd.com>, Thank you for the feedback.
>>>>>
>>>>> I also think the new code should be inside a check so that new code is
>>>>> executed only on resume with different VF and do not break existing
>>>>> usecases. Following is the implementation of this approach I can think of.
>>>>>
>>>>> - introduce new field `prev_physical_node_id ` in `struct amdgpu_xgmi `.
>>>>> update the fields on resume.
>>>>>
>>>>> - put new code inside code block `if (prev_physical_node_id !=
>>>>> physical_node_id )`
>>>>>
>>>>>
>>>> Can this happen only with XGMI under this condition? Any other method
>>>> possible like preparing a 'unique signature' and matching it to identify
>>>> if it resumed on an identically configured system?
>>> Yes, this hibernate-resume with different VF feature is only for devices with XGMI. Detecting XGMI node id change is the only way I can think of to identify the case. It's also a very simple way.
>>>
>>> @Koenig, Christian <mailto:Christian.Koenig at amd.com> Are you OK with this approach, adding a check for the new code sequence?
>>
>> Well you still need to avoid calling gmc_v9_0_mc_init() since that is most likely incorrect.
> Yes, I will change it to
>
> if (amdgpu_xmgi_is_node_changed(adev))
> gmc_v9_0_vram_gtt_location(adev, &adev->gmc);
Even that is incorrect. The VRAM and GTT location can't change on resume.
What changes are the XGMI node ID and with it where inside the FB aperture our VRAM PDB0 should point to.
Regards,
Christian.
>
> And remove the change of `refresh`.
>
> Regards
> Sam
>
>
>>
>> Regards,
>> Christian.
>>
>>>> Regardless, instead of having a direct check, better to wrap it inside
>>>> something like
>>>> if (amdgpu_virt_need_migration()) or something more appropriate.
>>> Yes, I will do that. Thank you!
>>>
>>> Regards
>>> Sam
>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> Is this approach acceptable? If not, can you suggest a better approach?
>>>>> @Lazar, Lijo<mailto:Lijo.Lazar at amd.com> @Koenig, Christian
>>>>> <mailto:Christian.Koenig at amd.com> Thank you!
>>>>>
>>>>> Regards
>>>>>
>>>>> Sam
>>>>>
>>>>> *From: *Lazar, Lijo<Lijo.Lazar at amd.com>
>>>>> *Date: *Tuesday, May 6, 2025 at 19:55
>>>>> *To: *Zhang, GuoQing (Sam)<GuoQing.Zhang at amd.com>, amd-
>>>>> gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
>>>>> *Cc: *Zhao, Victor<Victor.Zhao at amd.com>, Chang, HaiJun
>>>>> <HaiJun.Chang at amd.com>, Koenig, Christian<Christian.Koenig at amd.com>,
>>>>> Deucher, Alexander<Alexander.Deucher at amd.com>, Zhang, Owen(SRDC)
>>>>> <Owen.Zhang2 at amd.com>, Ma, Qing (Mark)<Qing.Ma at amd.com>, Jiang Liu
>>>>> <gerry at linux.alibaba.com>
>>>>> *Subject: *Re: [PATCH v3 1/7] drm/amdgpu: update XGMI physical node id
>>>>> and GMC configs on resume
>>>>>
>>>>>
>>>>>
>>>>> On 5/6/2025 3:06 PM, Samuel Zhang wrote:
>>>>>> For virtual machine with vGPUs in SRIOV single device mode and XGMI
>>>>>> is enabled, XGMI physical node ids may change when waking up from
>>>>>> hiberation with different vGPU devices. So update XGMI physical node
>>>>>> ids on resume.
>>>>>>
>>>>> Please keep in mind that this is not the only scenario addressed by the
>>>>> driver - for ex: a resume sequence is executed after a device reset.
>>>>> This patch itself introduces unwanted sequences for other commonly used
>>>>> usecases. Please rework on the series without breaking existing usecases.
>>>>>
>>>>> Thanks,
>>>>> Lijo
>>>>>
>>>>>> Update GPU memory controller configuration on resume if XGMI physical
>>>>>> node ids are changed.
>>>>>>
>>>>>> Signed-off-by: Jiang Liu<gerry at linux.alibaba.com>
>>>>>> Signed-off-by: Samuel Zhang<guoqing.zhang at amd.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 ++++++++++++++++++++++
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 3 +--
>>>>>> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 4 ++++
>>>>>> 3 files changed, 29 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/
>>>>> drm/amd/amdgpu/amdgpu_device.c
>>>>>> index d477a901af84..e795af5067e5 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -5040,6 +5040,27 @@ int amdgpu_device_suspend(struct drm_device
>>>>> *dev, bool notify_clients)
>>>>>> return 0;
>>>>>> }
>>>>>> +static int amdgpu_device_update_xgmi_info(struct amdgpu_device *adev)
>>>>>> +{
>>>>>> + int r;
>>>>>> + unsigned int prev_physical_node_id;
>>>>>> +
>>>>>> + /* Get xgmi info again for sriov to detect device changes */
>>>>>> + if (amdgpu_sriov_vf(adev) &&
>>>>>> + !(adev->flags & AMD_IS_APU) &&
>>>>>> + adev->gmc.xgmi.supported &&
>>>>>> + !adev->gmc.xgmi.connected_to_cpu) {
>>>>>> + prev_physical_node_id = adev->gmc.xgmi.physical_node_id;
>>>>>> + r = adev->gfxhub.funcs->get_xgmi_info(adev);
>>>>>> + if (r)
>>>>>> + return r;
>>>>>> +
>>>>>> + dev_info(adev->dev, "xgmi node, old id %d, new id %d\n",
>>>>>> + prev_physical_node_id, adev-
>>>>>> gmc.xgmi.physical_node_id);
>>>>>> + }
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> /**
>>>>>> * amdgpu_device_resume - initiate device resume
>>>>>> *
>>>>>> @@ -5059,6 +5080,9 @@ int amdgpu_device_resume(struct drm_device *dev,
>>>>> bool notify_clients)
>>>>>> r = amdgpu_virt_request_full_gpu(adev, true);
>>>>>> if (r)
>>>>>> return r;
>>>>>> + r = amdgpu_device_update_xgmi_info(adev);
>>>>>> + if (r)
>>>>>> + return r;
>>>>>> }
>>>>>> if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/
>>>>> drm/amd/amdgpu/amdgpu_gmc.c
>>>>>> index d1fa5e8e3937..a2abddf3c110 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>>>> @@ -1298,8 +1298,7 @@ int amdgpu_gmc_get_nps_memranges(struct
>>>>> amdgpu_device *adev,
>>>>>> if (!mem_ranges || !exp_ranges)
>>>>>> return -EINVAL;
>>>>>> - refresh = (adev->init_lvl->level !=
>>>>> AMDGPU_INIT_LEVEL_MINIMAL_XGMI) &&
>>>>>> - (adev->gmc.reset_flags & AMDGPU_GMC_INIT_RESET_NPS);
>>>>>> + refresh = true;
>>>>>> ret = amdgpu_discovery_get_nps_info(adev, &nps_type, &ranges,
>>>>>> &range_cnt, refresh);
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/
>>>>> amd/amdgpu/gmc_v9_0.c
>>>>>> index 59385da80185..1eb451a3743b 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>>>>> @@ -2533,6 +2533,10 @@ static int gmc_v9_0_resume(struct
>>>>> amdgpu_ip_block *ip_block)
>>>>>> struct amdgpu_device *adev = ip_block->adev;
>>>>>> int r;
>>>>>> + r = gmc_v9_0_mc_init(adev);
>>>>>> + if (r)
>>>>>> + return r;
>>>>>> +
>>>>>> /* If a reset is done for NPS mode switch, read the memory range
>>>>>> * information again.
>>>>>> */
More information about the amd-gfx
mailing list