[PATCH v3 1/7] drm/amdgpu: update XGMI physical node id and GMC configs on resume

Wed May 7 12:49:18 UTC 2025

On 2025/5/7 20:21, Christian König wrote:
> On 5/7/25 13:03, Sam wrote:
>> On 2025/5/7 18:03, Lazar, Lijo wrote:
>>> On 5/7/2025 11:52 AM, Zhang, GuoQing (Sam) wrote:
>>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>>
>>>>
>>>>   
>>>>> Please keep in mind that this is not the only scenario addressed by the
>>>>> driver - for ex: a resume sequence is executed after a device reset.
>>>>> This patch itself introduces unwanted sequences for other commonly used
>>>>> usecases. Please rework on the series without breaking existing usecases.
>>>>> Thanks,
>>>>> Lijo
>>>>   
>>>> Hi @Lazar, Lijo<mailto:Lijo.Lazar at amd.com>, Thank you for the feedback.
>>>>
>>>>   
>>>> I also think the new code should be inside a check so that new code is
>>>> executed only on resume with different VF and do not break existing
>>>> usecases. Following is the implementation of this approach I can think of.
>>>>
>>>> - introduce new field `prev_physical_node_id ` in `struct amdgpu_xgmi `.
>>>> update the fields on resume.
>>>>
>>>> - put new code inside code block `if (prev_physical_node_id  !=
>>>> physical_node_id )`
>>>>
>>>>
>>> Can this happen only with XGMI under this condition? Any other method
>>> possible like preparing a 'unique signature' and matching it to identify
>>> if it resumed on an identically configured system?
>> Yes, this hibernate-resume with different VF feature is only for devices with XGMI. Detecting XGMI node id change is the only way I can think of to identify the case. It's also a very simple way.
>>
>> @Koenig, Christian <mailto:Christian.Koenig at amd.com> Are you OK with this approach, adding a check for the new code sequence?
>
> Well you still need to avoid calling gmc_v9_0_mc_init() since that is most likely incorrect.
Yes, I will change it to

     if (amdgpu_xmgi_is_node_changed(adev))
         gmc_v9_0_vram_gtt_location(adev, &adev->gmc);

And remove the change of `refresh`.

Regards
Sam

>
> Regards,
> Christian.
>
>>> Regardless, instead of having a direct check, better to wrap it inside
>>> something like
>>>      if (amdgpu_virt_need_migration()) or something more appropriate.
>> Yes, I will do that. Thank you!
>>
>> Regards
>> Sam
>>
>>>    Thanks,
>>> Lijo
>>>
>>>> Is this approach acceptable? If not, can you suggest a better approach?
>>>> @Lazar, Lijo<mailto:Lijo.Lazar at amd.com> @Koenig, Christian
>>>> <mailto:Christian.Koenig at amd.com> Thank you!
>>>>
>>>>   
>>>> Regards
>>>>
>>>> Sam
>>>>
>>>>   
>>>> *From: *Lazar, Lijo<Lijo.Lazar at amd.com>
>>>> *Date: *Tuesday, May 6, 2025 at 19:55
>>>> *To: *Zhang, GuoQing (Sam)<GuoQing.Zhang at amd.com>, amd-
>>>> gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
>>>> *Cc: *Zhao, Victor<Victor.Zhao at amd.com>, Chang, HaiJun
>>>> <HaiJun.Chang at amd.com>, Koenig, Christian<Christian.Koenig at amd.com>,
>>>> Deucher, Alexander<Alexander.Deucher at amd.com>, Zhang, Owen(SRDC)
>>>> <Owen.Zhang2 at amd.com>, Ma, Qing (Mark)<Qing.Ma at amd.com>, Jiang Liu
>>>> <gerry at linux.alibaba.com>
>>>> *Subject: *Re: [PATCH v3 1/7] drm/amdgpu: update XGMI physical node id
>>>> and GMC configs on resume
>>>>
>>>>
>>>>
>>>> On 5/6/2025 3:06 PM, Samuel Zhang wrote:
>>>>> For virtual machine with vGPUs in SRIOV single device mode and XGMI
>>>>> is enabled, XGMI physical node ids may change when waking up from
>>>>> hiberation with different vGPU devices. So update XGMI physical node
>>>>> ids on resume.
>>>>>
>>>> Please keep in mind that this is not the only scenario addressed by the
>>>> driver - for ex: a resume sequence is executed after a device reset.
>>>> This patch itself introduces unwanted sequences for other commonly used
>>>> usecases. Please rework on the series without breaking existing usecases.
>>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> Update GPU memory controller configuration on resume if XGMI physical
>>>>> node ids are changed.
>>>>>
>>>>> Signed-off-by: Jiang Liu<gerry at linux.alibaba.com>
>>>>> Signed-off-by: Samuel Zhang<guoqing.zhang at amd.com>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 ++++++++++++++++++++++
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c    |  3 +--
>>>>>     drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  4 ++++
>>>>>     3 files changed, 29 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/
>>>> drm/amd/amdgpu/amdgpu_device.c
>>>>> index d477a901af84..e795af5067e5 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -5040,6 +5040,27 @@ int amdgpu_device_suspend(struct drm_device
>>>> *dev, bool notify_clients)
>>>>>           return 0;
>>>>>     }
>>>>>    +static int amdgpu_device_update_xgmi_info(struct amdgpu_device *adev)
>>>>> +{
>>>>> +     int r;
>>>>> +     unsigned int prev_physical_node_id;
>>>>> +
>>>>> +     /* Get xgmi info again for sriov to detect device changes */
>>>>> +     if (amdgpu_sriov_vf(adev) &&
>>>>> +         !(adev->flags & AMD_IS_APU) &&
>>>>> +         adev->gmc.xgmi.supported &&
>>>>> +         !adev->gmc.xgmi.connected_to_cpu) {
>>>>> +             prev_physical_node_id = adev->gmc.xgmi.physical_node_id;
>>>>> +             r = adev->gfxhub.funcs->get_xgmi_info(adev);
>>>>> +             if (r)
>>>>> +                     return r;
>>>>> +
>>>>> +             dev_info(adev->dev, "xgmi node, old id %d, new id %d\n",
>>>>> +                     prev_physical_node_id, adev-
>>>>> gmc.xgmi.physical_node_id);
>>>>> +     }
>>>>> +     return 0;
>>>>> +}
>>>>> +
>>>>>     /**
>>>>>      * amdgpu_device_resume - initiate device resume
>>>>>      *
>>>>> @@ -5059,6 +5080,9 @@ int amdgpu_device_resume(struct drm_device *dev,
>>>> bool notify_clients)
>>>>>                   r = amdgpu_virt_request_full_gpu(adev, true);
>>>>>                   if (r)
>>>>>                           return r;
>>>>> +             r = amdgpu_device_update_xgmi_info(adev);
>>>>> +             if (r)
>>>>> +                     return r;
>>>>>           }
>>>>>             if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/
>>>> drm/amd/amdgpu/amdgpu_gmc.c
>>>>> index d1fa5e8e3937..a2abddf3c110 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>>> @@ -1298,8 +1298,7 @@ int amdgpu_gmc_get_nps_memranges(struct
>>>> amdgpu_device *adev,
>>>>>           if (!mem_ranges || !exp_ranges)
>>>>>                   return -EINVAL;
>>>>>    -     refresh = (adev->init_lvl->level !=
>>>> AMDGPU_INIT_LEVEL_MINIMAL_XGMI) &&
>>>>> -               (adev->gmc.reset_flags & AMDGPU_GMC_INIT_RESET_NPS);
>>>>> +     refresh = true;
>>>>>           ret = amdgpu_discovery_get_nps_info(adev, &nps_type, &ranges,
>>>>>                                               &range_cnt, refresh);
>>>>>    diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/
>>>> amd/amdgpu/gmc_v9_0.c
>>>>> index 59385da80185..1eb451a3743b 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>>>> @@ -2533,6 +2533,10 @@ static int gmc_v9_0_resume(struct
>>>> amdgpu_ip_block *ip_block)
>>>>>           struct amdgpu_device *adev = ip_block->adev;
>>>>>           int r;
>>>>>    +     r = gmc_v9_0_mc_init(adev);
>>>>> +     if (r)
>>>>> +             return r;
>>>>> +
>>>>>           /* If a reset is done for NPS mode switch, read the memory range
>>>>>            * information again.
>>>>>            */