<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;
mso-ligatures:none;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="en-CN" link="blue" vlink="purple" style="word-wrap:break-word">
<div class="WordSection1">
<div>
<p class="MsoNormal"><br>
On 2025/5/7 20:56, Christian König wrote:<br>
> On 5/7/25 14:49, Sam wrote:<br>
>> On 2025/5/7 20:21, Christian König wrote:<br>
>>> On 5/7/25 13:03, Sam wrote:<br>
>>>> On 2025/5/7 18:03, Lazar, Lijo wrote:<br>
>>>>> On 5/7/2025 11:52 AM, Zhang, GuoQing (Sam) wrote:<br>
>>>>>> [AMD Official Use Only - AMD Internal Distribution Only]<br>
>>>>>><br>
>>>>>><br>
>>>>>> <br>
>>>>>>> Please keep in mind that this is not the only scenario addressed by the<br>
>>>>>>> driver - for ex: a resume sequence is executed after a device reset.<br>
>>>>>>> This patch itself introduces unwanted sequences for other commonly used<br>
>>>>>>> usecases. Please rework on the series without breaking existing usecases.<br>
>>>>>>> Thanks,<br>
>>>>>>> Lijo<br>
>>>>>> Hi @Lazar, Lijo<<a href="mailto:Lijo.Lazar@amd.com">mailto:Lijo.Lazar@amd.com</a>>, Thank you for the feedback.<br>
>>>>>><br>
>>>>>> I also think the new code should be inside a check so that new code is<br>
>>>>>> executed only on resume with different VF and do not break existing<br>
>>>>>> usecases. Following is the implementation of this approach I can think of.<br>
>>>>>><br>
>>>>>> - introduce new field `prev_physical_node_id ` in `struct amdgpu_xgmi `.<br>
>>>>>> update the fields on resume.<br>
>>>>>><br>
>>>>>> - put new code inside code block `if (prev_physical_node_id !=<br>
>>>>>> physical_node_id )`<br>
>>>>>><br>
>>>>>><br>
>>>>> Can this happen only with XGMI under this condition? Any other method<br>
>>>>> possible like preparing a 'unique signature' and matching it to identify<br>
>>>>> if it resumed on an identically configured system?<br>
>>>> Yes, this hibernate-resume with different VF feature is only for devices with XGMI. Detecting XGMI node id change is the only way I can think of to identify the case. It's also a very simple way.<br>
>>>><br>
>>>> @Koenig, Christian<<a href="mailto:Christian.Koenig@amd.com">mailto:Christian.Koenig@amd.com</a>> Are you OK with this approach, adding a check for the new code sequence?<br>
>>> Well you still need to avoid calling gmc_v9_0_mc_init() since that is most likely incorrect.<br>
>> Yes, I will change it to<br>
>><br>
>> if (amdgpu_xmgi_is_node_changed(adev))<br>
>> gmc_v9_0_vram_gtt_location(adev, &adev->gmc);<br>
> Even that is incorrect. The VRAM and GTT location can't change on resume.<br>
><br>
> What changes are the XGMI node ID and with it where inside the FB aperture our VRAM PDB0 should point to.<br>
<br>
<br>
2 updates in `gmc_v9_0_vram_gtt_location()` is needed:<br>
1. `vm_manager.vram_base_offset` is changed with new `xgmi.physical_node_id` at the end of `gmc_v9_0_vram_gtt_location()`.<br>
2. `fb_start` and `fb_end` got reset in `mmhub_v1_8_get_fb_location()` called by the new `amdgpu_bo_fb_aper_addr()`. It needs to be updated again.<br>
<br>
<o:p></o:p></p>
<p class="MsoNormal"><span lang="EN-US">For update 1, I can inline it in </span>`gmc_v9_0_resume()`<span lang="EN-US">.<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">For update 2, I can disable reset `fb_start/fb_end` in
</span>`mmhub_v1_8_get_fb_location()`<span lang="EN-US"> when pdb0 is enabled.<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">Is this OK?</span> @Koenig, Christian <<a href="mailto:Christian.Koenig@amd.com">mailto:Christian.Koenig@amd.com</a>><br>
<br>
Regards<br>
Sam<br>
<br>
<br>
> Regards,<br>
> Christian.<br>
><br>
>> And remove the change of `refresh`.<br>
>><br>
>> Regards<br>
>> Sam<br>
>><br>
>><br>
>>> Regards,<br>
>>> Christian.<br>
>>><br>
>>>>> Regardless, instead of having a direct check, better to wrap it inside<br>
>>>>> something like<br>
>>>>> if (amdgpu_virt_need_migration()) or something more appropriate.<br>
>>>> Yes, I will do that. Thank you!<br>
>>>><br>
>>>> Regards<br>
>>>> Sam<br>
>>>><br>
>>>>> Thanks,<br>
>>>>> Lijo<br>
>>>>><br>
>>>>>> Is this approach acceptable? If not, can you suggest a better approach?<br>
>>>>>> @Lazar, Lijo<<a href="mailto:Lijo.Lazar@amd.com">mailto:Lijo.Lazar@amd.com</a>> @Koenig, Christian<br>
>>>>>> <<a href="mailto:Christian.Koenig@amd.com">mailto:Christian.Koenig@amd.com</a>> Thank you!<br>
>>>>>><br>
>>>>>> Regards<br>
>>>>>><br>
>>>>>> Sam<br>
>>>>>><br>
>>>>>> *From: *Lazar, Lijo<Lijo.Lazar@amd.com><br>
>>>>>> *Date: *Tuesday, May 6, 2025 at 19:55<br>
>>>>>> *To: *Zhang, GuoQing (Sam)<GuoQing.Zhang@amd.com>, amd-<br>
>>>>>> gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org><br>
>>>>>> *Cc: *Zhao, Victor<Victor.Zhao@amd.com>, Chang, HaiJun<br>
>>>>>> <HaiJun.Chang@amd.com>, Koenig, Christian<Christian.Koenig@amd.com>,<br>
>>>>>> Deucher, Alexander<Alexander.Deucher@amd.com>, Zhang, Owen(SRDC)<br>
>>>>>> <Owen.Zhang2@amd.com>, Ma, Qing (Mark)<Qing.Ma@amd.com>, Jiang Liu<br>
>>>>>> <gerry@linux.alibaba.com><br>
>>>>>> *Subject: *Re: [PATCH v3 1/7] drm/amdgpu: update XGMI physical node id<br>
>>>>>> and GMC configs on resume<br>
>>>>>><br>
>>>>>><br>
>>>>>><br>
>>>>>> On 5/6/2025 3:06 PM, Samuel Zhang wrote:<br>
>>>>>>> For virtual machine with vGPUs in SRIOV single device mode and XGMI<br>
>>>>>>> is enabled, XGMI physical node ids may change when waking up from<br>
>>>>>>> hiberation with different vGPU devices. So update XGMI physical node<br>
>>>>>>> ids on resume.<br>
>>>>>>><br>
>>>>>> Please keep in mind that this is not the only scenario addressed by the<br>
>>>>>> driver - for ex: a resume sequence is executed after a device reset.<br>
>>>>>> This patch itself introduces unwanted sequences for other commonly used<br>
>>>>>> usecases. Please rework on the series without breaking existing usecases.<br>
>>>>>><br>
>>>>>> Thanks,<br>
>>>>>> Lijo<br>
>>>>>><br>
>>>>>>> Update GPU memory controller configuration on resume if XGMI physical<br>
>>>>>>> node ids are changed.<br>
>>>>>>><br>
>>>>>>> Signed-off-by: Jiang Liu<gerry@linux.alibaba.com><br>
>>>>>>> Signed-off-by: Samuel Zhang<guoqing.zhang@amd.com><br>
>>>>>>> ---<br>
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 ++++++++++++++++++++++<br>
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 3 +--<br>
>>>>>>> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 4 ++++<br>
>>>>>>> 3 files changed, 29 insertions(+), 2 deletions(-)<br>
>>>>>>><br>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/<br>
>>>>>> drm/amd/amdgpu/amdgpu_device.c<br>
>>>>>>> index d477a901af84..e795af5067e5 100644<br>
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
>>>>>>> @@ -5040,6 +5040,27 @@ int amdgpu_device_suspend(struct drm_device<br>
>>>>>> *dev, bool notify_clients)<br>
>>>>>>> return 0;<br>
>>>>>>> }<br>
>>>>>>> +static int amdgpu_device_update_xgmi_info(struct amdgpu_device *adev)<br>
>>>>>>> +{<br>
>>>>>>> + int r;<br>
>>>>>>> + unsigned int prev_physical_node_id;<br>
>>>>>>> +<br>
>>>>>>> + /* Get xgmi info again for sriov to detect device changes */<br>
>>>>>>> + if (amdgpu_sriov_vf(adev) &&<br>
>>>>>>> + !(adev->flags & AMD_IS_APU) &&<br>
>>>>>>> + adev->gmc.xgmi.supported &&<br>
>>>>>>> + !adev->gmc.xgmi.connected_to_cpu) {<br>
>>>>>>> + prev_physical_node_id = adev->gmc.xgmi.physical_node_id;<br>
>>>>>>> + r = adev->gfxhub.funcs->get_xgmi_info(adev);<br>
>>>>>>> + if (r)<br>
>>>>>>> + return r;<br>
>>>>>>> +<br>
>>>>>>> + dev_info(adev->dev, "xgmi node, old id %d, new id %d\n",<br>
>>>>>>> + prev_physical_node_id, adev-<br>
>>>>>>> gmc.xgmi.physical_node_id);<br>
>>>>>>> + }<br>
>>>>>>> + return 0;<br>
>>>>>>> +}<br>
>>>>>>> +<br>
>>>>>>> /**<br>
>>>>>>> * amdgpu_device_resume - initiate device resume<br>
>>>>>>> *<br>
>>>>>>> @@ -5059,6 +5080,9 @@ int amdgpu_device_resume(struct drm_device *dev,<br>
>>>>>> bool notify_clients)<br>
>>>>>>> r = amdgpu_virt_request_full_gpu(adev, true);<br>
>>>>>>> if (r)<br>
>>>>>>> return r;<br>
>>>>>>> + r = amdgpu_device_update_xgmi_info(adev);<br>
>>>>>>> + if (r)<br>
>>>>>>> + return r;<br>
>>>>>>> }<br>
>>>>>>> if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)<br>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/<br>
>>>>>> drm/amd/amdgpu/amdgpu_gmc.c<br>
>>>>>>> index d1fa5e8e3937..a2abddf3c110 100644<br>
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c<br>
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c<br>
>>>>>>> @@ -1298,8 +1298,7 @@ int amdgpu_gmc_get_nps_memranges(struct<br>
>>>>>> amdgpu_device *adev,<br>
>>>>>>> if (!mem_ranges || !exp_ranges)<br>
>>>>>>> return -EINVAL;<br>
>>>>>>> - refresh = (adev->init_lvl->level !=<br>
>>>>>> AMDGPU_INIT_LEVEL_MINIMAL_XGMI) &&<br>
>>>>>>> - (adev->gmc.reset_flags & AMDGPU_GMC_INIT_RESET_NPS);<br>
>>>>>>> + refresh = true;<br>
>>>>>>> ret = amdgpu_discovery_get_nps_info(adev, &nps_type, &ranges,<br>
>>>>>>> &range_cnt, refresh);<br>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/<br>
>>>>>> amd/amdgpu/gmc_v9_0.c<br>
>>>>>>> index 59385da80185..1eb451a3743b 100644<br>
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c<br>
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c<br>
>>>>>>> @@ -2533,6 +2533,10 @@ static int gmc_v9_0_resume(struct<br>
>>>>>> amdgpu_ip_block *ip_block)<br>
>>>>>>> struct amdgpu_device *adev = ip_block->adev;<br>
>>>>>>> int r;<br>
>>>>>>> + r = gmc_v9_0_mc_init(adev);<br>
>>>>>>> + if (r)<br>
>>>>>>> + return r;<br>
>>>>>>> +<br>
>>>>>>> /* If a reset is done for NPS mode switch, read the memory range<br>
>>>>>>> * information again.<br>
>>>>>>> */<o:p></o:p></p>
</div>
</div>
</body>
</html>