<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;
mso-ligatures:none;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="en-CN" link="blue" vlink="purple" style="word-wrap:break-word">
<div class="WordSection1">
<div>
<p class="MsoNormal"><br>
On 2025/5/8 18:56, Christian König wrote:<br>
> On 5/8/25 10:12, Lazar, Lijo wrote:<br>
>><br>
>> On 5/8/2025 10:39 AM, Samuel Zhang wrote:<br>
>>> For virtual machine with vGPUs in SRIOV single device mode and XGMI<br>
>>> is enabled, XGMI physical node ids may change when waking up from<br>
>>> hiberation with different vGPU devices. So update XGMI info on resume.<br>
>>><br>
>>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com><br>
>>> Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com><br>
>>> ---<br>
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 ++++++++++++++++++++++<br>
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h | 4 ++++<br>
>>> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 6 ++++++<br>
>>> 3 files changed, 34 insertions(+)<br>
>>><br>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
>>> index d477a901af84..843a3b0a9a07 100644<br>
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
>>> @@ -4478,6 +4478,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,<br>
>>> r = adev->gfxhub.funcs->get_xgmi_info(adev);<br>
>>> if (r)<br>
>>> return r;<br>
>>> + adev->gmc.xgmi.prev_physical_node_id = adev->gmc.xgmi.physical_node_id;<br>
>>> }<br>
>>> <br>
>>> /* enable PCIE atomic ops */<br>
>>> @@ -5040,6 +5041,26 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)<br>
>>> return 0;<br>
>>> }<br>
>>> <br>
>>> +static int amdgpu_device_update_xgmi_info(struct amdgpu_device *adev)<br>
>>> +{<br>
>>> + int r;<br>
>>> +<br>
>>> + /* Get xgmi info again for sriov to detect device changes */<br>
>>> + if (amdgpu_sriov_vf(adev) &&<br>
>>> + !(adev->flags & AMD_IS_APU) &&<br>
>>> + adev->gmc.xgmi.supported &&<br>
>>> + !adev->gmc.xgmi.connected_to_cpu) {<br>
>>> + adev->gmc.xgmi.prev_physical_node_id = adev->gmc.xgmi.physical_node_id;<br>
>>> + r = adev->gfxhub.funcs->get_xgmi_info(adev);<br>
>>> + if (r)<br>
>>> + return r;<br>
>>> +<br>
>>> + dev_info(adev->dev, "xgmi node, old id %d, new id %d\n",<br>
>>> + adev->gmc.xgmi.prev_physical_node_id, adev->gmc.xgmi.physical_node_id);<br>
>>> + }<br>
>>> + return 0;<br>
>>> +}<br>
>>> +<br>
>>> /**<br>
>>> * amdgpu_device_resume - initiate device resume<br>
>>> *<br>
>>> @@ -5059,6 +5080,9 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)<br>
>>> r = amdgpu_virt_request_full_gpu(adev, true);<br>
>>> if (r)<br>
>>> return r;<br>
>>> + r = amdgpu_device_update_xgmi_info(adev);<br>
>>> + if (r)<br>
>>> + return r;<br>
>>> }<br>
>>> <br>
>>> if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)<br>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h<br>
>>> index 32dabba4062f..1387901576f1 100644<br>
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h<br>
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h<br>
>>> @@ -89,6 +89,7 @@ struct amdgpu_xgmi {<br>
>>> u64 node_segment_size;<br>
>>> /* physical node (0-3) */<br>
>>> unsigned physical_node_id;<br>
>>> + unsigned prev_physical_node_id;<br>
>>> /* number of nodes (0-4) */<br>
>>> unsigned num_physical_nodes;<br>
>>> /* gpu list in the same hive */<br>
>>> @@ -101,6 +102,9 @@ struct amdgpu_xgmi {<br>
>>> uint8_t max_width;<br>
>>> };<br>
>>> <br>
>>> +#define amdgpu_xmgi_is_node_changed(adev) \<br>
>> Typo - xgmi<br>
>><br>
>>> + (adev->gmc.xgmi.prev_physical_node_id != adev->gmc.xgmi.physical_node_id)<br>
>> Since prev_physical_node_id is updated only for VF, the check should be<br>
>> there here as well.<br>
>><br>
>> Otherwise, you may have something like below in<br>
>> amdgpu_device_update_xgmi_info()<br>
>><br>
>> amdgpu_xgmi.node_changed = false;<br>
>> if (check_condition) {<br>
>> prev_node = adev->gmc.xgmi.physical_node_id;<br>
>> adev->gfxhub.funcs->get_xgmi_info(adev)<br>
>> amdgpu_xgmi.node_changed = (prev_node != adev->gmc.xgmi.physical_node_id);<br>
>> }<br>
>><br>
>> To make it clearer -<br>
>><br>
>> Would still prefer to wrap under amdgpu_virt_migration_xyz() to make it<br>
>> clear that this is done for node migration.<br>
><br>
> Yeah, that is a rather good idea as well.<br>
><br>
> And we should *always* execute that and not just when the physical node id changes.<br>
><br>
> Otherwise we won't always test the code path and potentially break it at some point.<br>
<br>
<br>
Hi @Koenig, Christian <<a href="mailto:Christian.Koenig@amd.com">mailto:Christian.Koenig@amd.com</a>>, thank you for
<br>
the explanation.<br>
<br>
Hi @Lazar, Lijo <<a href="mailto:Lijo.Lazar@amd.com">mailto:Lijo.Lazar@amd.com</a>>, are you OK with removing
<br>
the check? I can do reset test using quark in VM to ensure no regression <br>
is introduced. Please comment if you still have different opinion. Thank <br>
you!<br>
<br>
Regards<br>
Sam<br>
<br>
<br>
> Regards,<br>
> Christian.<br>
><br>
>> Ex:<br>
>><br>
>> bool amdgpu_virt_migration_detected()<br>
>> {<br>
>> return amdgpu_xgmi.node_changed; // And any other combination checks<br>
>> which could up in future.<br>
>> }<br>
>><br>
>> The check needs to be done for any further changes down the series like<br>
>><br>
>> if (amdgpu_virt_migration_detected())<br>
>> psp_update_gpu_addresses();<br>
>><br>
>> Thanks,<br>
>> Lijo<br>
>><br>
>>> +<br>
>>> struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev);<br>
>>> void amdgpu_put_xgmi_hive(struct amdgpu_hive_info *hive);<br>
>>> int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive, struct amdgpu_device *adev);<br>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c<br>
>>> index 59385da80185..7c0ca2721eb3 100644<br>
>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c<br>
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c<br>
>>> @@ -2533,6 +2533,12 @@ static int gmc_v9_0_resume(struct amdgpu_ip_block *ip_block)<br>
>>> struct amdgpu_device *adev = ip_block->adev;<br>
>>> int r;<br>
>>> <br>
>>> + if (amdgpu_xmgi_is_node_changed(adev)) {<br>
>>> + adev->vm_manager.vram_base_offset = adev->gfxhub.funcs->get_mc_fb_offset(adev);<br>
>>> + adev->vm_manager.vram_base_offset +=<br>
>>> + adev->gmc.xgmi.physical_node_id * adev->gmc.xgmi.node_segment_size;<br>
>>> + }<br>
>>> +<br>
>>> /* If a reset is done for NPS mode switch, read the memory range<br>
>>> * information again.<br>
>>> */<o:p></o:p></p>
</div>
</div>
</body>
</html>