[PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF

Tue Dec 7 21:55:11 UTC 2021

[AMD Official Use Only]

Shaoyun, please see my comments inline.

Thanks,
Zhigang

-----Original Message-----
From: Liu, Shaoyun <Shaoyun.Liu at amd.com> 
Sent: December 7, 2021 2:15 PM
To: Luo, Zhigang <Zhigang.Luo at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Luo, Zhigang <Zhigang.Luo at amd.com>
Subject: RE: [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF

[AMD Official Use Only]

This   patch looks ok  to me . 
Patch 2 is  actually add the PSP xgmi init  not the whole XGMI  init, can  you change the description according  to this ? 
[Zhigang] Ok. Will change it.
Patch 3,  You take the hive lock inside the reset sriov function , but the  hive lock already be took  before this function is called  in gpu_recovery function,  so is it real necessary to get hive  inside the reset sriov function , can  you try remove the code to check hive ?  Or maybe pass the  hive as a parameter into this function if the hive is needed? 
[Zhigang] in patch 1, we made change in gpu_recovery to skip getting xgmi hive if it's sriov vf as we don't want to reset other VF in the same hive.
Patch 4 looks ok to me , but may need  SRDC engineer confirm it won't have  side effect on other AI  asic . 

Regards
Shaoyun.liu

-----Original Message-----
From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Zhigang Luo
Sent: Tuesday, December 7, 2021 11:57 AM
To: amd-gfx at lists.freedesktop.org
Cc: Luo, Zhigang <Zhigang.Luo at amd.com>
Subject: [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF

On SRIOV, host driver can support FLR(function level reset) on individual VF within the hive which might bring the individual device back to normal without the necessary to execute the hive reset. If the FLR failed , host driver will trigger the hive reset, each guest VF will get reset notification before the real hive reset been executed. The VF device can handle the reset request individually in it's reset work handler.

This change updated gpu recover sequence to skip reset other device in the same hive for SRIOV VF.

Signed-off-by: Zhigang Luo <zhigang.luo at amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3c5afa45173c..474f8ea58aa5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4746,7 +4746,7 @@ static int amdgpu_device_lock_hive_adev(struct amdgpu_device *adev, struct amdgp  {
 	struct amdgpu_device *tmp_adev = NULL;
 
-	if (adev->gmc.xgmi.num_physical_nodes > 1) {
+	if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1)) 
+{
 		if (!hive) {
 			dev_err(adev->dev, "Hive is NULL while device has multiple xgmi nodes");
 			return -ENODEV;
@@ -4958,7 +4958,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	 * We always reset all schedulers for device and all devices for XGMI
 	 * hive so that should take care of them too.
 	 */
-	hive = amdgpu_get_xgmi_hive(adev);
+	if (!amdgpu_sriov_vf(adev))
+		hive = amdgpu_get_xgmi_hive(adev);
 	if (hive) {
 		if (atomic_cmpxchg(&hive->in_reset, 0, 1) != 0) {
 			DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress", @@ -4999,7 +5000,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	 * to put adev in the 1st position.
 	 */
 	INIT_LIST_HEAD(&device_list);
-	if (adev->gmc.xgmi.num_physical_nodes > 1) {
+	if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1)) 
+{
 		list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head)
 			list_add_tail(&tmp_adev->reset_list, &device_list);
 		if (!list_is_first(&adev->reset_list, &device_list))
--
2.17.1