[PATCH] drm/amdgpu: skip reset other device in the same hive if it's sriov vf
Liu, Shaoyun
Shaoyun.Liu at amd.com
Sat Dec 4 02:37:36 UTC 2021
[AMD Official Use Only]
I think you need to describe more details on why the hive reset on guest side is not necessary and how host and guest driver will work together to handle the hive reset . You should have 2 patches together as a serials to handle the FLR and mode 1 reset on XGMI configuration.
The describe is something like :
On SRIOV, host driver can support FLR(function level reset) on individual VF within the hive which might bring the individual device back to normal without the necessary to execute the hive reset. If the FLR failed , host driver will trigger the hive reset , each guest VF will get reset notification before the real hive reset been executed . The VF device can handle the reset request individually in it's reset work handler .
-----Original Message-----
From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Zhigang Luo
Sent: Friday, December 3, 2021 5:06 PM
To: amd-gfx at lists.freedesktop.org
Cc: Luo, Zhigang <Zhigang.Luo at amd.com>
Subject: [PATCH] drm/amdgpu: skip reset other device in the same hive if it's sriov vf
For sriov vf hang, vf flr will be triggered. Hive reset is not needed.
Signed-off-by: Zhigang Luo <zhigang.luo at amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3c5afa45173c..474f8ea58aa5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4746,7 +4746,7 @@ static int amdgpu_device_lock_hive_adev(struct amdgpu_device *adev, struct amdgp {
struct amdgpu_device *tmp_adev = NULL;
- if (adev->gmc.xgmi.num_physical_nodes > 1) {
+ if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1))
+{
if (!hive) {
dev_err(adev->dev, "Hive is NULL while device has multiple xgmi nodes");
return -ENODEV;
@@ -4958,7 +4958,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
* We always reset all schedulers for device and all devices for XGMI
* hive so that should take care of them too.
*/
- hive = amdgpu_get_xgmi_hive(adev);
+ if (!amdgpu_sriov_vf(adev))
+ hive = amdgpu_get_xgmi_hive(adev);
if (hive) {
if (atomic_cmpxchg(&hive->in_reset, 0, 1) != 0) {
DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress", @@ -4999,7 +5000,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
* to put adev in the 1st position.
*/
INIT_LIST_HEAD(&device_list);
- if (adev->gmc.xgmi.num_physical_nodes > 1) {
+ if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1))
+{
list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head)
list_add_tail(&tmp_adev->reset_list, &device_list);
if (!list_is_first(&adev->reset_list, &device_list))
--
2.17.1
More information about the amd-gfx
mailing list