[PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout

Fri Jan 15 02:14:09 UTC 2021

[AMD Official Use Only - Internal Distribution Only]

Reviewed-by: Evan Quan <evan.quan at amd.com>

-----Original Message-----
From: Horace Chen <horace.chen at amd.com>
Sent: Thursday, January 14, 2021 9:37 PM
To: amd-gfx at lists.freedesktop.org
Cc: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Quan, Evan <Evan.Quan at amd.com>; Chen, Horace <Horace.Chen at amd.com>; Tuikov, Luben <Luben.Tuikov at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Xiao, Jack <Jack.Xiao at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; Liu, Monk <Monk.Liu at amd.com>; Xu, Feifei <Feifei.Xu at amd.com>; Wang, Kevin(Yang) <Kevin1.Wang at amd.com>; Xiaojie Yuan <xiaojie.yuan at amd.com>
Subject: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout

Fix a racing issue when jobs on 2 rings timeout simultaneously.

If 2 rings timed out at the same time, the amdgpu_device_gpu_recover will be reentered. Then the
adev->gmc.xgmi.head will be grabbed by 2 local linked list,
which may cause wild pointer issue in iterating.

lock the device earily to prevent the node be added to 2 different lists.

for xgmi there is a hive lock which can promise there won't have
2 devices on same hive reenter the interation. So xgmi doesn't need to lock the device.

Signed-off-by: Horace Chen <horace.chen at amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 4d434803fb49..a28e138ac72c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4591,19 +4591,20 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
 device_list_handle = &hive->device_list;
 } else {
+/* if current dev is already in reset, skip adding list to prevent race issue */
+if (!amdgpu_device_lock_adev(adev, hive)) {
+dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
+job ? job->base.id : -1);
+r = 0;
+goto skip_recovery;
+}
+
 list_add_tail(&adev->gmc.xgmi.head, &device_list);
 device_list_handle = &device_list;
 }

 /* block all schedulers and reset given job's ring */
 list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
-if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
-dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
-  job ? job->base.id : -1);
-r = 0;
-goto skip_recovery;
-}
-
 /*
  * Try to put the audio codec into suspend state
  * before gpu reset started.
--
2.17.1