[PATCH] drm/amdgpu: Avoid VF for RAS recovery source check

Yao, Yiqing(James) YiQing.Yao at amd.com
Tue Dec 10 13:04:09 UTC 2024


[Public]

Reviewed-by: Yiqing Yao <yiqing.yao at amd.com>
Tested-by: Yiqing Yao <yiqing.yao at amd.com>

Thanks,
Yiqing(James).

________________________________
From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of Lijo Lazar <lijo.lazar at amd.com>
Sent: Monday, December 9, 2024 11:52 PM
To: amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
Cc: Zhang, Hawking <Hawking.Zhang at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Skvortsov, Victor <Victor.Skvortsov at amd.com>; Zhao, Victor <Victor.Zhao at amd.com>; Tomasevic, Vojislav <Vojislav.Tomasevic at amd.com>
Subject: [PATCH] drm/amdgpu: Avoid VF for RAS recovery source check

A VF device sets the RAS flag when mailbox data can't be read properly.
There is no conclusive way to tell whether the real source is a RAS error.
Therefore the VF schedules a KFD-based reset, which doesn't set the RAS
source. Skip the RAS source check for any VF-scheduled recovery.

Signed-off-by: Lijo Lazar <lijo.lazar at amd.com>
Reported-by: Vojislav Tomasevic <vojislav.tomasevic at amd.com>

Fixes: 2211660c20a0 ("drm/amdgpu: Prefer RAS recovery for scheduler hang")
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 735a01c58cd7..eb3fd55a3702 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5864,6 +5864,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
          * detected at the same time, let RAS recovery take care of it.
          */
         if (amdgpu_ras_is_err_state(adev, AMDGPU_RAS_BLOCK__ANY) &&
+           !amdgpu_sriov_vf(adev) &&
             reset_context->src != AMDGPU_RESET_SRC_RAS) {
                 dev_dbg(adev->dev,
                         "Gpu recovery from source: %d yielding to RAS error recovery handling",
--
2.25.1
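
For readers following the condition in the hunk above, here is a minimal standalone sketch of the decision it encodes. It is not the driver code: the struct, the enum values, and the function name should_yield_to_ras() are hypothetical stand-ins, and the three fields only model the results of amdgpu_ras_is_err_state(), amdgpu_sriov_vf() and reset_context->src as they appear in the patch.

```c
#include <stdbool.h>

/* Hypothetical reset sources for illustration only; SKETCH_SRC_RAS plays
 * the role of AMDGPU_RESET_SRC_RAS from the patch. */
enum sketch_reset_src {
	SKETCH_SRC_OTHER,
	SKETCH_SRC_RAS,
};

/* Hypothetical snapshot of the three inputs the patched condition reads. */
struct sketch_recover_request {
	bool ras_err_state;        /* models amdgpu_ras_is_err_state(adev, ...) */
	bool is_sriov_vf;          /* models amdgpu_sriov_vf(adev) */
	enum sketch_reset_src src; /* models reset_context->src */
};

/*
 * Returns true when a non-RAS recovery request should back off and let
 * RAS recovery handle the error. With the patch applied, a VF never
 * backs off, because its RAS flag may only reflect a failed mailbox read.
 */
static bool should_yield_to_ras(const struct sketch_recover_request *req)
{
	return req->ras_err_state &&
	       !req->is_sriov_vf &&
	       req->src != SKETCH_SRC_RAS;
}
```

Under these assumptions, a bare-metal request with the RAS flag set and a non-RAS source yields, while the same request from a VF proceeds with its own recovery.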
