[PATCH 2/2] drm/amd/amdgpu: use the default reset for ras recovery

Zhang, GuoQing (Sam) GuoQing.Zhang at amd.com
Mon May 6 08:16:59 UTC 2024


[AMD Official Use Only - General]

Hi @Deucher, Alexander <Alexander.Deucher at amd.com> and @Koenig, Christian <Christian.Koenig at amd.com>,

Could you help review this patch?
Without this patch, when a customer sets the `reset_method=3` modprobe parameter to use mode2 reset, RAS recovery also uses mode2 reset and skips mode1 reset.
When an ECC error happens, the GPU can’t be recovered with mode2 reset, and because mode1 reset is skipped, the GPU reset fails.

This patch makes RAS recovery (ECC error) always use mode1 reset, even when `reset_method=3` is set.
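
To make the failure and the fix easier to follow, here is a minimal standalone sketch of the save/override/restore pattern used in the diff. It is a simplified model, not the driver code: `reset_method` and `gpu_recovery` are stand-ins for the module parameters, all helper names are hypothetical, the value 2 for mode1 is an assumption, and it assumes the default method (-1) resolves to mode1 on the affected hardware.

```c
#include <stdbool.h>

/* Illustrative values: 3 = mode2 and -1 = default come from this thread;
 * 2 = mode1 is an assumption made for this sketch. */
enum { RESET_DEFAULT = -1, RESET_MODE1 = 2, RESET_MODE2 = 3 };

/* Stand-ins for the amdgpu_reset_method / amdgpu_gpu_recovery parameters. */
static int reset_method = RESET_DEFAULT;
static int gpu_recovery = 2;

/* Assumption: the default method resolves to mode1 on this hardware. */
static int effective_reset(int method)
{
	return method == RESET_DEFAULT ? RESET_MODE1 : method;
}

/* Per the description above: an ECC error is only recovered by mode1. */
static bool recover_from_ecc_error(void)
{
	return effective_reset(reset_method) == RESET_MODE1;
}

/* Behaviour without the patch: RAS recovery uses whatever the user set. */
static bool ras_recovery_unpatched(void)
{
	return recover_from_ecc_error();
}

/* Behaviour with the patch: temporarily fall back to the default method. */
static bool ras_recovery_patched(void)
{
	int saved = reset_method;
	bool ok;

	/* Mirrors the "amdgpu_gpu_recovery == 2" check in the diff;
	 * what that value means is not modelled here. */
	if (gpu_recovery == 2)
		reset_method = RESET_DEFAULT;

	ok = recover_from_ecc_error();

	if (gpu_recovery == 2)
		reset_method = saved;

	return ok;
}
```

With `reset_method = RESET_MODE2`, `ras_recovery_unpatched()` reports failure in this model, while `ras_recovery_patched()` succeeds and leaves the user's setting unchanged afterwards. Because the override is made on a shared variable, any other reset that runs during that window would also see the default method; the sketch does not model concurrency.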

Thanks
Sam

From: Feng, Kenneth <Kenneth.Feng at amd.com>
Date: Monday, April 29, 2024 at 16:15
To: Feng, Kenneth <Kenneth.Feng at amd.com>, amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>, Zhang, GuoQing (Sam) <GuoQing.Zhang at amd.com>
Cc: Zhang, Owen(SRDC) <Owen.Zhang2 at amd.com>, Aldabagh, Maad <Maad.Aldabagh at amd.com>, Ma, Qing (Mark) <Qing.Ma at amd.com>
Subject: RE: [PATCH 2/2] drm/amd/amdgpu: use the default reset for ras recovery

+ at Zhang, GuoQing (Sam)

-----Original Message-----
From: Kenneth Feng <kenneth.feng at amd.com>
Sent: Monday, April 29, 2024 3:32 PM
To: amd-gfx at lists.freedesktop.org
Cc: Zhang, Owen(SRDC) <Owen.Zhang2 at amd.com>; Aldabagh, Maad <Maad.Aldabagh at amd.com>; Ma, Qing (Mark) <Qing.Ma at amd.com>; Feng, Kenneth <Kenneth.Feng at amd.com>
Subject: [PATCH 2/2] drm/amd/amdgpu: use the default reset for ras recovery

use the default reset for ras recovery

Signed-off-by: Kenneth Feng <kenneth.feng at amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index a037e8fba29f..f92b2c4f0d5c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2437,6 +2437,7 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
        struct amdgpu_device *adev = ras->adev;
        struct list_head device_list, *device_list_handle =  NULL;
        struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev);
+       int save_reset_method = amdgpu_reset_method;

        if (hive) {
                atomic_set(&hive->ras_recovery, 1);
@@ -2501,7 +2502,13 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
                        }
                }

+               if (amdgpu_gpu_recovery == 2)
+                       amdgpu_reset_method = -1;
+
                amdgpu_device_gpu_recover(ras->adev, NULL, &reset_context);
+
+               if (amdgpu_gpu_recovery == 2)
+                       amdgpu_reset_method = save_reset_method;
        }
        atomic_set(&ras->in_recovery, 0);
        if (hive) {
--
2.34.1