[PATCH 2/2] update check condition of query for ras page retire

Zhang, Hawking Hawking.Zhang at amd.com
Thu Jan 18 08:16:34 UTC 2024


[AMD Official Use Only - General]

Series is

Reviewed-by: Hawking Zhang <Hawking.Zhang at amd.com>

Regards,
Hawking
-----Original Message-----
From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Tao Zhou
Sent: Thursday, January 18, 2024 15:36
To: amd-gfx at lists.freedesktop.org
Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
Subject: [PATCH 2/2] update check condition of query for ras page retire

Support page retirement handling in debug mode.

v2: revert smu_v13_0_6_get_ecc_info directly.

Signed-off-by: Tao Zhou <tao.zhou1 at amd.com>
Change-Id: I0aaa807d7fe87b3da0f023c380e57ab6dd446fcf
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index 9d1cf41cf483..d8d263956e85 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -93,11 +93,14 @@ static int amdgpu_umc_do_page_retirement(struct amdgpu_device *adev,
        struct ras_err_data *err_data = (struct ras_err_data *)ras_error_status;
        struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
        int ret = 0;
+       unsigned int error_query_mode;
        unsigned long err_count;

        kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
+       amdgpu_ras_get_error_query_mode(adev, &error_query_mode);
        ret = amdgpu_dpm_get_ecc_info(adev, (void *)&(con->umc_ecc));
-       if (ret == -EOPNOTSUPP) {
+       if (ret == -EOPNOTSUPP &&
+           error_query_mode == AMDGPU_RAS_DIRECT_ERROR_QUERY) {
                if (adev->umc.ras && adev->umc.ras->ras_block.hw_ops &&
                    adev->umc.ras->ras_block.hw_ops->query_ras_error_count)
                    adev->umc.ras->ras_block.hw_ops->query_ras_error_count(adev, ras_error_status);
@@ -121,7 +124,8 @@ static int amdgpu_umc_do_page_retirement(struct amdgpu_device *adev,
                         */
                        adev->umc.ras->ras_block.hw_ops->query_ras_error_address(adev, ras_error_status);
                }
-       } else if (!ret) {
+       } else if (error_query_mode == AMDGPU_RAS_FIRMWARE_ERROR_QUERY ||
+           (!ret && error_query_mode == AMDGPU_RAS_DIRECT_ERROR_QUERY)) {
                if (adev->umc.ras &&
                    adev->umc.ras->ecc_info_query_ras_error_count)
                    adev->umc.ras->ecc_info_query_ras_error_count(adev, ras_error_status);
--
2.34.1
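
For readers following the two changed conditions, the branch selection can be modelled outside the driver. The sketch below is illustrative only and is not part of the patch: the enum names and select_retire_path() are hypothetical stand-ins for AMDGPU_RAS_DIRECT_ERROR_QUERY, AMDGPU_RAS_FIRMWARE_ERROR_QUERY and the if/else-if chain in amdgpu_umc_do_page_retirement(). PATH_NONE only means that neither of the two branches shown in the diff is taken; whatever the driver does after that is not visible here.

```c
#include <errno.h>

/* Hypothetical stand-ins for the driver's error query mode constants. */
enum query_mode {
    QUERY_MODE_DIRECT,    /* models AMDGPU_RAS_DIRECT_ERROR_QUERY */
    QUERY_MODE_FIRMWARE,  /* models AMDGPU_RAS_FIRMWARE_ERROR_QUERY */
    QUERY_MODE_OTHER,     /* any other value, assumed for illustration */
};

/* Which of the two branches touched by the patch would run. */
enum retire_path {
    PATH_NONE,            /* neither branch shown in the diff */
    PATH_REGISTER_QUERY,  /* hw_ops->query_ras_error_count/address branch */
    PATH_ECC_INFO_QUERY,  /* ecc_info_query_ras_error_count branch */
};

/*
 * Mirrors the patched conditions: "ret" models the return value of
 * amdgpu_dpm_get_ecc_info(), "mode" models error_query_mode.
 */
static enum retire_path select_retire_path(int ret, enum query_mode mode)
{
    if (ret == -EOPNOTSUPP && mode == QUERY_MODE_DIRECT)
        return PATH_REGISTER_QUERY;

    if (mode == QUERY_MODE_FIRMWARE ||
        (!ret && mode == QUERY_MODE_DIRECT))
        return PATH_ECC_INFO_QUERY;

    return PATH_NONE;
}
```

Under this model, firmware query mode always selects the ecc_info branch regardless of ret, while direct query mode keeps the pre-patch behaviour: -EOPNOTSUPP selects the register query branch and 0 selects the ecc_info branch.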