[PATCH 02/14] drm/amdgpu: add ACA kernel hardware error log support

Wang, Yang(Kevin) KevinYang.Wang at amd.com
Wed Jan 3 10:42:38 UTC 2024


[AMD Official Use Only - General]


Additional:
This idx/count can help us to distinguish whether those logs are from the same query results.

If this information is removed, we cannot confirm from the log whether any log is missing.
e.g: [4/8], then we can know that there are 4 more that are not printed due to some reason.

Best Regards,
Kevin

_____________________________________________
From: Wang, Yang(Kevin)
Sent: Wednesday, January 3, 2024 5:51 PM
To: Zhang, Hawking <Hawking.Zhang at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>; Chai, Thomas <YiPeng.Chai at amd.com>
Subject: RE: [PATCH 02/14] drm/amdgpu: add ACA kernel hardware error log support

_____________________________________________
From: Zhang, Hawking <Hawking.Zhang at amd.com<mailto:Hawking.Zhang at amd.com>>
Sent: Wednesday, January 3, 2024 5:36 PM
To: Wang, Yang(Kevin) <KevinYang.Wang at amd.com<mailto:KevinYang.Wang at amd.com>>; amd-gfx at lists.freedesktop.org<mailto:amd-gfx at lists.freedesktop.org>
Cc: Zhou1, Tao <Tao.Zhou1 at amd.com<mailto:Tao.Zhou1 at amd.com>>; Chai, Thomas <YiPeng.Chai at amd.com<mailto:YiPeng.Chai at amd.com>>
Subject: RE: [PATCH 02/14] drm/amdgpu: add ACA kernel hardware error log support


[AMD Official Use Only - General]



+       dev_info(adev->dev, "[Hardware error] Accelerator Check Architecture events logged\n");
+       /* plus 1 for output format, e.g: ACA[08/08]: xxxx */
+       for (i = 0; i < ARRAY_SIZE(aca_regs); i++)
+               dev_info(adev->dev, "[Hardware error] ACA[%02d/%02d].%s=0x%016llx\n",
+                        idx + 1, total, aca_regs[i].name, bank->regs[aca_regs[i].reg_idx]);

We should keep the ACA log format simple since there are tools like crash dumper that grab these logs.

How about formatting log as below
dev_info(adev->dev, "[Hardware error] Accelerator Check Architecture (ACA) events logged\n");
dev_info(adev→dev, "[Hardware error] ACA.%s=0x%016llx\n");

In general, if the idx doesn't convey useful information, then just replace it with ACA.Reg.

[Kevin]:

Agree, will update it in next version.

Best Regards,
Kevin

Thoughts?

Regards,
Hawking

-----Original Message-----
From: Wang, Yang(Kevin) <KevinYang.Wang at amd.com<mailto:KevinYang.Wang at amd.com>>
Sent: Wednesday, January 3, 2024 16:02
To: amd-gfx at lists.freedesktop.org<mailto:amd-gfx at lists.freedesktop.org>
Cc: Zhang, Hawking <Hawking.Zhang at amd.com<mailto:Hawking.Zhang at amd.com>>; Zhou1, Tao <Tao.Zhou1 at amd.com<mailto:Tao.Zhou1 at amd.com>>; Chai, Thomas <YiPeng.Chai at amd.com<mailto:YiPeng.Chai at amd.com>>; Wang, Yang(Kevin) <KevinYang.Wang at amd.com<mailto:KevinYang.Wang at amd.com>>
Subject: [PATCH 02/14] drm/amdgpu: add ACA kernel hardware error log support

add ACA kernel hardware error log support.

Signed-off-by: Yang Wang <kevinyang.wang at amd.com<mailto:kevinyang.wang at amd.com>>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 29 +++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
index 6a6f167b5380..cadeda64eded 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
@@ -100,6 +100,33 @@ static int aca_smu_get_valid_aca_count(struct amdgpu_device *adev, enum aca_erro
        return smu_funcs->get_valid_aca_count(adev, type, count);  }

+static struct aca_regs_dump {
+       const char *name;
+       int reg_idx;
+} aca_regs[] = {
+       {"CONTROL",             ACA_REG_IDX_CTL},
+       {"STATUS",              ACA_REG_IDX_STATUS},
+       {"ADDR",                ACA_REG_IDX_ADDR},
+       {"MISC",                ACA_REG_IDX_MISC0},
+       {"CONFIG",              ACA_REG_IDX_CONFG},
+       {"IPID",                ACA_REG_IDX_IPID},
+       {"SYND",                ACA_REG_IDX_SYND},
+       {"DESTAT",              ACA_REG_IDX_DESTAT},
+       {"DEADDR",              ACA_REG_IDX_DEADDR},
+       {"CONTROL_MASK",        ACA_REG_IDX_CTL_MASK},
+};
+
+static void aca_smu_bank_dump(struct amdgpu_device *adev, int idx, int
+total, struct aca_bank *bank) {
+       int i;
+
+       dev_info(adev->dev, "[Hardware error] Accelerator Check Architecture events logged\n");
+       /* plus 1 for output format, e.g: ACA[08/08]: xxxx */
+       for (i = 0; i < ARRAY_SIZE(aca_regs); i++)
+               dev_info(adev->dev, "[Hardware error] ACA[%02d/%02d].%s=0x%016llx\n",
+                        idx + 1, total, aca_regs[i].name, bank->regs[aca_regs[i].reg_idx]);
+}
+
 static int aca_smu_get_valid_aca_banks(struct amdgpu_device *adev, enum aca_error_type type,
                                       int start, int count,
                                       struct aca_banks *banks)
@@ -137,6 +164,8 @@ static int aca_smu_get_valid_aca_banks(struct amdgpu_device *adev, enum aca_erro
                if (ret)
                        return ret;

+               aca_smu_bank_dump(adev, i, count, &bank);
+
                ret = aca_banks_add_bank(banks, &bank);
                if (ret)
                        return ret;
--
2.34.1

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20240103/6dd140bc/attachment-0001.htm>


More information about the amd-gfx mailing list