[PATCH] drm/amdgpu: add uncorrectable error count print in UMC ecc irq cb
Chen, Guchun
Guchun.Chen at amd.com
Fri Apr 10 08:02:50 UTC 2020
[AMD Public Use]
Hi Hawking,
I submitted one new patch to address these rules just now. Please review.
Regards,
Guchun
_____________________________________________
From: Zhang, Hawking <Hawking.Zhang at amd.com>
Sent: Friday, April 10, 2020 1:07 PM
To: Chen, Guchun <Guchun.Chen at amd.com>; amd-gfx at lists.freedesktop.org; Li, Dennis <Dennis.Li at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Clements, John <John.Clements at amd.com>
Subject: RE: [PATCH] drm/amdgpu: add uncorrectable error count print in UMC ecc irq cb
[AMD Official Use Only - Internal Distribution Only]
Hi Guchun,
I put all the rules together. Please make the patch accordingly.
1). Use "correctable/uncorrectable *hardware* error" instead of just "correctable/uncorrectable error" in all callback functions that print RAS error counters.
2). Add the wording "no user action necessary" to all "correctable error" related kernel messages.
3). For the sync flood interrupt, let's not just indicate the ATHUB_ERROR_EVENT type, but also print "uncorrectable hardware error (ERREVENT_ATHUB_INT) detected". The same applies to the BIF interrupt for the UE.
4). Replace DRM_INFO with dev_info for all the RAS related kernel messaging.
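A minimal userspace sketch of what messages following these four rules might look like. It is not kernel code: `struct mock_dev`, `format_ce_msg`, `format_ue_msg` and `has_device_prefix` are hypothetical names used only for illustration, the BDF value is made up, and the "driver bdf: " prefix only approximates what dev_info prepends in the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a device. In the kernel, dev_info() derives the
 * prefix from struct device; here the two parts are carried explicitly. */
struct mock_dev {
	const char *driver;	/* e.g. "amdgpu" */
	const char *bdf;	/* PCI bus:device.function string */
};

/* Rules 1, 2 and 4: "hardware" in the wording, "no user action necessary"
 * for correctable errors, and a device prefix instead of a DRM prefix. */
static int format_ce_msg(char *buf, size_t len, const struct mock_dev *dev,
			 unsigned long ce_count, const char *block)
{
	assert(buf && dev && block);
	return snprintf(buf, len,
			"%s %s: %lu correctable hardware errors detected in %s block, "
			"no user action necessary\n",
			dev->driver, dev->bdf, ce_count, block);
}

/* Rules 1 and 4 for the uncorrectable case. */
static int format_ue_msg(char *buf, size_t len, const struct mock_dev *dev,
			 unsigned long ue_count, const char *block)
{
	assert(buf && dev && block);
	return snprintf(buf, len,
			"%s %s: %lu uncorrectable hardware errors detected in %s block\n",
			dev->driver, dev->bdf, ue_count, block);
}

/* Returns 1 if msg starts with the "driver bdf: " prefix of dev, else 0. */
static int has_device_prefix(const char *msg, const struct mock_dev *dev)
{
	char prefix[64];
	int n = snprintf(prefix, sizeof(prefix), "%s %s: ", dev->driver, dev->bdf);

	if (n < 0 || (size_t)n >= sizeof(prefix))
		return 0;
	return strncmp(msg, prefix, (size_t)n) == 0;
}
```

With a device prefix on every RAS message, a user on a multi-GPU system can tell which device reported the error, which a plain DRM prefix does not show.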
Regards,
Hawking
-----Original Message-----
From: Zhang, Hawking
Sent: Friday, April 10, 2020 13:05
To: Chen, Guchun <Guchun.Chen at amd.com>; 'amd-gfx at lists.freedesktop.org' <amd-gfx at lists.freedesktop.org>; Li, Dennis <Dennis.Li at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Clements, John <John.Clements at amd.com>
Subject: RE: [PATCH] drm/amdgpu: add uncorrectable error count print in UMC ecc irq cb
[AMD Official Use Only - Internal Distribution Only]
And some more rules for RAS wording in kernel messages.
1). Use "correctable/uncorrectable *hardware* error" instead of just "correctable/uncorrectable error" in all callback functions that print RAS error counters.
2). Add the wording "no user action necessary" to all "correctable error" related kernel messages.
3). For the sync flood interrupt, let's not just indicate the ATHUB_ERROR_EVENT type, but also print "uncorrectable hardware error (ERREVENT_ATHUB_INT) detected". The same applies to the BIF interrupt for the UE.
Regards,
Hawking
-----Original Message-----
From: Zhang, Hawking
Sent: Friday, April 10, 2020 12:57
To: Chen, Guchun <Guchun.Chen at amd.com>; amd-gfx at lists.freedesktop.org; Li, Dennis <Dennis.Li at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Clements, John <John.Clements at amd.com>
Subject: RE: [PATCH] drm/amdgpu: add uncorrectable error count print in UMC ecc irq cb
[AMD Official Use Only - Internal Distribution Only]
Hello Guchun,
Besides this, could you please also make a patch to replace DRM_INFO with dev_info in amdgpu_ras_check_supported? Basically, we'd prefer to have the device BDF as the prefix in RAS related kernel messages, instead of the DRM prefix.
Please also review the other RAS wording again in case any message is still printed with DRM_INFO. We should let the user know exactly which GPU device any RAS error information refers to.
Regards,
Hawking
-----Original Message-----
From: Chen, Guchun <Guchun.Chen at amd.com>
Sent: Friday, April 10, 2020 11:55
To: amd-gfx at lists.freedesktop.org; Zhang, Hawking <Hawking.Zhang at amd.com>; Li, Dennis <Dennis.Li at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Clements, John <John.Clements at amd.com>
Cc: Chen, Guchun <Guchun.Chen at amd.com>
Subject: [PATCH] drm/amdgpu: add uncorrectable error count print in UMC ecc irq cb
Uncorrectable error count printing is missing when issuing a UMC UE injection. By the time the error count log function runs in the GPU recovery work thread, the count from the last error injection can no longer be read and printed, because the error status register is automatically cleared after it is read in the UMC ECC irq callback. So add the message print in the UMC ECC irq callback, to be consistent with the other RAS error interrupt cases.
Signed-off-by: Guchun Chen <guchun.chen at amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index f4d40855147b..267f7c30f4dd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -121,6 +121,9 @@ int amdgpu_umc_process_ras_data_cb(struct amdgpu_device *adev,
 	/* only uncorrectable error needs gpu reset */
 	if (err_data->ue_count) {
+		dev_info(adev->dev, "%ld uncorrectable errors detected in UMC block\n",
+			err_data->ue_count);
+
 		if (err_data->err_addr_cnt &&
 		    amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
 						err_data->err_addr_cnt))
--
2.17.1
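
The reasoning in the commit message above can be illustrated with a small userspace model of a clear-on-read status register. This is only a sketch under stated assumptions: `struct mock_status_reg`, `read_and_clear`, `irq_cb` and `recovery_work` are hypothetical names and do not correspond to the driver's real functions or register layout.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical model of an error status register that clears when read. */
struct mock_status_reg {
	unsigned long ue_count;
};

/* Reading returns the latched count and resets it to zero. */
static unsigned long read_and_clear(struct mock_status_reg *reg)
{
	unsigned long val;

	assert(reg);
	val = reg->ue_count;
	reg->ue_count = 0;
	return val;
}

/* Stand-in for the irq callback: the count is only available here,
 * so this is where it has to be reported. */
static unsigned long irq_cb(struct mock_status_reg *reg, char *log, size_t len)
{
	unsigned long ue = read_and_clear(reg);

	if (ue)
		snprintf(log, len,
			 "%lu uncorrectable errors detected in UMC block\n", ue);
	return ue;
}

/* Stand-in for the later recovery work: a second read sees zero,
 * so nothing meaningful can be logged at this point. */
static unsigned long recovery_work(struct mock_status_reg *reg)
{
	return read_and_clear(reg);
}
```

In this model, calling `irq_cb` first and `recovery_work` afterwards leaves the second call with a count of zero, which is the situation the patch works around by printing in the callback.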