[PATCH 2/2] drm/amdgpu: exclude duplicate pages from UMC RAS UE count

Lazar, Lijo Lijo.Lazar at amd.com
Fri Feb 17 08:38:34 UTC 2023


[AMD Official Use Only - General]

Instead of adding another function, why not make pages param as in/out.  

In = number of pages to be added
Out = number of pages actually added.

int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
                struct eeprom_table_record *bps, int *pages)


Thanks,
Lijo

-----Original Message-----
From: Zhou1, Tao <Tao.Zhou1 at amd.com> 
Sent: Friday, February 17, 2023 9:23 AM
To: amd-gfx at lists.freedesktop.org; Zhang, Hawking <Hawking.Zhang at amd.com>; Yang, Stanley <Stanley.Yang at amd.com>; Chai, Thomas <YiPeng.Chai at amd.com>; Li, Candice <Candice.Li at amd.com>; Lazar, Lijo <Lijo.Lazar at amd.com>
Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
Subject: [PATCH 2/2] drm/amdgpu: exclude duplicate pages from UMC RAS UE count

If a UMC bad page is reserved but not freed by an application, the application may trigger uncorrectable error repeatly by accessing the page.

v2: add specific function to do the check.
v3: remove duplicate pages, calculate new added bad page number.

Signed-off-by: Tao Zhou <tao.zhou1 at amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 23 +++++++++++++++++++++++  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  2 ++  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c |  2 ++
 3 files changed, 27 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 6e543558386d..777f85f3e5eb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2115,6 +2115,29 @@ int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
 	return 0;
 }
 
+/* Remove duplicate pages, calculate new added bad page number.
+ * Note: the function should be called between amdgpu_ras_add_bad_pages
+ * and amdgpu_ras_save_bad_pages.
+ */
+int amdgpu_ras_umc_new_ue_count(struct amdgpu_device *adev) {
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct ras_err_handler_data *data;
+	struct amdgpu_ras_eeprom_control *control;
+	int save_count;
+
+	if (!con || !con->eh_data)
+		return 0;
+
+	mutex_lock(&con->recovery_lock);
+	control = &con->eeprom_control;
+	data = con->eh_data;
+	save_count = data->count - control->ras_num_recs;
+	mutex_unlock(&con->recovery_lock);
+
+	return (save_count / adev->umc.retire_unit); }
+
 /*
  * read error record array in eeprom and reserve enough space for
  * storing new bad pages
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index f2ad999993f6..e89c95438a88 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -549,6 +549,8 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
 
 int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
 
+int amdgpu_ras_umc_new_ue_count(struct amdgpu_device *adev);
+
 static inline enum ta_ras_block
 amdgpu_ras_block_to_ta(enum amdgpu_ras_block block) {
 	switch (block) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index 1c7fcb4f2380..45b6be7277dd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -147,6 +147,8 @@ static int amdgpu_umc_do_page_retirement(struct amdgpu_device *adev,
 			err_data->err_addr_cnt) {
 			amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
 						err_data->err_addr_cnt);
+			err_data->ue_count = amdgpu_ras_umc_new_ue_count(adev);
+
 			amdgpu_ras_save_bad_pages(adev);
 
 			amdgpu_dpm_send_hbm_bad_pages_num(adev, con->eeprom_control.ras_num_recs);
--
2.35.1


More information about the amd-gfx mailing list