[PATCH 00/12] BAD GPU retirement policy by total bad pages
Zhang, Hawking
Hawking.Zhang at amd.com
Thu Jul 30 14:20:45 UTC 2020
[AMD Public Use]
The new series looks good to me in general, but still the following nitpicks
Patch #1
Default typical value --> default value
Patch #4
+ DRM_ERROR("Exceeding the bad_page_threshold parameter, "
+ "disabling the GPU.\n");
Let's use dev_err to have device pci bdf number in dmesg
Patch #8
Please remove the redundant {}
+ } else {
+ return -EIO;
+ }
Patch #12
It would be better to use dev_info/dev_err for more readable information in mGPU case.
+ DRM_INFO("Using one valid bigger bad page threshold "
+ "and correcting eeprom header tag.\n");
+ ret = amdgpu_ras_eeprom_correct_header_tag(control,
+ EEPROM_TABLE_HDR_VAL);
+ } else {
+ *exceed_err_limit = true;
+ DRM_ERROR("Exceeding the bad_page_threshold parameter, "
"disabling the GPU.\n");
The series is in high risk to introduce regression, please do conduct comprehensive testing before commit.
With above address, the series is
Reviewed-by: Hawking Zhang <Hawking.Zhang at amd.com>
Regards,
hawking
-----Original Message-----
From: Chen, Guchun <Guchun.Chen at amd.com>
Sent: Wednesday, July 29, 2020 10:56
To: amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; Li, Dennis <Dennis.Li at amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Clements, John <John.Clements at amd.com>; Lazar, Lijo <Lijo.Lazar at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>
Cc: Chen, Guchun <Guchun.Chen at amd.com>
Subject: [PATCH 00/12] BAD GPU retirement policy by total bad pages
The series is to enable/disable bad page feature and apply different bad page reservation strategy by different bad page threshold configurations.
When the saved bad pages written to eeprom reach the threshold, one ras recovery will be issued immediately and the recovery will fail to tell user that the GPU is BAD and needs to be retired for further check or setting one valid bigger threshold value in next driver's probe to skip corresponding check.
During bootup, similar bad page threshold check is conducted as well when eeprom get initialized, and it will possibly break boot up for user's awareness.
When user sets bad_page_threshold=0 once probing driver, bad page retirement feature is completely disabled, and driver has no chance to process bad page information record and write it to eeprom.
Guchun Chen (12):
drm/amdgpu: add bad page count threshold in module parameter
drm/amdgpu: validate bad page threshold in ras
drm/amdgpu: add bad gpu tag definition
drm/amdgpu: break driver init process when it's bad GPU
drm/amdgpu: skip bad page reservation once issuing from eeprom write
drm/amdgpu: schedule ras recovery when reaching bad page threshold
drm/amdgpu: break GPU recovery once it's in bad state
drm/amdgpu: restore ras flags when user resets eeprom
drm/amdgpu: add one definition for RAS's sysfs/debugfs name
drm/amdgpu: decouple sysfs creating of bad page node
drm/amdgpu: disable page reservation when amdgpu_bad_page_threshold =
0
drm/amdgpu: update eeprom once specifying one bigger threshold
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 32 ++-
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 11 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 186 ++++++++++++++----
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 19 +-
.../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 121 +++++++++++-
.../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h | 9 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 5 +-
8 files changed, 331 insertions(+), 53 deletions(-)
--
2.17.1
More information about the amd-gfx
mailing list