[PATCH 00/12] BAD GPU retirement policy by total bad pages
Guchun Chen
guchun.chen at amd.com
Tue Jul 28 07:49:22 UTC 2020
This series adds a way to enable or disable the bad page retirement
feature and applies a different bad page reservation strategy depending
on the configured bad page threshold.

When the number of bad pages saved to the EEPROM reaches the threshold,
a RAS recovery is scheduled immediately. That recovery fails on purpose,
to tell the user that the GPU is BAD and needs to be retired for further
checking. Alternatively, the user can set a valid, larger threshold at
the next driver probe to skip this check.

A similar threshold check is performed during boot, when the EEPROM is
initialized. If it trips, it may abort driver initialization so that the
user is made aware of the problem.

When the user sets bad_page_threshold=0 at driver probe, the bad page
retirement feature is disabled completely: the driver does not process
bad page records and does not write them to the EEPROM.

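
The policy described above can be summarized in a small standalone
sketch. It is an illustration only, not code from the patches: the enum,
its values, and check_bad_page_threshold() are hypothetical names, and it
assumes that "reaching" the threshold means the saved count is greater
than or equal to it.

```c
/*
 * Illustrative sketch of the threshold policy from the cover letter.
 * All identifiers here are hypothetical; they are not the names used
 * in the amdgpu driver.
 */
enum bad_page_action {
	BAD_PAGE_FEATURE_DISABLED,	/* threshold == 0: records are not handled */
	BAD_PAGE_RETIRE_PAGES,		/* below threshold: reserve pages as usual */
	BAD_PAGE_GPU_IS_BAD		/* threshold reached: report the GPU as BAD */
};

static enum bad_page_action
check_bad_page_threshold(unsigned int saved_bad_pages, unsigned int threshold)
{
	/* A threshold of 0 turns bad page retirement off entirely. */
	if (threshold == 0)
		return BAD_PAGE_FEATURE_DISABLED;

	/* Assumption: "reaches the threshold" is interpreted as >=. */
	if (saved_bad_pages >= threshold)
		return BAD_PAGE_GPU_IS_BAD;

	return BAD_PAGE_RETIRE_PAGES;
}
```

In this sketch the same check would serve both cases in the description:
at runtime after a new record is written to the EEPROM, and at boot when
the saved records are read back.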
Guchun Chen (12):
drm/amdgpu: add bad page count threshold in module parameter
drm/amdgpu: validate bad page threshold in ras
drm/amdgpu: add bad gpu tag definition
drm/amdgpu: break driver init process when it's bad GPU
drm/amdgpu: skip bad page reservation once issuing from eeprom write
drm/amdgpu: schedule ras recovery when reaching bad page threshold
drm/amdgpu: break GPU recovery once it's in bad state
drm/amdgpu: restore ras flags when user resets eeprom
drm/amdgpu: define one macro for RAS's sysfs/debugfs name
drm/amdgpu: decouple sysfs creating of bad page node
drm/amdgpu: disable page reservation when amdgpu_bad_page_threshold = 0
drm/amdgpu: reset eeprom once specifying one bigger threshold
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 32 ++-
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 11 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 186 ++++++++++++++----
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 19 +-
.../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 102 +++++++++-
.../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h | 9 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 5 +-
8 files changed, 312 insertions(+), 53 deletions(-)
--
2.17.1