[PATCH 0/9] BAD GPU retirement policy by total bad pages
Guchun Chen
guchun.chen at amd.com
Thu Jul 23 08:33:37 UTC 2020
This series enables the GPU retirement feature, which is triggered
when the number of bad pages detected by RAS ECC exceeds a threshold.
When the number of bad pages saved to the EEPROM reaches the threshold,
a RAS recovery is issued immediately. That recovery fails deliberately,
to tell the user that the GPU is BAD and needs to be retired for
further checks.
During boot-up, a similar BAD GPU check is performed when the
EEPROM is initialized; if the GPU is BAD, boot-up is aborted so that
the user is made aware of it.
Users can set bad_page_threshold=0 when loading the amdgpu driver to
disable this feature and bring up the GPU as usual.
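
The policy described above can be sketched as a small stand-alone
model. This is only an illustration, not code from the series: the
struct ras_policy type and the gpu_should_retire() helper are
hypothetical names, and the sketch assumes that "reaching" the
threshold means the saved count is greater than or equal to it.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the retirement policy. The names here are
 * illustrative only and do not come from the amdgpu driver.
 */
struct ras_policy {
	/* Assumed semantics: 0 disables the check, as with bad_page_threshold=0. */
	unsigned int bad_page_threshold;
};

/*
 * Return true when the GPU would be treated as BAD, i.e. when the
 * feature is enabled and the number of bad pages saved to the EEPROM
 * has reached the threshold (assumed to mean >=).
 */
static bool gpu_should_retire(const struct ras_policy *policy,
			      unsigned int saved_bad_pages)
{
	if (policy->bad_page_threshold == 0)
		return false;	/* feature disabled, bring up GPU as usual */

	return saved_bad_pages >= policy->bad_page_threshold;
}
```

In this model the same predicate would serve both the runtime path
(after a new bad page is written to the EEPROM) and the boot-time
path (after the EEPROM table is loaded).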
Guchun Chen (9):
  drm/amdgpu: add bad page count threshold in module parameter
  drm/amdgpu: validate bad page threshold in ras
  drm/amdgpu: add bad gpu tag definition
  drm/amdgpu: break driver init process when it's bad GPU
  drm/amdgpu: skip bad page reservation once issuing from eeprom write
  drm/amdgpu: schedule ras recovery when reaching bad page threshold
  drm/amdgpu: break GPU recovery once it's bad
  drm/amdgpu: restore ras flags when user resets eeprom
  drm/amdgpu: calculate actual size instead of hardcode size
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  29 ++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  11 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    |  77 ++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  19 ++-
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 109 ++++++++++++++++--
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h |   8 +-
 7 files changed, 230 insertions(+), 24 deletions(-)
--
2.17.1