[PATCH V2 2/2] drm/amdgpu: timely save bad pages to eeprom after gpu ras reset is completed
Zhou1, Tao
Tao.Zhou1 at amd.com
Tue Jul 9 06:12:59 UTC 2024
[AMD Official Use Only - AMD Internal Distribution Only]
The series is:
Reviewed-by: Tao Zhou <tao.zhou1 at amd.com>
> -----Original Message-----
> From: Chai, Thomas <YiPeng.Chai at amd.com>
> Sent: Tuesday, July 9, 2024 1:56 PM
> To: amd-gfx at lists.freedesktop.org
> Cc: Zhang, Hawking <Hawking.Zhang at amd.com>; Zhou1, Tao
> <Tao.Zhou1 at amd.com>; Li, Candice <Candice.Li at amd.com>; Wang, Yang(Kevin)
> <KevinYang.Wang at amd.com>; Yang, Stanley <Stanley.Yang at amd.com>; Chai,
> Thomas <YiPeng.Chai at amd.com>
> Subject: [PATCH V2 2/2] drm/amdgpu: timely save bad pages to eeprom after gpu
> ras reset is completed
>
> The problem case is as follows:
> 1. GPU A triggers a GPU RAS reset and drives GPU B to
> perform a GPU RAS reset as well.
> 2. After GPU B's RAS reset has started, GPU B queries DE
> data. Because the DE data is queried in the RAS reset
> thread rather than the page retirement thread, the bad
> page retirement work is not triggered. Even after all
> GPU resets have completed, the bad pages stay cached in
> RAM and are saved to eeprom only when GPU B's bad page
> retirement work is next triggered.
>
> This patch saves the bad pages to eeprom promptly after the GPU RAS reset
> completes.
>
> v2:
> 1. Add the above description to code comments.
> 2. Reuse existing function.
>
> Signed-off-by: YiPeng Chai <YiPeng.Chai at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 +++++-
> drivers/gpu/drm/amd/amdgpu/umc_v12_0.c | 18 ++++++++++++++++++
> 2 files changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index d923151af752..34226ae010c7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2864,8 +2864,12 @@ static void amdgpu_ras_do_page_retirement(struct work_struct *work)
> struct ras_err_data err_data;
> unsigned long err_cnt;
>
> - if (amdgpu_in_reset(adev) || amdgpu_ras_in_recovery(adev))
> + /* If gpu reset is ongoing, delay retiring the bad pages */
> + if (amdgpu_in_reset(adev) || amdgpu_ras_in_recovery(adev)) {
> + amdgpu_ras_schedule_retirement_dwork(con,
> + AMDGPU_RAS_RETIRE_PAGE_INTERVAL * 3);
> return;
> + }
>
> amdgpu_ras_error_data_init(&err_data);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> index 0faa21d8a7b4..9dbb13adb661 100644
> --- a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> @@ -29,6 +29,7 @@
> #include "mp/mp_13_0_6_sh_mask.h"
>
> #define MAX_ECC_NUM_PER_RETIREMENT 32
> +#define DELAYED_TIME_FOR_GPU_RESET 1000 //ms
>
> static inline uint64_t get_umc_v12_0_reg_offset(struct amdgpu_device *adev,
> uint32_t node_inst,
> @@ -568,6 +569,23 @@ static int umc_v12_0_update_ecc_status(struct amdgpu_device *adev,
>
> con->umc_ecc_log.de_queried_count++;
>
> + /* The problem case is as follows:
> + * 1. GPU A triggers a gpu ras reset, and GPU A drives
> + * GPU B to also perform a gpu ras reset.
> + * 2. After gpu B ras reset started, gpu B queried a DE
> + * data. Since the DE data was queried in the ras reset
> + * thread instead of the page retirement thread, bad
> + * page retirement work would not be triggered. Then
> + * even if all gpu resets are completed, the bad pages
> + * will be cached in RAM until GPU B's bad page retirement
> + * work is triggered again and then saved to eeprom.
> + * Trigger delayed work to save the bad pages to eeprom in time
> + * after gpu ras reset is completed.
> + */
> + if (amdgpu_ras_in_recovery(adev))
> + schedule_delayed_work(&con->page_retirement_dwork,
> + msecs_to_jiffies(DELAYED_TIME_FOR_GPU_RESET));
> +
> return 0;
> }
>
> --
> 2.34.1
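
The rescheduling behaviour described in the patch can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not driver code: every sim_* name is hypothetical, the 1000 ms delay mirrors DELAYED_TIME_FOR_GPU_RESET from the patch, and the 300 ms retry interval is an assumed stand-in for AMDGPU_RAS_RETIRE_PAGE_INTERVAL * 3, whose value is not shown in the patch. Time advances in 100 ms steps instead of using a real workqueue.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical constants; only the first value appears in the patch. */
#define SIM_RESET_DELAY_MS 1000 /* mirrors DELAYED_TIME_FOR_GPU_RESET */
#define SIM_RETRY_DELAY_MS  300 /* ASSUMED value of the retry interval */
#define SIM_STEP_MS         100 /* granularity of the simulated clock */

/* Hypothetical stand-in for per-GPU RAS state; not an amdgpu structure. */
struct sim_gpu {
	bool in_reset;        /* models "reset or RAS recovery ongoing" */
	int cached_bad_pages; /* bad pages held in RAM only */
	int eeprom_bad_pages; /* bad pages persisted to eeprom */
	bool work_pending;    /* delayed retirement work is queued */
	long work_due_ms;     /* simulated time at which the work runs */
};

/* Queue the retirement work; a no-op if it is already queued. */
static bool sim_schedule(struct sim_gpu *g, long now_ms, long delay_ms)
{
	if (g->work_pending)
		return false;
	g->work_pending = true;
	g->work_due_ms = now_ms + delay_ms;
	return true;
}

/* Error query path: cache the bad page and, during recovery, arm the work. */
static void sim_log_bad_page(struct sim_gpu *g, long now_ms)
{
	g->cached_bad_pages++;
	if (g->in_reset)
		sim_schedule(g, now_ms, SIM_RESET_DELAY_MS);
}

/* Retirement work body: retry later while the reset is still ongoing. */
static void sim_retire_work(struct sim_gpu *g, long now_ms)
{
	if (g->in_reset) {
		sim_schedule(g, now_ms, SIM_RETRY_DELAY_MS);
		return;
	}
	g->eeprom_bad_pages += g->cached_bad_pages;
	g->cached_bad_pages = 0;
}

/* Run the queued work once its due time has been reached. */
static void sim_tick(struct sim_gpu *g, long now_ms)
{
	if (g->work_pending && now_ms >= g->work_due_ms) {
		g->work_pending = false;
		sim_retire_work(g, now_ms);
	}
}

/*
 * A bad page is found at t = 0 while the GPU is in reset; the reset ends
 * at reset_ms. Returns the simulated time at which the page reaches
 * eeprom, or -1 if that does not happen within limit_ms.
 */
long sim_run(long reset_ms, long limit_ms)
{
	struct sim_gpu g = { .in_reset = true };

	assert(reset_ms >= 0 && limit_ms >= 0);
	sim_log_bad_page(&g, 0);
	for (long t = 0; t <= limit_ms; t += SIM_STEP_MS) {
		if (t >= reset_ms)
			g.in_reset = false;
		sim_tick(&g, t);
		if (g.eeprom_bad_pages > 0)
			return t;
	}
	return -1;
}
```

Under these assumptions, sim_run(500, 10000) returns 1000 because the reset has already ended when the first attempt runs, while sim_run(2600, 10000) returns 2800 because the work keeps re-arming itself every 300 ms until the reset has finished. In both cases the cached page is written out shortly after the reset ends rather than waiting for an unrelated later trigger.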