[PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to complete
Chai, Thomas
YiPeng.Chai at amd.com
Tue Jun 18 11:21:40 UTC 2024
[AMD Official Use Only - AMD Internal Distribution Only]
-----------------
Best Regards,
Thomas
-----Original Message-----
From: Chai, Thomas
Sent: Tuesday, June 18, 2024 7:09 PM
To: Lazar, Lijo <Lijo.Lazar at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Zhang, Hawking <Hawking.Zhang at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Li, Candice <Candice.Li at amd.com>; Wang, Yang(Kevin) <KevinYang.Wang at amd.com>; Yang, Stanley <Stanley.Yang at amd.com>
Subject: RE: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to complete
-----------------
Best Regards,
Thomas
-----Original Message-----
From: Lazar, Lijo <Lijo.Lazar at amd.com>
Sent: Tuesday, June 18, 2024 6:09 PM
To: Chai, Thomas <YiPeng.Chai at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Zhang, Hawking <Hawking.Zhang at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Li, Candice <Candice.Li at amd.com>; Wang, Yang(Kevin) <KevinYang.Wang at amd.com>; Yang, Stanley <Stanley.Yang at amd.com>
Subject: Re: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to complete
On 6/18/2024 12:03 PM, YiPeng Chai wrote:
> Add completion to wait for ras reset to complete.
>
> Signed-off-by: YiPeng Chai <YiPeng.Chai at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 11 +++++++++++
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 1 +
> 2 files changed, 12 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 898889600771..7f8e6ca07957 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -124,6 +124,8 @@ const char *get_ras_block_str(struct ras_common_if
> *ras_block)
>
> #define AMDGPU_RAS_RETIRE_PAGE_INTERVAL 100 //ms
>
> +#define MAX_RAS_RECOVERY_COMPLETION_TIME 120000 //ms
> +
> enum amdgpu_ras_retire_page_reservation {
> AMDGPU_RAS_RETIRE_PAGE_RESERVED,
> AMDGPU_RAS_RETIRE_PAGE_PENDING,
> @@ -2518,6 +2520,8 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
> atomic_set(&hive->ras_recovery, 0);
> amdgpu_put_xgmi_hive(hive);
> }
> +
> + complete_all(&ras->ras_recovery_completion);
> }
>
> /* alloc/realloc bps array */
> @@ -2911,10 +2915,16 @@ static int
> amdgpu_ras_poison_consumption_handler(struct amdgpu_device *adev,
>
> flush_delayed_work(&con->page_retirement_dwork);
>
> + reinit_completion(&con->ras_recovery_completion);
> +
> con->gpu_reset_flags |= reset;
> amdgpu_ras_reset_gpu(adev);
>
> *gpu_reset = reset;
> + if (!wait_for_completion_timeout(&con->ras_recovery_completion,
> + msecs_to_jiffies(MAX_RAS_RECOVERY_COMPLETION_TIME)))
> + dev_err(adev->dev, "Waiting for GPU to complete ras reset timeout! reset:0x%x\n",
> + reset);
> If a mode-1 reset gets to execute first due to job timeout/hws detect cases in poison timeout, then the ras handler will never get executed.
> Why this wait is required?
>Thanks,
>Lijo
[Thomas] "[PATCH 5/5] drm/amdgpu: add gpu reset check and exception handling" add the check before ras gpu reset.
Poison ras reset is different from reset triggered by other fatal errors, and all poison RAS resets are triggered from here,
in order to distinguish other gpu resets and facilitate subsequent code processing, so add wait for gpu ras reset here.
> }
>
> return 0;
> @@ -3041,6 +3051,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
> }
> }
>
> + init_completion(&con->ras_recovery_completion);
> mutex_init(&con->page_rsv_lock);
> INIT_KFIFO(con->poison_fifo);
> mutex_init(&con->page_retirement_lock);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 91daf48be03a..b47f03edac87 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -537,6 +537,7 @@ struct amdgpu_ras {
> DECLARE_KFIFO(poison_fifo, struct ras_poison_msg, 128);
> struct ras_ecc_log_info umc_ecc_log;
> struct delayed_work page_retirement_dwork;
> + struct completion ras_recovery_completion;
>
> /* Fatal error detected flag */
> atomic_t fed;
More information about the amd-gfx
mailing list