[PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to complete

Wed Jun 19 03:04:22 UTC 2024

[AMD Official Use Only - AMD Internal Distribution Only]

-----------------
Best Regards,
Thomas

-----Original Message-----
From: Lazar, Lijo <Lijo.Lazar at amd.com>
Sent: Tuesday, June 18, 2024 8:00 PM
To: Chai, Thomas <YiPeng.Chai at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Zhang, Hawking <Hawking.Zhang at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Li, Candice <Candice.Li at amd.com>; Wang, Yang(Kevin) <KevinYang.Wang at amd.com>; Yang, Stanley <Stanley.Yang at amd.com>
Subject: Re: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to complete


On 6/18/2024 4:51 PM, Chai, Thomas wrote:
>
> -----Original Message-----
> From: Chai, Thomas
> Sent: Tuesday, June 18, 2024 7:09 PM
> To: Lazar, Lijo <Lijo.Lazar at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Zhang, Hawking <Hawking.Zhang at amd.com>; Zhou1, Tao
> <Tao.Zhou1 at amd.com>; Li, Candice <Candice.Li at amd.com>; Wang,
> Yang(Kevin) <KevinYang.Wang at amd.com>; Yang, Stanley
> <Stanley.Yang at amd.com>
> Subject: RE: [PATCH 4/5] drm/amdgpu: add completion to wait for ras
> reset to complete
>
> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar at amd.com>
> Sent: Tuesday, June 18, 2024 6:09 PM
> To: Chai, Thomas <YiPeng.Chai at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Zhang, Hawking <Hawking.Zhang at amd.com>; Zhou1, Tao
> <Tao.Zhou1 at amd.com>; Li, Candice <Candice.Li at amd.com>; Wang,
> Yang(Kevin) <KevinYang.Wang at amd.com>; Yang, Stanley
> <Stanley.Yang at amd.com>
> Subject: Re: [PATCH 4/5] drm/amdgpu: add completion to wait for ras
> reset to complete
>
>
>
> On 6/18/2024 12:03 PM, YiPeng Chai wrote:
>> Add completion to wait for ras reset to complete.
>>
>> Signed-off-by: YiPeng Chai <YiPeng.Chai at amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 11 +++++++++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  1 +
>>  2 files changed, 12 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> index 898889600771..7f8e6ca07957 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> @@ -124,6 +124,8 @@ const char *get_ras_block_str(struct ras_common_if *ras_block)
>>
>>  #define AMDGPU_RAS_RETIRE_PAGE_INTERVAL 100  //ms
>>
>> +#define MAX_RAS_RECOVERY_COMPLETION_TIME  120000 //ms
>> +
>>  enum amdgpu_ras_retire_page_reservation {
>>       AMDGPU_RAS_RETIRE_PAGE_RESERVED,
>>       AMDGPU_RAS_RETIRE_PAGE_PENDING,
>> @@ -2518,6 +2520,8 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
>>               atomic_set(&hive->ras_recovery, 0);
>>               amdgpu_put_xgmi_hive(hive);
>>       }
>> +
>> +     complete_all(&ras->ras_recovery_completion);
>>  }
>>
>>  /* alloc/realloc bps array */
>> @@ -2911,10 +2915,16 @@ static int amdgpu_ras_poison_consumption_handler(struct amdgpu_device *adev,
>>
>>               flush_delayed_work(&con->page_retirement_dwork);
>>
>> +             reinit_completion(&con->ras_recovery_completion);
>> +
>>               con->gpu_reset_flags |= reset;
>>               amdgpu_ras_reset_gpu(adev);
>>
>>               *gpu_reset = reset;
>> +             if (!wait_for_completion_timeout(&con->ras_recovery_completion,
>> +                             msecs_to_jiffies(MAX_RAS_RECOVERY_COMPLETION_TIME)))
>> +                     dev_err(adev->dev, "Waiting for GPU to complete ras reset timeout! reset:0x%x\n",
>> +                             reset);
>
>> If a mode-1 reset gets to execute first, due to job timeout or HWS detect cases in poison timeout, then the RAS handler will never get executed.
>> Why is this wait required?
>
>> Thanks,
>> Lijo
>
> [Thomas] "[PATCH 5/5] drm/amdgpu: add gpu reset check and exception handling" adds the check before the RAS GPU reset.
>          A poison RAS reset is different from a reset triggered by other fatal errors, and all poison RAS resets are triggered from here.
>          To distinguish it from other GPU resets and to simplify the subsequent code, the wait for the RAS GPU reset is added here.
>

> The reset mechanism resets the GPU state, whether it is triggered by poison or by fatal errors. As soon as the device is reset successfully, GPU operations can continue.

> So why does there need to be a special wait for a poison-triggered reset alone?
[Thomas] Different applications may randomly trigger poison errors before the GPU reset.
         Since the poison GPU reset is triggered asynchronously, new poison consumption interrupts may occur in the period after the GPU reset request is sent and before the GPU reset is actually performed.
         To avoid performing another poison GPU reset after the current one completes, the handler needs to wait here for the GPU to complete the reset and then clear the cached poison consumption messages.

> Why not wait on the RAS recovery work object rather than another completion notification?
[Thomas] Yes, waiting on the RAS recovery work object is a good idea. I will do it.
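
The alternative agreed on here, waiting on the work item itself instead of a separate completion, can be illustrated with a user-space analogy. This is a sketch under stated assumptions, not the follow-up patch: a pthread stands in for the recovery work item, pthread_join() stands in for whatever wait the driver ends up using (in the kernel this would presumably be something like flush_work() on the recovery work, but the thread does not show that code), and struct recovery_model, recovery_fn and handle_poison are hypothetical names.

```c
/* User-space analogy only; not kernel or amdgpu code.
 * All names below are hypothetical. */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct recovery_model {
	int cached_poison;	/* stands in for queued poison messages */
	bool reset_done;	/* written by the worker, read after join */
	pthread_t worker;	/* stands in for the recovery work item */
};

/* Stand-in for the recovery work function: performs the "reset". */
static void *recovery_fn(void *arg)
{
	struct recovery_model *m = arg;

	m->reset_done = true;
	return NULL;
}

/* Returns true when the reset ran; drops messages cached meanwhile. */
static bool handle_poison(struct recovery_model *m)
{
	assert(m != NULL);

	m->reset_done = false;
	if (pthread_create(&m->worker, NULL, recovery_fn, m) != 0)
		return false;

	/* Pretend two more poison interrupts arrive while the reset is pending. */
	m->cached_poison += 2;

	/* Waiting on the worker itself: no extra completion object is needed. */
	pthread_join(m->worker, NULL);

	/* The reset already covered these, so clear them instead of resetting again. */
	m->cached_poison = 0;
	return m->reset_done;
}
```

Because the join only returns after the worker has finished, the handler needs no separate "done" notification, which is the simplification suggested in the review.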


Thanks,
Lijo

>>       }
>>
>>       return 0;
>> @@ -3041,6 +3051,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
>>               }
>>       }
>>
>> +     init_completion(&con->ras_recovery_completion);
>>       mutex_init(&con->page_rsv_lock);
>>       INIT_KFIFO(con->poison_fifo);
>>       mutex_init(&con->page_retirement_lock);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> index 91daf48be03a..b47f03edac87 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> @@ -537,6 +537,7 @@ struct amdgpu_ras {
>>       DECLARE_KFIFO(poison_fifo, struct ras_poison_msg, 128);
>>       struct ras_ecc_log_info  umc_ecc_log;
>>       struct delayed_work page_retirement_dwork;
>> +     struct completion ras_recovery_completion;
>>
>>       /* Fatal error detected flag */
>>       atomic_t fed;