[PATCH 2/3] drm/amdgpu: add RAS poison consumption handler
Lazar, Lijo
lijo.lazar at amd.com
Wed Apr 20 06:48:29 UTC 2022
On 4/20/2022 11:51 AM, Zhou1, Tao wrote:
> [AMD Official Use Only]
>
>
>
>> -----Original Message-----
>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>> Sent: Wednesday, April 20, 2022 12:33 PM
>> To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org; Zhang,
>> Hawking <Hawking.Zhang at amd.com>; Yang, Stanley
>> <Stanley.Yang at amd.com>; Ziya, Mohammad zafar
>> <Mohammadzafar.Ziya at amd.com>; Chai, Thomas <YiPeng.Chai at amd.com>
>> Subject: Re: [PATCH 2/3] drm/amdgpu: add RAS poison consumption handler
>>
>>
>>
>> On 4/20/2022 9:23 AM, Tao Zhou wrote:
>>> Add support for general RAS poison consumption handler.
>>>
>>> Signed-off-by: Tao Zhou <tao.zhou1 at amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 43
>> ++++++++++++++++++++++++-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 1 +
>>> 2 files changed, 43 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> index 2d066cff70ea..4bd3c25be809 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> @@ -1515,6 +1515,44 @@ static int amdgpu_ras_fs_fini(struct
>> amdgpu_device *adev)
>>> /* ras fs end */
>>>
>>> /* ih begin */
>>> +static void ras_interrupt_poison_consumption_handler(struct ras_manager
>> *obj,
>>> + struct amdgpu_iv_entry *entry)
>>> +{
>>> + bool poison_stat = true, need_reset = true;
>>> + struct amdgpu_device *adev = obj->adev;
>>> + struct ras_err_data err_data = {0, 0, 0, NULL};
>>> + struct ras_ih_data *data = &obj->ih_data;
>>> + struct amdgpu_ras_block_object *block_obj =
>>> + amdgpu_ras_get_ras_block(adev, obj->head.block, 0);
>>> +
>>> + if (!adev->gmc.xgmi.connected_to_cpu)
>>> + amdgpu_umc_poison_handler(adev, &err_data, false);
>>> +
>>> + /* Two ways for RAS block's poison consumption implementation:
>>> + * 1. define IH callback, or
>>> + * 2. implement query and consumption interfaces.
>>> + */
>>
>> This doesn't look appropriate. Better to have one standard way. What if an IP
>> has call back implemented to handle errors in non-poison mode?
>
> [Tao] approach 2 is standard way, but some RAS blocks may have different requirements for RAS consumption handler in the future, a callback function is more convenient for this situation.
>
Since it's a future outlook, why can't we enforce 2? i.e., any IP will
need to implement consumption handler for poison consumption interrupts.
Thanks,
Lijo
>>
>>> + if (data->cb) {
>>> + need_reset = !!data->cb(obj->adev, &err_data, entry);
>>> + } else if (block_obj && block_obj->hw_ops) {
>>> + if (block_obj->hw_ops->query_poison_status) {
>>> + poison_stat = block_obj->hw_ops-
>>> query_poison_status(adev);
>>> + if (!poison_stat)
>>> + dev_info(adev->dev, "No RAS poison status
>> in %s poison IH.\n",
>>> + block_obj->ras_comm.name);
>>> + }
>>> +
>>> + if (poison_stat && block_obj->hw_ops-
>>> handle_poison_consumption) {
>>> + poison_stat = block_obj->hw_ops-
>>> handle_poison_consumption(adev);
>>> + need_reset = poison_stat;
>>> + }
>>> + }
>>> +
>>> + /* gpu reset is fallback for all failed cases */
>>> + if (need_reset)
>>> + amdgpu_ras_reset_gpu(adev);
>>> +}
>>> +
>>> static void ras_interrupt_poison_creation_handler(struct ras_manager *obj,
>>> struct amdgpu_iv_entry *entry)
>>> {
>>> @@ -1563,7 +1601,10 @@ static void amdgpu_ras_interrupt_handler(struct
>> ras_manager *obj)
>>> data->rptr = (data->aligned_element_size +
>>> data->rptr) % data->ring_size;
>>>
>>> - ras_interrupt_poison_creation_handler(obj, &entry);
>>> + if (obj->head.block == AMDGPU_RAS_BLOCK__UMC)
>>> + ras_interrupt_poison_creation_handler(obj, &entry);
>>> + else
>>> + ras_interrupt_poison_consumption_handler(obj,
>> &entry);
>>
>> Same problem - poison mode is implicitly assumed. Poison consumption is
>> relevant only in poison mode.
>
> [Tao] will fix it.
>
>>
>> Thanks,
>> Lijo
>>
>>> }
>>> }
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> index 606df8869b89..c4b61785ab5c 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> @@ -509,6 +509,7 @@ struct amdgpu_ras_block_hw_ops {
>>> void (*reset_ras_error_count)(struct amdgpu_device *adev);
>>> void (*reset_ras_error_status)(struct amdgpu_device *adev);
>>> bool (*query_poison_status)(struct amdgpu_device *adev);
>>> + bool (*handle_poison_consumption)(struct amdgpu_device *adev);
>>> };
>>>
>>> /* work flow
>>>
More information about the amd-gfx
mailing list