[PATCH] drm/amdkfd: print unmap queue status for RAS poison consumption (v2)

Felix Kuehling felix.kuehling at amd.com
Tue Mar 22 14:05:09 UTC 2022


On 2022-03-21 at 23:17, Zhou1, Tao wrote:
>> -----Original Message-----
>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>> Sent: Monday, March 21, 2022 7:21 PM
>> To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org; Zhang,
>> Hawking <Hawking.Zhang at amd.com>; Kuehling, Felix
>> <Felix.Kuehling at amd.com>; Yang, Stanley <Stanley.Yang at amd.com>; Chai,
>> Thomas <YiPeng.Chai at amd.com>
>> Subject: Re: [PATCH] drm/amdkfd: print unmap queue status for RAS poison
>> consumption (v2)
>>
>>
>>
>> On 3/21/2022 3:08 PM, Tao Zhou wrote:
>>> Print the status when the unmap queue flow passes, and also tell the
>>> user that a gpu reset is triggered when we fall back to the legacy way.
>>>
>>> v2: make the message more explicit.
>>>
>>> Signed-off-by: Tao Zhou <tao.zhou1 at amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c | 11 +++++++----
>>>    1 file changed, 7 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
>>> index 56902b5bb7b6..32c451f21db7 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
>>> @@ -105,8 +105,6 @@ static void event_interrupt_poison_consumption(struct kfd_dev *dev,
>>>    	if (old_poison)
>>>    		return;
>>>
>>> -	pr_warn("RAS poison consumption handling: client id %d\n", client_id);
>>> -
>>>    	switch (client_id) {
>>>    	case SOC15_IH_CLIENTID_SE0SH:
>>>    	case SOC15_IH_CLIENTID_SE1SH:
>>> @@ -130,10 +128,15 @@ static void event_interrupt_poison_consumption(struct kfd_dev *dev,
>>>    	/* resetting queue passes, do page retirement without gpu reset
>>>    	 * resetting queue fails, fallback to gpu reset solution
>>>    	 */
>>> -	if (!ret)
>>> +	if (!ret) {
>>> +		pr_warn("RAS poison consumption, unmap queue flow succeeds: client id %d\n",
>>> +				client_id);
>> As discussed in another patch, I understand that pr_* is the legacy usage in the
>> file. But it won't be helpful in this case with multiple devices. I would suggest
>> changing to dev_info(): the message here and below seems informational about
>> the handling of this situation rather than a warning of something bad.
>>
>> Thanks,
>> Lijo
> [Tao] I'll replace pr_warn with dev_info. I think we need a dedicated cleanup to retire all pr_* format messages in amdgpu.
> RAS poison consumption is a special event that should be paid attention to, so I think a warning is also reasonable.

Or you could make the "unmap success" case a dev_info and the "gpu 
reset" case a dev_warn.
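
A minimal sketch of that split, written as a userspace mock so it compiles
outside the kernel. The mock_device struct, the macro stand-ins for
dev_info()/dev_warn(), and the helper name report_poison_consumption() are
assumptions for illustration only, not code from the patch; in the driver the
real dev_info()/dev_warn() would take the device's struct device pointer.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for struct device, only so this builds in userspace. */
struct mock_device {
	const char *name;
};

/* Userspace imitations of the kernel's dev_info()/dev_warn() (assumption:
 * the real macros prefix the message with the device identity). */
#define dev_info(dev, fmt, ...) \
	fprintf(stderr, "%s: info: " fmt, (dev)->name, ##__VA_ARGS__)
#define dev_warn(dev, fmt, ...) \
	fprintf(stderr, "%s: warn: " fmt, (dev)->name, ##__VA_ARGS__)

/*
 * ret == 0: queue unmap succeeded, informational message, no reset.
 * ret != 0: fall back to gpu reset, so log at warning level.
 * Returns 1 when a gpu reset would be requested, 0 otherwise.
 */
static int report_poison_consumption(const struct mock_device *dev,
				     int ret, int client_id)
{
	assert(dev && dev->name);

	if (!ret) {
		dev_info(dev,
			 "RAS poison consumption, unmap queue flow succeeds: client id %d\n",
			 client_id);
		return 0;
	}

	dev_warn(dev,
		 "RAS poison consumption, fallback to gpu reset flow: client id %d\n",
		 client_id);
	return 1;
}
```

The return value mirrors the false/true argument the patch passes to
amdgpu_amdkfd_ras_poison_consumption_handler(), and the per-device prefix is
what makes the message useful on systems with multiple devices.
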

Either way, v3 of your patch looks good to me and is

Acked-by: Felix Kuehling <Felix.Kuehling at amd.com>

Regards,
   Felix


>
>>>    		amdgpu_amdkfd_ras_poison_consumption_handler(dev->adev, false);
>>> -	else
>>> +	} else {
>>> +		pr_warn("RAS poison consumption, fallback to gpu reset flow: client id %d\n",
>>> +				client_id);
>>>    		amdgpu_amdkfd_ras_poison_consumption_handler(dev->adev, true);
>>> +	}
>>>    }
>>>
>>>    static bool event_interrupt_isr_v9(struct kfd_dev *dev,
>>>

