[PATCH] drm/amdgpu: report bad status in GPU recovery

Lazar, Lijo lijo.lazar at amd.com
Thu Aug 1 03:54:16 UTC 2024



On 8/1/2024 9:17 AM, Zhou1, Tao wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
>> -----Original Message-----
>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>> Sent: Wednesday, July 31, 2024 9:31 PM
>> To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: report bad status in GPU recovery
>>
>>
>>
>> On 7/31/2024 3:35 PM, Tao Zhou wrote:
>>> For RAS errors, report GPU bad status instead of printing "GPU reset failed".
>>>
>>> Signed-off-by: Tao Zhou <tao.zhou1 at amd.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
>>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 355c2478c4b6..b7c967779b4b 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -5933,8 +5933,13 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>             tmp_adev->asic_reset_res = 0;
>>>
>>>             if (r) {
>>> -                   /* bad news, how to tell it to userspace ? */
>>> -                   dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", atomic_read(&tmp_adev->gpu_reset_counter));
>>> +                   /* bad news, how to tell it to userspace ?
>>> +                    * for ras error, we should report GPU bad status instead of
>>> +                    * reset failure
>>> +                    */
>>> +                   if (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
>>> +                           dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
>>> +                                   atomic_read(&tmp_adev->gpu_reset_counter));
>>
>> Better to check reset_context.src == AMDGPU_RESET_SRC_RAS to confirm that
>> the reset is indeed triggered due to ras error.
> 
> [Tao] It seems AMDGPU_RESET_SRC_RAS is not currently used; I will set it before using the flag.
> 

It's set here -
https://elixir.bootlin.com/linux/v6.11-rc1/source/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c#L2607
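
To illustrate the check suggested above, here is a minimal userspace sketch, not the actual driver change. All names in it (sketch_reset_src, sketch_reset_context, sketch_device, sketch_should_log_reset_failure) are hypothetical stand-ins for the kernel's reset context and the result of amdgpu_ras_eeprom_check_err_threshold(); it assumes the "GPU reset failed" message should be skipped only when the reset source is RAS and the bad status has already been reported.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's reset source enum. */
enum sketch_reset_src {
	SKETCH_RESET_SRC_UNKNOWN,
	SKETCH_RESET_SRC_JOB,
	SKETCH_RESET_SRC_RAS,
};

/* Hypothetical stand-in for the reset context; only the source is modelled. */
struct sketch_reset_context {
	enum sketch_reset_src src;
};

/*
 * Hypothetical stand-in for the device. The flag models the return value of
 * the EEPROM error-threshold check, assumed to be true once the GPU has been
 * reported as being in a bad state.
 */
struct sketch_device {
	bool err_threshold_reached;
};

/*
 * Returns true when the generic "GPU reset failed" message should still be
 * printed. It is suppressed only when the reset was triggered by RAS and the
 * error threshold has been reached.
 */
bool sketch_should_log_reset_failure(const struct sketch_reset_context *ctx,
				     const struct sketch_device *dev)
{
	bool ras_bad_status = ctx->src == SKETCH_RESET_SRC_RAS &&
			      dev->err_threshold_reached;

	return !ras_bad_status;
}
```

With this shape, a failed reset from a job timeout is still logged as a reset failure even if the threshold happens to be exceeded.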

Thanks,
Lijo

>>
>> Thanks,
>> Lijo
>>
>>>                     amdgpu_vf_error_put(tmp_adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
>>>             } else {
>>>                     dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));

