[PATCH] drm/amdgpu: report bad status in GPU recovery
Lazar, Lijo
lijo.lazar at amd.com
Thu Aug 1 03:54:16 UTC 2024
On 8/1/2024 9:17 AM, Zhou1, Tao wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>> -----Original Message-----
>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>> Sent: Wednesday, July 31, 2024 9:31 PM
>> To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: report bad status in GPU recovery
>>
>>
>>
>> On 7/31/2024 3:35 PM, Tao Zhou wrote:
>>> Instead of printing GPU reset failed.
>>>
>>> Signed-off-by: Tao Zhou <tao.zhou1 at amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
>>> 1 file changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 355c2478c4b6..b7c967779b4b 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -5933,8 +5933,13 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>> tmp_adev->asic_reset_res = 0;
>>>
>>> if (r) {
>>> - /* bad news, how to tell it to userspace ? */
>>> - dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", atomic_read(&tmp_adev->gpu_reset_counter));
>>> + /* bad news, how to tell it to userspace ?
>>> + * for ras error, we should report GPU bad status instead of
>>> + * reset failure
>>> + */
>>> + if (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
>>> + dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
>>> + atomic_read(&tmp_adev->gpu_reset_counter));
>>
>> Better to check reset_context.src == AMDGPU_RESET_SRC_RAS to confirm that
>> the reset is indeed triggered due to ras error.
>
> [Tao] It seems AMDGPU_RESET_SRC_RAS is not used currently; I will set it before using the flag.
>
It's set here -
https://elixir.bootlin.com/linux/v6.11-rc1/source/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c#L2607
Thanks,
Lijo
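
A minimal sketch of the check suggested above, written as a standalone C model rather than driver code. The mock_* enum and struct are hypothetical stand-ins for the driver's reset-source values and reset context (only AMDGPU_RESET_SRC_RAS and the src field are named in this thread), and the boolean argument stands in for the result of amdgpu_ras_eeprom_check_err_threshold(). How the two conditions should be combined is an assumption made for illustration, not the final patch.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's reset-source values.
 * Only the RAS source is named in the thread. */
enum mock_reset_src {
	MOCK_RESET_SRC_UNKNOWN,
	MOCK_RESET_SRC_JOB,
	MOCK_RESET_SRC_RAS,
};

/* Hypothetical, reduced view of the reset context: just the source field. */
struct mock_reset_context {
	enum mock_reset_src src;
};

/*
 * Decide whether the generic "GPU reset failed" message should be printed.
 * ras_threshold_hit models a true result from
 * amdgpu_ras_eeprom_check_err_threshold().
 */
bool should_print_reset_failed(const struct mock_reset_context *ctx,
			       bool ras_threshold_hit)
{
	/* Suppress the message only when the reset was triggered by RAS
	 * and the bad-page threshold has been reached; in every other
	 * case the failure is still reported. */
	return ctx->src != MOCK_RESET_SRC_RAS || !ras_threshold_hit;
}
```

With this shape, a failed reset from any non-RAS source still logs "GPU reset failed" even if the threshold check happens to return true, which is the case the extra source check is meant to cover.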
>>
>> Thanks,
>> Lijo
>>
>>> amdgpu_vf_error_put(tmp_adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
>>> } else {
>>> dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n",
>>> atomic_read(&tmp_adev->gpu_reset_counter));