[PATCH] drm/amdgpu: report bad status in GPU recovery
Zhou1, Tao
Tao.Zhou1 at amd.com
Thu Aug 1 07:00:14 UTC 2024
[AMD Official Use Only - AMD Internal Distribution Only]
We still need to perform the GPU reset for the HW; only the reset flow should be made to fail from the driver's perspective.
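
A minimal user-space sketch of that split, not the actual driver code. Every name here (fake_dev, fake_check_err_threshold, fake_gpu_recover, the bad-page fields and the message text) is a hypothetical stand-in; the real threshold check is amdgpu_ras_eeprom_check_err_threshold() and its internals are not modelled. The sketch only shows the hardware reset always running while the reported result and message depend on the bad-status check:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for per-device state; not struct amdgpu_device. */
struct fake_dev {
	int  reset_counter;       /* resets performed so far */
	int  bad_page_count;      /* assumed bad-page counter */
	int  bad_page_threshold;  /* assumed limit before the GPU is "bad" */
	bool hw_reset_done;       /* set once the hardware reset has run */
};

/* Assumed model of the threshold check: true means "report bad status". */
static bool fake_check_err_threshold(const struct fake_dev *d)
{
	return d->bad_page_count >= d->bad_page_threshold;
}

/*
 * The hardware reset is always performed. Only the result reported by
 * the driver changes: a device past the threshold fails the flow and
 * gets a bad-status message instead of "GPU reset failed".
 */
static int fake_gpu_recover(struct fake_dev *d, int hw_reset_result,
			    char *msg, size_t len)
{
	int r = hw_reset_result;
	bool bad = fake_check_err_threshold(d);

	d->hw_reset_done = true;
	d->reset_counter++;

	if (bad && r == 0)
		r = -1; /* fail the flow from the driver's side only */

	if (r) {
		if (bad)
			snprintf(msg, len, "GPU bad status: %d bad pages (threshold %d)",
				 d->bad_page_count, d->bad_page_threshold);
		else
			snprintf(msg, len, "GPU reset(%d) failed", d->reset_counter);
	} else {
		snprintf(msg, len, "GPU reset(%d) succeeded!", d->reset_counter);
	}

	return r;
}
```

Calling fake_gpu_recover() on a device below the threshold with a successful hardware reset returns 0; the same call on a device at or above the threshold still marks hw_reset_done but returns a non-zero result with the bad-status message.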
Tao
> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar at amd.com>
> Sent: Thursday, August 1, 2024 2:41 PM
> To: Zhou1, Tao <Tao.Zhou1 at amd.com>; Zhang, Hawking
> <Hawking.Zhang at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: report bad status in GPU recovery
>
>
>
> On 8/1/2024 11:28 AM, Zhou1, Tao wrote:
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> > Yes, the bad status message is printed twice with this patch. I think it's
> > harmless, and the second message is more convenient for customers.
> >
> > I can add a parameter to amdgpu_ras_eeprom_check_err_threshold to disable
> > the first message if you think printing the message twice is not a good idea.
> >
>
> Instead of doing it this way, can't this be added to amdgpu_ras_do_recovery()
> to stop all recovery actions?
>
> Thanks,
> Lijo
>
> > Tao
> >
> >> -----Original Message-----
> >> From: Zhang, Hawking <Hawking.Zhang at amd.com>
> >> Sent: Thursday, August 1, 2024 1:30 PM
> >> To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org
> >> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
> >>
> >> [AMD Official Use Only - AMD Internal Distribution Only]
> >>
> >> Right, it's functional. My concern is whether the kernel message in
> >> amdgpu_ras_eeprom_check_err_threshold will be printed twice. This is
> >> the end of GPU recovery (i.e., reporting that the GPU reset failed or
> >> succeeded). The check_err_threshold call was already done before reaching here.
> >>
> >> Regards,
> >> Hawking
> >>
> >> -----Original Message-----
> >> From: Zhou1, Tao <Tao.Zhou1 at amd.com>
> >> Sent: Thursday, August 1, 2024 11:49
> >> To: Zhang, Hawking <Hawking.Zhang at amd.com>;
> >> amd-gfx at lists.freedesktop.org
> >> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
> >>
> >> [AMD Official Use Only - AMD Internal Distribution Only]
> >>
> >> I think the if condition in amdgpu_ras_eeprom_check_err_threshold is
> >> good enough, no need to update it with is_rma.
> >>
> >> Tao
> >>
> >>> -----Original Message-----
> >>> From: Zhang, Hawking <Hawking.Zhang at amd.com>
> >>> Sent: Thursday, August 1, 2024 11:00 AM
> >>> To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org
> >>> Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
> >>> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
> >>>
> >>> [AMD Official Use Only - AMD Internal Distribution Only]
> >>>
> >>> Might consider leveraging the is_RMA flag for the same purpose?
> >>>
> >>> Regards,
> >>> Hawking
> >>>
> >>> -----Original Message-----
> >>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
> >>> Tao Zhou
> >>> Sent: Wednesday, July 31, 2024 18:05
> >>> To: amd-gfx at lists.freedesktop.org
> >>> Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
> >>> Subject: [PATCH] drm/amdgpu: report bad status in GPU recovery
> >>>
> >>> Instead of printing GPU reset failed.
> >>>
> >>> Signed-off-by: Tao Zhou <tao.zhou1 at amd.com>
> >>> ---
> >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
> >>> 1 file changed, 7 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> index 355c2478c4b6..b7c967779b4b 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> @@ -5933,8 +5933,13 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >>>  		tmp_adev->asic_reset_res = 0;
> >>>
> >>>  		if (r) {
> >>> -			/* bad news, how to tell it to userspace ? */
> >>> -			dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", atomic_read(&tmp_adev->gpu_reset_counter));
> >>> +			/* bad news, how to tell it to userspace ?
> >>> +			 * for ras error, we should report GPU bad status instead of
> >>> +			 * reset failure
> >>> +			 */
> >>> +			if (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
> >>> +				dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
> >>> +					 atomic_read(&tmp_adev->gpu_reset_counter));
> >>>  			amdgpu_vf_error_put(tmp_adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
> >>>  		} else {
> >>>  			dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
> >>> --
> >>> 2.34.1
> >>>
> >>
> >>
> >