[PATCH] drm/amdgpu: report bad status in GPU recovery
Zhang, Hawking
Hawking.Zhang at amd.com
Thu Aug 1 07:50:04 UTC 2024
[AMD Official Use Only - AMD Internal Distribution Only]
How about replace amdgpu_ras_eeprom_check_err_threshold with ras->is_rma and move amdgpu_ras_eeprom_check_err_threshold to end of gpu recovery as the change you made. In such case we don;'t need to add a paremter in the function just to disable a few messages. And also the message will be the end of gpu recovery as a reminder.
amdgpu_do_asic_reset
if (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev)) {
/* must succeed. */
amdgpu_ras_resume(tmp_adev);
} else {
r = -EINVAL;
goto out;
}
Regards,
Hawking
-----Original Message-----
From: Zhou1, Tao <Tao.Zhou1 at amd.com>
Sent: Thursday, August 1, 2024 13:58
To: Zhang, Hawking <Hawking.Zhang at amd.com>; amd-gfx at lists.freedesktop.org
Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
[AMD Official Use Only - AMD Internal Distribution Only]
Yes, the bad status message is printed twice with this patch. I think it's harmless and the second message is more convenient for customer.
I can add a parameter for amdgpu_ras_eeprom_check_err_threshold to disable the first message if you think printing message twice is not a good idea.
Tao
> -----Original Message-----
> From: Zhang, Hawking <Hawking.Zhang at amd.com>
> Sent: Thursday, August 1, 2024 1:30 PM
> To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Right, it's functional. My concern is whether the kernel message in
> amdgpu_ras_eeprom_check_err_threshold will be printed twice. This is
> the end of gpu recovery (i.e., report gpu reset failed or gpu reset succeed).
> Check_err_threshold was already done before reaching here.
>
> Regards,
> Hawking
>
> -----Original Message-----
> From: Zhou1, Tao <Tao.Zhou1 at amd.com>
> Sent: Thursday, August 1, 2024 11:49
> To: Zhang, Hawking <Hawking.Zhang at amd.com>;
> amd-gfx at lists.freedesktop.org
> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> I think the if condition in amdgpu_ras_eeprom_check_err_threshold is
> good enough, no need to update it with is_rma.
>
> Tao
>
> > -----Original Message-----
> > From: Zhang, Hawking <Hawking.Zhang at amd.com>
> > Sent: Thursday, August 1, 2024 11:00 AM
> > To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org
> > Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
> > Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
> >
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> > Might consider leverage is_RMA flag for the same purpose?
> >
> > Regards,
> > Hawking
> >
> > -----Original Message-----
> > From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
> > Tao Zhou
> > Sent: Wednesday, July 31, 2024 18:05
> > To: amd-gfx at lists.freedesktop.org
> > Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
> > Subject: [PATCH] drm/amdgpu: report bad status in GPU recovery
> >
> > Instead of printing GPU reset failed.
> >
> > Signed-off-by: Tao Zhou <tao.zhou1 at amd.com>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
> > 1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 355c2478c4b6..b7c967779b4b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -5933,8 +5933,13 @@ int amdgpu_device_gpu_recover(struct
> > amdgpu_device *adev,
> > tmp_adev->asic_reset_res = 0;
> >
> > if (r) {
> > - /* bad news, how to tell it to userspace ? */
> > - dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
> > atomic_read(&tmp_adev->gpu_reset_counter));
> > + /* bad news, how to tell it to userspace ?
> > + * for ras error, we should report GPU bad status instead of
> > + * reset failure
> > + */
> > + if (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
> > + dev_info(tmp_adev->dev, "GPU
> > + reset(%d) failed\n",
> > +
> > + atomic_read(&tmp_adev->gpu_reset_counter));
> > amdgpu_vf_error_put(tmp_adev,
> > AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
> > } else {
> > dev_info(tmp_adev->dev, "GPU reset(%d)
> > succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
> > --
> > 2.34.1
> >
>
>
More information about the amd-gfx
mailing list