[PATCH 1/2] drm/amdgpu: refine eeprom data check
Zhou1, Tao
Tao.Zhou1 at amd.com
Fri Jul 11 08:06:31 UTC 2025
[AMD Official Use Only - AMD Internal Distribution Only]
It's better to add a RAS support check as well, since RAS is not supported on every ASIC.
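
As a rough sketch of the combined guard (the RAS support check plus the NULL check Lijo asked for), using stand-in types rather than the real amdgpu structures. The names device_stub, ras_stub, eeprom_control_stub, ras_supported and eeprom_check_allowed are hypothetical and exist only for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in types for illustration only; NOT the real amdgpu definitions. */
struct eeprom_control_stub {
	bool is_eeprom_valid;
};

struct ras_stub {
	struct eeprom_control_stub eeprom_control;
};

struct device_stub {
	struct ras_stub *ras;	/* NULL when no RAS context was created */
	bool ras_supported;	/* hypothetical flag: does this ASIC support RAS? */
};

/*
 * Returns true only when it is safe to go on and verify the table
 * checksum: RAS is supported, a RAS context exists, and the EEPROM
 * was marked valid earlier.
 */
static bool eeprom_check_allowed(const struct device_stub *dev)
{
	assert(dev != NULL);	/* callers always pass a device */

	if (!dev->ras_supported)
		return false;	/* ASIC without RAS: nothing to check */
	if (!dev->ras)
		return false;	/* no RAS context: avoid a NULL dereference */

	return dev->ras->eeprom_control.is_eeprom_valid;
}
```

In the patch, the equivalent early returns would sit at the top of amdgpu_ras_eeprom_check_and_recover(), before control is dereferenced.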
Tao
> -----Original Message-----
> From: Xie, Patrick <Gangliang.Xie at amd.com>
> Sent: Friday, July 11, 2025 3:47 PM
> To: Lazar, Lijo <Lijo.Lazar at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
> Subject: RE: [PATCH 1/2] drm/amdgpu: refine eeprom data check
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Thanks, will add this NULL check
>
> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar at amd.com>
> Sent: Friday, July 11, 2025 3:17 PM
> To: Xie, Patrick <Gangliang.Xie at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
> Subject: Re: [PATCH 1/2] drm/amdgpu: refine eeprom data check
>
>
>
> On 7/11/2025 8:10 AM, ganglxie wrote:
> > Add an EEPROM data checksum check before driver unload. Reset the EEPROM
> > and save correct data to it when the check fails.
> >
> > Signed-off-by: ganglxie <ganglxie at amd.com>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 1 +
> > .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 25 +++++++++++++++++++
> > .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h | 2 ++
> > 3 files changed, 28 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index 571b70da4562..1009b60f6ae4 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -2560,6 +2560,7 @@ amdgpu_pci_remove(struct pci_dev *pdev)
> > struct drm_device *dev = pci_get_drvdata(pdev);
> > struct amdgpu_device *adev = drm_to_adev(dev);
> >
> > + amdgpu_ras_eeprom_check_and_recover(adev);
> > amdgpu_xcp_dev_unplug(adev);
> > amdgpu_gmc_prepare_nps_mode_change(adev);
> > drm_dev_unplug(dev);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > index 2af14c369bb9..2458c67526c9 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > @@ -1522,3 +1522,28 @@ int amdgpu_ras_eeprom_check(struct
> > amdgpu_ras_eeprom_control *control)
> >
> > return res < 0 ? res : 0;
> > }
> > +
> > +void amdgpu_ras_eeprom_check_and_recover(struct amdgpu_device *adev)
> > +{
> > + struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
>
> Doesn't this require a NULL check?
>
> Thanks,
> Lijo
>
> > + struct amdgpu_ras_eeprom_control *control = &ras->eeprom_control;
> > + int res = 0;
> > +
> > + if (!control->is_eeprom_valid)
> > + return;
> > + res = __verify_ras_table_checksum(control);
> > + if (res) {
> > + dev_warn(adev->dev,
> > + "RAS table incorrect checksum or error:%d, try to recover\n",
> > + res);
> > + if (!amdgpu_ras_eeprom_reset_table(control))
> > + if (!amdgpu_ras_save_bad_pages(adev, NULL))
> > + if (!__verify_ras_table_checksum(control)) {
> > + dev_info(adev->dev, "RAS table recovery succeed\n");
> > + return;
> > + }
> > + dev_err(adev->dev, "RAS table recovery failed\n");
> > + control->is_eeprom_valid = false;
> > + }
> > + return;
> > +}
> > \ No newline at end of file
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> > index 35c69ac3dbeb..ebfca4cb5688 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> > @@ -161,6 +161,8 @@ void amdgpu_ras_debugfs_set_ret_size(struct
> > amdgpu_ras_eeprom_control *control);
> >
> > int amdgpu_ras_eeprom_check(struct amdgpu_ras_eeprom_control
> > *control);
> >
> > +void amdgpu_ras_eeprom_check_and_recover(struct amdgpu_device *adev);
> > +
> > extern const struct file_operations
> > amdgpu_ras_debugfs_eeprom_size_ops;
> > extern const struct file_operations
> > amdgpu_ras_debugfs_eeprom_table_ops;
> >
>