[PATCH 1/2] drm/amdgpu: refine eeprom data check
Lazar, Lijo
lijo.lazar at amd.com
Mon Jul 7 10:21:02 UTC 2025
On 7/7/2025 2:49 PM, Xie, Patrick wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Hi, @Lazar, Lijo, this is a way to ensure no data corruption happens in EEPROM after an EEPROM writing, the return value of amdgpu_ras_save_bad_pages does not indicate anything about this, and we have had a discussion about this, a slight loss in efficiency is acceptable.
I don't it is acceptable during RAS error recovery. A recovery needs to
be initiated at the earliest. The current implementation also has many
handshakes with FW for EEPROM read/writes. If required, the cross-check
may be done after reset which is added with this patch.
Thanks,
Lijo
>
> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar at amd.com>
> Sent: Monday, July 7, 2025 4:29 PM
> To: Xie, Patrick <Gangliang.Xie at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
> Subject: Re: [PATCH 1/2] drm/amdgpu: refine eeprom data check
>
>
>
> On 7/7/2025 12:39 PM, ganglxie wrote:
>> add eeprom data checksum check after data writing, before gpu reset,
>> and before driver unload reset eeprom and save correct data to eeprom
>> when check failed
>>
>> Signed-off-by: ganglxie <ganglxie at amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 1 +
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 +++++-
>> .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 25 +++++++++++++++++++
>> .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h | 2 ++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 2 ++
>> 6 files changed, 37 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 84fcaf84fead..ecdebca7f3f5 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -6506,6 +6506,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>> if (!r)
>> drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
>>
>> + amdgpu_ras_eeprom_check_and_recover(adev);
>> return r;
>> }
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index 571b70da4562..1009b60f6ae4 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -2560,6 +2560,7 @@ amdgpu_pci_remove(struct pci_dev *pdev)
>> struct drm_device *dev = pci_get_drvdata(pdev);
>> struct amdgpu_device *adev = drm_to_adev(dev);
>>
>> + amdgpu_ras_eeprom_check_and_recover(adev);
>> amdgpu_xcp_dev_unplug(adev);
>> amdgpu_gmc_prepare_nps_mode_change(adev);
>> drm_dev_unplug(dev);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> index f8a8c8502013..e03550be45b4 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> @@ -196,6 +196,7 @@ static int amdgpu_reserve_page_direct(struct amdgpu_device *adev, uint64_t addre
>> amdgpu_ras_add_bad_pages(adev, err_data.err_addr,
>> err_data.err_addr_cnt, false);
>> amdgpu_ras_save_bad_pages(adev, NULL);
>> + amdgpu_ras_eeprom_check_and_recover(adev);
>> }
>>
>> amdgpu_ras_error_data_fini(&err_data);
>> @@ -3539,9 +3540,13 @@ int amdgpu_ras_init_badpage_info(struct amdgpu_device *adev)
>> /* The format action is only applied to new ASICs */
>> if (IP_VERSION_MAJ(amdgpu_ip_version(adev, UMC_HWIP, 0)) >= 12 &&
>> control->tbl_hdr.version < RAS_TABLE_VER_V3)
>> - if (!amdgpu_ras_eeprom_reset_table(control))
>> + if (!amdgpu_ras_eeprom_reset_table(control)) {
>> if (amdgpu_ras_save_bad_pages(adev, NULL))
>> dev_warn(adev->dev, "Failed to format RAS EEPROM data in V3
>> version!\n");
>> + else
>> + amdgpu_ras_eeprom_check_and_recover(adev);
>> + }
>> +
>> }
>>
>> return 0;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> index 2af14c369bb9..2458c67526c9 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> @@ -1522,3 +1522,28 @@ int amdgpu_ras_eeprom_check(struct
>> amdgpu_ras_eeprom_control *control)
>>
>> return res < 0 ? res : 0;
>> }
>> +
>> +void amdgpu_ras_eeprom_check_and_recover(struct amdgpu_device *adev)
>> +{
>> + struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
>> + struct amdgpu_ras_eeprom_control *control = &ras->eeprom_control;
>> + int res = 0;
>> +
>> + if (!control->is_eeprom_valid)
>> + return;
>> + res = __verify_ras_table_checksum(control);
>> + if (res) {
>> + dev_warn(adev->dev,
>> + "RAS table incorrect checksum or error:%d, try to recover\n",
>> + res);
>> + if (!amdgpu_ras_eeprom_reset_table(control))
>> + if (!amdgpu_ras_save_bad_pages(adev, NULL))
>> + if (!__verify_ras_table_checksum(control)) {
>> + dev_info(adev->dev, "RAS table recovery succeed\n");
>> + return;
>> + }
>> + dev_err(adev->dev, "RAS table recovery failed\n");
>> + control->is_eeprom_valid = false;
>> + }
>> + return;
>> +}
>> \ No newline at end of file
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
>> index 35c69ac3dbeb..ebfca4cb5688 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
>> @@ -161,6 +161,8 @@ void amdgpu_ras_debugfs_set_ret_size(struct
>> amdgpu_ras_eeprom_control *control);
>>
>> int amdgpu_ras_eeprom_check(struct amdgpu_ras_eeprom_control
>> *control);
>>
>> +void amdgpu_ras_eeprom_check_and_recover(struct amdgpu_device *adev);
>> +
>> extern const struct file_operations
>> amdgpu_ras_debugfs_eeprom_size_ops;
>> extern const struct file_operations
>> amdgpu_ras_debugfs_eeprom_table_ops;
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
>> index bfc86f1e84e5..f36fe46541ed 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
>> @@ -80,6 +80,7 @@ int amdgpu_umc_page_retirement_mca(struct amdgpu_device *adev,
>> amdgpu_ras_add_bad_pages(adev, err_data.err_addr,
>> err_data.err_addr_cnt, false);
>> amdgpu_ras_save_bad_pages(adev, NULL);
>> + amdgpu_ras_eeprom_check_and_recover(adev);
>> }
>>
>> out_free_err_addr:
>> @@ -168,6 +169,7 @@ void amdgpu_umc_handle_bad_pages(struct amdgpu_device *adev,
>> amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
>> err_data->err_addr_cnt, false);
>> amdgpu_ras_save_bad_pages(adev, &err_count);
>> + amdgpu_ras_eeprom_check_and_recover(adev);
>
> This doesn't look optimal. Reading the entire EEPROM each time before going for RAS recovery is not ideal. At minimum it should have a check whether the save failed, and then consider saving them later after the reset.
>
> Thanks,
> Lijo
>>
>> amdgpu_dpm_send_hbm_bad_pages_num(adev,
>> con->eeprom_control.ras_num_bad_pages);
>
More information about the amd-gfx
mailing list