[PATCH] drm/amdgpu: protect eeprom update from GPU reset

Christian König ckoenig.leichtzumerken at gmail.com
Thu Oct 15 06:50:53 UTC 2020


Looks like the right approach to me as well.

Patch is Reviewed-by: Christian König <christian.koenig at amd.com>.

Regards,
Christian.

Am 14.10.20 um 13:44 schrieb Zhang, Hawking:
> [AMD Public Use]
>
> Thanks for the clarification, Dennis. So this is a kind of race condition between the normal GPU reset and the RAS GPU reset. I'm fine with the change. The patch is
>
> Reviewed-by: Hawking Zhang <Hawking.Zhang at amd.com>
>
> Regards,
> Hawking
>
> -----Original Message-----
> From: Li, Dennis <Dennis.Li at amd.com>
> Sent: Wednesday, October 14, 2020 18:08
> To: Zhang, Hawking <Hawking.Zhang at amd.com>; amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>; Kuehling, Felix <Felix.Kuehling at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>
> Subject: RE: [PATCH] drm/amdgpu: protect eeprom update from GPU reset
>
> [AMD Public Use]
>
> Hi, Hawking,
>        The driver has multiple paths into GPU reset, so it cannot guarantee that the bad page record update has finished before a GPU reset starts.
>
> Best Regards
> Dennis Li
> -----Original Message-----
> From: Zhang, Hawking <Hawking.Zhang at amd.com>
> Sent: Wednesday, October 14, 2020 5:52 PM
> To: Li, Dennis <Dennis.Li at amd.com>; amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>; Kuehling, Felix <Felix.Kuehling at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>
> Cc: Li, Dennis <Dennis.Li at amd.com>
> Subject: RE: [PATCH] drm/amdgpu: protect eeprom update from GPU reset
>
> [AMD Public Use]
>
> Hmm, I think the bad page record update is done ahead of scheduling the GPU reset work. For the mGPU case, shall we walk through all the nodes in a hive before issuing the GPU reset work?
>
> Regards,
> Hawking
>
> -----Original Message-----
> From: Dennis Li <Dennis.Li at amd.com>
> Sent: Wednesday, October 14, 2020 17:41
> To: amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>; Kuehling, Felix <Felix.Kuehling at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>
> Cc: Li, Dennis <Dennis.Li at amd.com>
> Subject: [PATCH] drm/amdgpu: protect eeprom update from GPU reset
>
> Because i2c is unstable during GPU reset, the driver needs to protect EEPROM updates from GPU reset so that no bad page record is missed.
>
> Signed-off-by: Dennis Li <Dennis.Li at amd.com>
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index 0e64c39a2372..695bcfc5c983 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -149,7 +149,11 @@ static int __update_table_header(struct amdgpu_ras_eeprom_control *control,
>   
>   	msg.addr = control->i2c_address;
>   
> +	/* i2c may be unstable in gpu reset */
> +	down_read(&adev->reset_sem);
>   	ret = i2c_transfer(&adev->pm.smu_i2c, &msg, 1);
> +	up_read(&adev->reset_sem);
> +
>   	if (ret < 1)
>   		DRM_ERROR("Failed to write EEPROM table header, ret:%d", ret);
>   
> @@ -557,7 +561,11 @@ int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control *control,
>   		control->next_addr += EEPROM_TABLE_RECORD_SIZE;
>   	}
>   
> +	/* i2c may be unstable in gpu reset */
> +	down_read(&adev->reset_sem);
>   	ret = i2c_transfer(&adev->pm.smu_i2c, msgs, num);
> +	up_read(&adev->reset_sem);
> +
>   	if (ret < 1) {
>   		DRM_ERROR("Failed to process EEPROM table records, ret:%d", ret);
>   
> --
> 2.17.1
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


