[PATCH 1/2] drm/amdgpu/ras: Extract amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages

Thu Nov 14 03:09:45 UTC 2019

Two questions:

1. "we lose all reservation during ASIC reset"
Are you sure of it? I remember the content of vram may be lost after reset but the allocated status could be reserved.

2. You change the bad page handle flow from:

detect bad page -> reserve vram for bad page -> save bad page info to eeprom -> gpu reset

to:

detect bad page (this time) -> save bad page (last time) info to eeprom -> gpu reset -> reserve vram for bad page (this time)

Is that right?  Saving bad page (this time) info to eeprom is delayed to the next time that bad page is detected? But the time of detecting bad page is random.
I think the bad page info would be lost without saving to eeprom if power off occurs.

detect bad page (this time) -> save bad page (last time) info to eeprom -> gpu reset -> reserve vram for bad page (this time) -> poweroff/system reset (and bad page info (this time) is lost)

Tao

> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
> Andrey Grodzovsky
> Sent: 2019年11月14日 6:45
> To: amd-gfx at lists.freedesktop.org
> Cc: alexdeucher at gmail.com; Grodzovsky, Andrey
> <Andrey.Grodzovsky at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>;
> Zhang, Hawking <Hawking.Zhang at amd.com>
> Subject: [PATCH 1/2] drm/amdgpu/ras: Extract
> amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
> 
> We want to be be able to call them independently from one another so that
> before GPU reset just amdgpu_ras_save_bad_pages is called.
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 ++++----------
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  4 +++-
>  2 files changed, 7 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 4044834..600a86d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1421,7 +1421,7 @@ int amdgpu_ras_add_bad_pages(struct
> amdgpu_device *adev,
>   * write error record array to eeprom, the function should be
>   * protected by recovery_lock
>   */
> -static int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
>  {
>  	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>  	struct ras_err_handler_data *data;
> @@ -1520,7 +1520,7 @@ int amdgpu_ras_reserve_bad_pages(struct
> amdgpu_device *adev)
>  	struct ras_err_handler_data *data;
>  	uint64_t bp;
>  	struct amdgpu_bo *bo = NULL;
> -	int i, ret = 0;
> +	int i;
> 
>  	if (!con || !con->eh_data)
>  		return 0;
> @@ -1548,12 +1548,9 @@ int amdgpu_ras_reserve_bad_pages(struct
> amdgpu_device *adev)
>  		data->last_reserved = i + 1;
>  		bo = NULL;
>  	}
> -
> -	/* continue to save bad pages to eeprom even reesrve_vram fails */
> -	ret = amdgpu_ras_save_bad_pages(adev);
>  out:
>  	mutex_unlock(&con->recovery_lock);
> -	return ret;
> +	return 0;
>  }
> 
>  /* called when driver unload */
> @@ -1615,14 +1612,11 @@ int amdgpu_ras_recovery_init(struct
> amdgpu_device *adev)
>  		ret = amdgpu_ras_load_bad_pages(adev);
>  		if (ret)
>  			goto free;
> -		ret = amdgpu_ras_reserve_bad_pages(adev);
> -		if (ret)
> -			goto release;
> +		amdgpu_ras_reserve_bad_pages(adev);
>  	}
> 
>  	return 0;
> 
> -release:
>  	amdgpu_ras_release_bad_pages(adev);
>  free:
>  	kfree((*data)->bps);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index f80fd34..12b0797 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -492,6 +492,8 @@ unsigned long
> amdgpu_ras_query_error_count(struct amdgpu_device *adev,  int
> amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>  		struct eeprom_table_record *bps, int pages);
> 
> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
> +
>  int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev);
> 
>  static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev, @@ -
> 503,7 +505,7 @@ static inline int amdgpu_ras_reset_gpu(struct
> amdgpu_device *adev,
>  	 * i2c may be unstable in gpu reset
>  	 */
>  	if (in_task())
> -		amdgpu_ras_reserve_bad_pages(adev);
> +		amdgpu_ras_save_bad_pages(adev);
> 
>  	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>  		schedule_work(&ras->recovery_work);
> --
> 2.7.4
> 
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx