[PATCH 1/2] drm/amdgpu/ras: Extract amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages

Thu Nov 14 16:41:51 UTC 2019

On 11/13/19 10:09 PM, Zhou1, Tao wrote:
> Two questions:
>
> 1. "we lose all reservation during ASIC reset"
> Are you sure of it? I remember the content of vram may be lost after reset but the allocated status could be reserved.

Yea, now that I am thinking of it I think i might have confused it with 
BO content recovery in amdgpu_device_recover_vram for shadow buffers 
which are page tables only but just for VRAM reservation  status this 
might be irrelevant... Christian - can you confirm Tao is correct on this ?

>
> 2. You change the bad page handle flow from:
>
> detect bad page -> reserve vram for bad page -> save bad page info to eeprom -> gpu reset
>
> to:
>
> detect bad page (this time) -> save bad page (last time) info to eeprom -> gpu reset -> reserve vram for bad page (this time)

Even though if I am wrong on the first point this is irrelevant but 
still - Why saving bad page is from last time ? See 
https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c?h=amd-staging-drm-next#n1436 
- the save count is the latest so as the content of 
data->bps[control->num_recs] up to data->bps[control->num_recs + 
save_count] as those are updated in 
amdgpu_umc_process_ras_data_cb->amdgpu_ras_add_bad_pages which is called 
right BEFORE amdgpu_ras_reset_gpu->amdgpu_ras_save_bad_pages in the 
interrupt sequence

Andrey

>
> Is that right?  Saving bad page (this time) info to eeprom is delayed to the next time that bad page is detected? But the time of detecting bad page is random.
> I think the bad page info would be lost without saving to eeprom if power off occurs.
>
> detect bad page (this time) -> save bad page (last time) info to eeprom -> gpu reset -> reserve vram for bad page (this time) -> poweroff/system reset (and bad page info (this time) is lost)
>
> Tao
>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
>> Andrey Grodzovsky
>> Sent: 2019年11月14日 6:45
>> To: amd-gfx at lists.freedesktop.org
>> Cc: alexdeucher at gmail.com; Grodzovsky, Andrey
>> <Andrey.Grodzovsky at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>;
>> Zhang, Hawking <Hawking.Zhang at amd.com>
>> Subject: [PATCH 1/2] drm/amdgpu/ras: Extract
>> amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
>>
>> We want to be be able to call them independently from one another so that
>> before GPU reset just amdgpu_ras_save_bad_pages is called.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 ++++----------
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  4 +++-
>>   2 files changed, 7 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> index 4044834..600a86d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> @@ -1421,7 +1421,7 @@ int amdgpu_ras_add_bad_pages(struct
>> amdgpu_device *adev,
>>    * write error record array to eeprom, the function should be
>>    * protected by recovery_lock
>>    */
>> -static int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
>> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
>>   {
>>   	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>>   	struct ras_err_handler_data *data;
>> @@ -1520,7 +1520,7 @@ int amdgpu_ras_reserve_bad_pages(struct
>> amdgpu_device *adev)
>>   	struct ras_err_handler_data *data;
>>   	uint64_t bp;
>>   	struct amdgpu_bo *bo = NULL;
>> -	int i, ret = 0;
>> +	int i;
>>
>>   	if (!con || !con->eh_data)
>>   		return 0;
>> @@ -1548,12 +1548,9 @@ int amdgpu_ras_reserve_bad_pages(struct
>> amdgpu_device *adev)
>>   		data->last_reserved = i + 1;
>>   		bo = NULL;
>>   	}
>> -
>> -	/* continue to save bad pages to eeprom even reesrve_vram fails */
>> -	ret = amdgpu_ras_save_bad_pages(adev);
>>   out:
>>   	mutex_unlock(&con->recovery_lock);
>> -	return ret;
>> +	return 0;
>>   }
>>
>>   /* called when driver unload */
>> @@ -1615,14 +1612,11 @@ int amdgpu_ras_recovery_init(struct
>> amdgpu_device *adev)
>>   		ret = amdgpu_ras_load_bad_pages(adev);
>>   		if (ret)
>>   			goto free;
>> -		ret = amdgpu_ras_reserve_bad_pages(adev);
>> -		if (ret)
>> -			goto release;
>> +		amdgpu_ras_reserve_bad_pages(adev);
>>   	}
>>
>>   	return 0;
>>
>> -release:
>>   	amdgpu_ras_release_bad_pages(adev);
>>   free:
>>   	kfree((*data)->bps);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> index f80fd34..12b0797 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> @@ -492,6 +492,8 @@ unsigned long
>> amdgpu_ras_query_error_count(struct amdgpu_device *adev,  int
>> amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>>   		struct eeprom_table_record *bps, int pages);
>>
>> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
>> +
>>   int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev);
>>
>>   static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev, @@ -
>> 503,7 +505,7 @@ static inline int amdgpu_ras_reset_gpu(struct
>> amdgpu_device *adev,
>>   	 * i2c may be unstable in gpu reset
>>   	 */
>>   	if (in_task())
>> -		amdgpu_ras_reserve_bad_pages(adev);
>> +		amdgpu_ras_save_bad_pages(adev);
>>
>>   	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>>   		schedule_work(&ras->recovery_work);
>> --
>> 2.7.4
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx