[PATCH 1/2] drm/amdgpu/ras: Extract amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages

Mon Nov 18 14:05:26 UTC 2019

Christian - ping.

Andrey

On 11/14/19 11:41 AM, Andrey Grodzovsky wrote:
>
> On 11/13/19 10:09 PM, Zhou1, Tao wrote:
>> Two questions:
>>
>> 1. "we lose all reservation during ASIC reset"
>> Are you sure of it? I remember the content of vram may be lost after 
>> reset but the allocated status could be reserved.
>
> Yea, now that I am thinking of it I think i might have confused it 
> with BO content recovery in amdgpu_device_recover_vram for shadow 
> buffers which are page tables only but just for VRAM reservation 
> status this might be irrelevant... Christian - can you confirm Tao is 
> correct on this ?
>
>>
>> 2. You change the bad page handle flow from:
>>
>> detect bad page -> reserve vram for bad page -> save bad page info to 
>> eeprom -> gpu reset
>>
>> to:
>>
>> detect bad page (this time) -> save bad page (last time) info to 
>> eeprom -> gpu reset -> reserve vram for bad page (this time)
>
> Even though if I am wrong on the first point this is irrelevant but 
> still - Why saving bad page is from last time ? See 
> https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c?h=amd-staging-drm-next#n1436 
> - the save count is the latest so as the content of 
> data->bps[control->num_recs] up to data->bps[control->num_recs + 
> save_count] as those are updated in 
> amdgpu_umc_process_ras_data_cb->amdgpu_ras_add_bad_pages which is 
> called right BEFORE amdgpu_ras_reset_gpu->amdgpu_ras_save_bad_pages in 
> the interrupt sequence
>
> Andrey
>
>
>>
>> Is that right?  Saving bad page (this time) info to eeprom is delayed 
>> to the next time that bad page is detected? But the time of detecting 
>> bad page is random.
>> I think the bad page info would be lost without saving to eeprom if 
>> power off occurs.
>>
>> detect bad page (this time) -> save bad page (last time) info to 
>> eeprom -> gpu reset -> reserve vram for bad page (this time) -> 
>> poweroff/system reset (and bad page info (this time) is lost)
>>
>> Tao
>>
>>> -----Original Message-----
>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
>>> Andrey Grodzovsky
>>> Sent: 2019年11月14日 6:45
>>> To: amd-gfx at lists.freedesktop.org
>>> Cc: alexdeucher at gmail.com; Grodzovsky, Andrey
>>> <Andrey.Grodzovsky at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>;
>>> Zhang, Hawking <Hawking.Zhang at amd.com>
>>> Subject: [PATCH 1/2] drm/amdgpu/ras: Extract
>>> amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
>>>
>>> We want to be be able to call them independently from one another so 
>>> that
>>> before GPU reset just amdgpu_ras_save_bad_pages is called.
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 ++++----------
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  4 +++-
>>>   2 files changed, 7 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> index 4044834..600a86d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> @@ -1421,7 +1421,7 @@ int amdgpu_ras_add_bad_pages(struct
>>> amdgpu_device *adev,
>>>    * write error record array to eeprom, the function should be
>>>    * protected by recovery_lock
>>>    */
>>> -static int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
>>> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
>>>   {
>>>       struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>>>       struct ras_err_handler_data *data;
>>> @@ -1520,7 +1520,7 @@ int amdgpu_ras_reserve_bad_pages(struct
>>> amdgpu_device *adev)
>>>       struct ras_err_handler_data *data;
>>>       uint64_t bp;
>>>       struct amdgpu_bo *bo = NULL;
>>> -    int i, ret = 0;
>>> +    int i;
>>>
>>>       if (!con || !con->eh_data)
>>>           return 0;
>>> @@ -1548,12 +1548,9 @@ int amdgpu_ras_reserve_bad_pages(struct
>>> amdgpu_device *adev)
>>>           data->last_reserved = i + 1;
>>>           bo = NULL;
>>>       }
>>> -
>>> -    /* continue to save bad pages to eeprom even reesrve_vram fails */
>>> -    ret = amdgpu_ras_save_bad_pages(adev);
>>>   out:
>>>       mutex_unlock(&con->recovery_lock);
>>> -    return ret;
>>> +    return 0;
>>>   }
>>>
>>>   /* called when driver unload */
>>> @@ -1615,14 +1612,11 @@ int amdgpu_ras_recovery_init(struct
>>> amdgpu_device *adev)
>>>           ret = amdgpu_ras_load_bad_pages(adev);
>>>           if (ret)
>>>               goto free;
>>> -        ret = amdgpu_ras_reserve_bad_pages(adev);
>>> -        if (ret)
>>> -            goto release;
>>> +        amdgpu_ras_reserve_bad_pages(adev);
>>>       }
>>>
>>>       return 0;
>>>
>>> -release:
>>>       amdgpu_ras_release_bad_pages(adev);
>>>   free:
>>>       kfree((*data)->bps);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> index f80fd34..12b0797 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> @@ -492,6 +492,8 @@ unsigned long
>>> amdgpu_ras_query_error_count(struct amdgpu_device *adev,  int
>>> amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>>>           struct eeprom_table_record *bps, int pages);
>>>
>>> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
>>> +
>>>   int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev);
>>>
>>>   static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev, 
>>> @@ -
>>> 503,7 +505,7 @@ static inline int amdgpu_ras_reset_gpu(struct
>>> amdgpu_device *adev,
>>>        * i2c may be unstable in gpu reset
>>>        */
>>>       if (in_task())
>>> -        amdgpu_ras_reserve_bad_pages(adev);
>>> +        amdgpu_ras_save_bad_pages(adev);
>>>
>>>       if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>>>           schedule_work(&ras->recovery_work);
>>> -- 
>>> 2.7.4
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx