[PATCH 1/2] drm/amdgpu/ras: Extract amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages

Mon Nov 18 14:17:27 UTC 2019

Am 18.11.19 um 15:05 schrieb Andrey Grodzovsky:
> Christian - ping.
>
> Andrey
>
> On 11/14/19 11:41 AM, Andrey Grodzovsky wrote:
>>
>> On 11/13/19 10:09 PM, Zhou1, Tao wrote:
>>> Two questions:
>>>
>>> 1. "we lose all reservation during ASIC reset"
>>> Are you sure of it? I remember the content of vram may be lost after 
>>> reset but the allocated status could be reserved.
>>
>> Yea, now that I am thinking of it I think i might have confused it 
>> with BO content recovery in amdgpu_device_recover_vram for shadow 
>> buffers which are page tables only but just for VRAM reservation 
>> status this might be irrelevant... Christian - can you confirm Tao is 
>> correct on this ?

Yeah, that is correct. The BO structure stays, just the content is lost.

You guys should maybe consider moving all that stuff into 
amdgpu_vram_mgr.c. It is far easier to handle there because you don't 
need to care about memory which is currently allocated.

Regards,
Christian.

>>
>>>
>>> 2. You change the bad page handle flow from:
>>>
>>> detect bad page -> reserve vram for bad page -> save bad page info 
>>> to eeprom -> gpu reset
>>>
>>> to:
>>>
>>> detect bad page (this time) -> save bad page (last time) info to 
>>> eeprom -> gpu reset -> reserve vram for bad page (this time)
>>
>> Even though if I am wrong on the first point this is irrelevant but 
>> still - Why saving bad page is from last time ? See 
>> https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c?h=amd-staging-drm-next#n1436 
>> - the save count is the latest so as the content of 
>> data->bps[control->num_recs] up to data->bps[control->num_recs + 
>> save_count] as those are updated in 
>> amdgpu_umc_process_ras_data_cb->amdgpu_ras_add_bad_pages which is 
>> called right BEFORE amdgpu_ras_reset_gpu->amdgpu_ras_save_bad_pages 
>> in the interrupt sequence
>>
>> Andrey
>>
>>
>>>
>>> Is that right?  Saving bad page (this time) info to eeprom is 
>>> delayed to the next time that bad page is detected? But the time of 
>>> detecting bad page is random.
>>> I think the bad page info would be lost without saving to eeprom if 
>>> power off occurs.
>>>
>>> detect bad page (this time) -> save bad page (last time) info to 
>>> eeprom -> gpu reset -> reserve vram for bad page (this time) -> 
>>> poweroff/system reset (and bad page info (this time) is lost)
>>>
>>> Tao
>>>
>>>> -----Original Message-----
>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
>>>> Andrey Grodzovsky
>>>> Sent: 2019年11月14日 6:45
>>>> To: amd-gfx at lists.freedesktop.org
>>>> Cc: alexdeucher at gmail.com; Grodzovsky, Andrey
>>>> <Andrey.Grodzovsky at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>;
>>>> Zhang, Hawking <Hawking.Zhang at amd.com>
>>>> Subject: [PATCH 1/2] drm/amdgpu/ras: Extract
>>>> amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
>>>>
>>>> We want to be be able to call them independently from one another 
>>>> so that
>>>> before GPU reset just amdgpu_ras_save_bad_pages is called.
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 ++++----------
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  4 +++-
>>>>   2 files changed, 7 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>>> index 4044834..600a86d 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>>> @@ -1421,7 +1421,7 @@ int amdgpu_ras_add_bad_pages(struct
>>>> amdgpu_device *adev,
>>>>    * write error record array to eeprom, the function should be
>>>>    * protected by recovery_lock
>>>>    */
>>>> -static int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
>>>> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
>>>>   {
>>>>       struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>>>>       struct ras_err_handler_data *data;
>>>> @@ -1520,7 +1520,7 @@ int amdgpu_ras_reserve_bad_pages(struct
>>>> amdgpu_device *adev)
>>>>       struct ras_err_handler_data *data;
>>>>       uint64_t bp;
>>>>       struct amdgpu_bo *bo = NULL;
>>>> -    int i, ret = 0;
>>>> +    int i;
>>>>
>>>>       if (!con || !con->eh_data)
>>>>           return 0;
>>>> @@ -1548,12 +1548,9 @@ int amdgpu_ras_reserve_bad_pages(struct
>>>> amdgpu_device *adev)
>>>>           data->last_reserved = i + 1;
>>>>           bo = NULL;
>>>>       }
>>>> -
>>>> -    /* continue to save bad pages to eeprom even reesrve_vram 
>>>> fails */
>>>> -    ret = amdgpu_ras_save_bad_pages(adev);
>>>>   out:
>>>>       mutex_unlock(&con->recovery_lock);
>>>> -    return ret;
>>>> +    return 0;
>>>>   }
>>>>
>>>>   /* called when driver unload */
>>>> @@ -1615,14 +1612,11 @@ int amdgpu_ras_recovery_init(struct
>>>> amdgpu_device *adev)
>>>>           ret = amdgpu_ras_load_bad_pages(adev);
>>>>           if (ret)
>>>>               goto free;
>>>> -        ret = amdgpu_ras_reserve_bad_pages(adev);
>>>> -        if (ret)
>>>> -            goto release;
>>>> +        amdgpu_ras_reserve_bad_pages(adev);
>>>>       }
>>>>
>>>>       return 0;
>>>>
>>>> -release:
>>>>       amdgpu_ras_release_bad_pages(adev);
>>>>   free:
>>>>       kfree((*data)->bps);
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>> index f80fd34..12b0797 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>> @@ -492,6 +492,8 @@ unsigned long
>>>> amdgpu_ras_query_error_count(struct amdgpu_device *adev, int
>>>> amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>>>>           struct eeprom_table_record *bps, int pages);
>>>>
>>>> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
>>>> +
>>>>   int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev);
>>>>
>>>>   static inline int amdgpu_ras_reset_gpu(struct amdgpu_device 
>>>> *adev, @@ -
>>>> 503,7 +505,7 @@ static inline int amdgpu_ras_reset_gpu(struct
>>>> amdgpu_device *adev,
>>>>        * i2c may be unstable in gpu reset
>>>>        */
>>>>       if (in_task())
>>>> -        amdgpu_ras_reserve_bad_pages(adev);
>>>> +        amdgpu_ras_save_bad_pages(adev);
>>>>
>>>>       if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>>>>           schedule_work(&ras->recovery_work);
>>>> -- 
>>>> 2.7.4
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx