[PATCH 1/2] drm/amdgpu/ras: Extract amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
Andrey Grodzovsky
Andrey.Grodzovsky at amd.com
Mon Nov 18 14:05:26 UTC 2019
Christian - ping.
Andrey
On 11/14/19 11:41 AM, Andrey Grodzovsky wrote:
>
> On 11/13/19 10:09 PM, Zhou1, Tao wrote:
>> Two questions:
>>
>> 1. "we lose all reservation during ASIC reset"
>> Are you sure of it? I remember the content of vram may be lost after
>> reset but the allocated status could be reserved.
>
> Yea, now that I am thinking of it I think i might have confused it
> with BO content recovery in amdgpu_device_recover_vram for shadow
> buffers which are page tables only but just for VRAM reservation
> status this might be irrelevant... Christian - can you confirm Tao is
> correct on this ?
>
>>
>> 2. You change the bad page handle flow from:
>>
>> detect bad page -> reserve vram for bad page -> save bad page info to
>> eeprom -> gpu reset
>>
>> to:
>>
>> detect bad page (this time) -> save bad page (last time) info to
>> eeprom -> gpu reset -> reserve vram for bad page (this time)
>
> Even though if I am wrong on the first point this is irrelevant but
> still - Why saving bad page is from last time ? See
> https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c?h=amd-staging-drm-next#n1436
> - the save count is the latest so as the content of
> data->bps[control->num_recs] up to data->bps[control->num_recs +
> save_count] as those are updated in
> amdgpu_umc_process_ras_data_cb->amdgpu_ras_add_bad_pages which is
> called right BEFORE amdgpu_ras_reset_gpu->amdgpu_ras_save_bad_pages in
> the interrupt sequence
>
> Andrey
>
>
>>
>> Is that right? Saving bad page (this time) info to eeprom is delayed
>> to the next time that bad page is detected? But the time of detecting
>> bad page is random.
>> I think the bad page info would be lost without saving to eeprom if
>> power off occurs.
>>
>> detect bad page (this time) -> save bad page (last time) info to
>> eeprom -> gpu reset -> reserve vram for bad page (this time) ->
>> poweroff/system reset (and bad page info (this time) is lost)
>>
>> Tao
>>
>>> -----Original Message-----
>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
>>> Andrey Grodzovsky
>>> Sent: 2019年11月14日 6:45
>>> To: amd-gfx at lists.freedesktop.org
>>> Cc: alexdeucher at gmail.com; Grodzovsky, Andrey
>>> <Andrey.Grodzovsky at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>;
>>> Zhang, Hawking <Hawking.Zhang at amd.com>
>>> Subject: [PATCH 1/2] drm/amdgpu/ras: Extract
>>> amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
>>>
>>> We want to be be able to call them independently from one another so
>>> that
>>> before GPU reset just amdgpu_ras_save_bad_pages is called.
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 ++++----------
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4 +++-
>>> 2 files changed, 7 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> index 4044834..600a86d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> @@ -1421,7 +1421,7 @@ int amdgpu_ras_add_bad_pages(struct
>>> amdgpu_device *adev,
>>> * write error record array to eeprom, the function should be
>>> * protected by recovery_lock
>>> */
>>> -static int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
>>> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
>>> {
>>> struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>>> struct ras_err_handler_data *data;
>>> @@ -1520,7 +1520,7 @@ int amdgpu_ras_reserve_bad_pages(struct
>>> amdgpu_device *adev)
>>> struct ras_err_handler_data *data;
>>> uint64_t bp;
>>> struct amdgpu_bo *bo = NULL;
>>> - int i, ret = 0;
>>> + int i;
>>>
>>> if (!con || !con->eh_data)
>>> return 0;
>>> @@ -1548,12 +1548,9 @@ int amdgpu_ras_reserve_bad_pages(struct
>>> amdgpu_device *adev)
>>> data->last_reserved = i + 1;
>>> bo = NULL;
>>> }
>>> -
>>> - /* continue to save bad pages to eeprom even reesrve_vram fails */
>>> - ret = amdgpu_ras_save_bad_pages(adev);
>>> out:
>>> mutex_unlock(&con->recovery_lock);
>>> - return ret;
>>> + return 0;
>>> }
>>>
>>> /* called when driver unload */
>>> @@ -1615,14 +1612,11 @@ int amdgpu_ras_recovery_init(struct
>>> amdgpu_device *adev)
>>> ret = amdgpu_ras_load_bad_pages(adev);
>>> if (ret)
>>> goto free;
>>> - ret = amdgpu_ras_reserve_bad_pages(adev);
>>> - if (ret)
>>> - goto release;
>>> + amdgpu_ras_reserve_bad_pages(adev);
>>> }
>>>
>>> return 0;
>>>
>>> -release:
>>> amdgpu_ras_release_bad_pages(adev);
>>> free:
>>> kfree((*data)->bps);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> index f80fd34..12b0797 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> @@ -492,6 +492,8 @@ unsigned long
>>> amdgpu_ras_query_error_count(struct amdgpu_device *adev, int
>>> amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>>> struct eeprom_table_record *bps, int pages);
>>>
>>> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
>>> +
>>> int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev);
>>>
>>> static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev,
>>> @@ -
>>> 503,7 +505,7 @@ static inline int amdgpu_ras_reset_gpu(struct
>>> amdgpu_device *adev,
>>> * i2c may be unstable in gpu reset
>>> */
>>> if (in_task())
>>> - amdgpu_ras_reserve_bad_pages(adev);
>>> + amdgpu_ras_save_bad_pages(adev);
>>>
>>> if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>>> schedule_work(&ras->recovery_work);
>>> --
>>> 2.7.4
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
More information about the amd-gfx
mailing list