[PATCH 1/2] drm/amdgpu/ras: Extract amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
Christian König
christian.koenig at amd.com
Mon Nov 18 14:17:27 UTC 2019
Am 18.11.19 um 15:05 schrieb Andrey Grodzovsky:
> Christian - ping.
>
> Andrey
>
> On 11/14/19 11:41 AM, Andrey Grodzovsky wrote:
>>
>> On 11/13/19 10:09 PM, Zhou1, Tao wrote:
>>> Two questions:
>>>
>>> 1. "we lose all reservation during ASIC reset"
>>> Are you sure of it? I remember the content of vram may be lost after
>>> reset but the allocated status could be reserved.
>>
>> Yea, now that I am thinking of it I think i might have confused it
>> with BO content recovery in amdgpu_device_recover_vram for shadow
>> buffers which are page tables only but just for VRAM reservation
>> status this might be irrelevant... Christian - can you confirm Tao is
>> correct on this ?
Yeah, that is correct. The BO structure stays, just the content is lost.
You guys should maybe consider moving all that stuff into
amdgpu_vram_mgr.c. It is far easier to handle there because you don't
need to care about memory which is currently allocated.
Regards,
Christian.
>>
>>>
>>> 2. You change the bad page handle flow from:
>>>
>>> detect bad page -> reserve vram for bad page -> save bad page info
>>> to eeprom -> gpu reset
>>>
>>> to:
>>>
>>> detect bad page (this time) -> save bad page (last time) info to
>>> eeprom -> gpu reset -> reserve vram for bad page (this time)
>>
>> Even though if I am wrong on the first point this is irrelevant but
>> still - Why saving bad page is from last time ? See
>> https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c?h=amd-staging-drm-next#n1436
>> - the save count is the latest so as the content of
>> data->bps[control->num_recs] up to data->bps[control->num_recs +
>> save_count] as those are updated in
>> amdgpu_umc_process_ras_data_cb->amdgpu_ras_add_bad_pages which is
>> called right BEFORE amdgpu_ras_reset_gpu->amdgpu_ras_save_bad_pages
>> in the interrupt sequence
>>
>> Andrey
>>
>>
>>>
>>> Is that right? Saving bad page (this time) info to eeprom is
>>> delayed to the next time that bad page is detected? But the time of
>>> detecting bad page is random.
>>> I think the bad page info would be lost without saving to eeprom if
>>> power off occurs.
>>>
>>> detect bad page (this time) -> save bad page (last time) info to
>>> eeprom -> gpu reset -> reserve vram for bad page (this time) ->
>>> poweroff/system reset (and bad page info (this time) is lost)
>>>
>>> Tao
>>>
>>>> -----Original Message-----
>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
>>>> Andrey Grodzovsky
>>>> Sent: 2019年11月14日 6:45
>>>> To: amd-gfx at lists.freedesktop.org
>>>> Cc: alexdeucher at gmail.com; Grodzovsky, Andrey
>>>> <Andrey.Grodzovsky at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>;
>>>> Zhang, Hawking <Hawking.Zhang at amd.com>
>>>> Subject: [PATCH 1/2] drm/amdgpu/ras: Extract
>>>> amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
>>>>
>>>> We want to be be able to call them independently from one another
>>>> so that
>>>> before GPU reset just amdgpu_ras_save_bad_pages is called.
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 ++++----------
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4 +++-
>>>> 2 files changed, 7 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>>> index 4044834..600a86d 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>>> @@ -1421,7 +1421,7 @@ int amdgpu_ras_add_bad_pages(struct
>>>> amdgpu_device *adev,
>>>> * write error record array to eeprom, the function should be
>>>> * protected by recovery_lock
>>>> */
>>>> -static int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
>>>> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
>>>> {
>>>> struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>>>> struct ras_err_handler_data *data;
>>>> @@ -1520,7 +1520,7 @@ int amdgpu_ras_reserve_bad_pages(struct
>>>> amdgpu_device *adev)
>>>> struct ras_err_handler_data *data;
>>>> uint64_t bp;
>>>> struct amdgpu_bo *bo = NULL;
>>>> - int i, ret = 0;
>>>> + int i;
>>>>
>>>> if (!con || !con->eh_data)
>>>> return 0;
>>>> @@ -1548,12 +1548,9 @@ int amdgpu_ras_reserve_bad_pages(struct
>>>> amdgpu_device *adev)
>>>> data->last_reserved = i + 1;
>>>> bo = NULL;
>>>> }
>>>> -
>>>> - /* continue to save bad pages to eeprom even reesrve_vram
>>>> fails */
>>>> - ret = amdgpu_ras_save_bad_pages(adev);
>>>> out:
>>>> mutex_unlock(&con->recovery_lock);
>>>> - return ret;
>>>> + return 0;
>>>> }
>>>>
>>>> /* called when driver unload */
>>>> @@ -1615,14 +1612,11 @@ int amdgpu_ras_recovery_init(struct
>>>> amdgpu_device *adev)
>>>> ret = amdgpu_ras_load_bad_pages(adev);
>>>> if (ret)
>>>> goto free;
>>>> - ret = amdgpu_ras_reserve_bad_pages(adev);
>>>> - if (ret)
>>>> - goto release;
>>>> + amdgpu_ras_reserve_bad_pages(adev);
>>>> }
>>>>
>>>> return 0;
>>>>
>>>> -release:
>>>> amdgpu_ras_release_bad_pages(adev);
>>>> free:
>>>> kfree((*data)->bps);
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>> index f80fd34..12b0797 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>>> @@ -492,6 +492,8 @@ unsigned long
>>>> amdgpu_ras_query_error_count(struct amdgpu_device *adev, int
>>>> amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>>>> struct eeprom_table_record *bps, int pages);
>>>>
>>>> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
>>>> +
>>>> int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev);
>>>>
>>>> static inline int amdgpu_ras_reset_gpu(struct amdgpu_device
>>>> *adev, @@ -
>>>> 503,7 +505,7 @@ static inline int amdgpu_ras_reset_gpu(struct
>>>> amdgpu_device *adev,
>>>> * i2c may be unstable in gpu reset
>>>> */
>>>> if (in_task())
>>>> - amdgpu_ras_reserve_bad_pages(adev);
>>>> + amdgpu_ras_save_bad_pages(adev);
>>>>
>>>> if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>>>> schedule_work(&ras->recovery_work);
>>>> --
>>>> 2.7.4
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
More information about the amd-gfx
mailing list