[PATCH 1/2] drm/amdgpu/ras: Extract amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
Zhou1, Tao
Tao.Zhou1 at amd.com
Fri Nov 15 02:42:17 UTC 2019
Sorry, I confused reserve_bad_pages with add_bad_pages, you're right.
If vram allocated info could be reserved after gpu reset, it seems your patch is unnecessary.
If there is a risk that the info is lost, I think [data->0, data->count) pages instead of only [data->last_reserved, data->count) pages should be reserved.
Tao
> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
> Sent: 2019年11月15日 0:42
> To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org;
> Koenig, Christian <Christian.Koenig at amd.com>
> Cc: alexdeucher at gmail.com; Chen, Guchun <Guchun.Chen at amd.com>;
> Zhang, Hawking <Hawking.Zhang at amd.com>
> Subject: Re: [PATCH 1/2] drm/amdgpu/ras: Extract
> amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
>
>
> On 11/13/19 10:09 PM, Zhou1, Tao wrote:
> > Two questions:
> >
> > 1. "we lose all reservation during ASIC reset"
> > Are you sure of it? I remember the content of vram may be lost after reset
> but the allocated status could be reserved.
>
> Yea, now that I am thinking of it I think i might have confused it with BO
> content recovery in amdgpu_device_recover_vram for shadow buffers which
> are page tables only but just for VRAM reservation status this might be
> irrelevant... Christian - can you confirm Tao is correct on this ?
>
> >
> > 2. You change the bad page handle flow from:
> >
> > detect bad page -> reserve vram for bad page -> save bad page info to
> > eeprom -> gpu reset
> >
> > to:
> >
> > detect bad page (this time) -> save bad page (last time) info to
> > eeprom -> gpu reset -> reserve vram for bad page (this time)
>
> Even though if I am wrong on the first point this is irrelevant but still - Why
> saving bad page is from last time ? See
> https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdg
> pu/amdgpu_ras.c?h=amd-staging-drm-next#n1436
> - the save count is the latest so as the content of
> data->bps[control->num_recs] up to data->bps[control->num_recs +
> save_count] as those are updated in
> amdgpu_umc_process_ras_data_cb->amdgpu_ras_add_bad_pages which is
> called right BEFORE amdgpu_ras_reset_gpu->amdgpu_ras_save_bad_pages
> in the interrupt sequence
>
> Andrey
>
>
> >
> > Is that right? Saving bad page (this time) info to eeprom is delayed to the
> next time that bad page is detected? But the time of detecting bad page is
> random.
> > I think the bad page info would be lost without saving to eeprom if power
> off occurs.
> >
> > detect bad page (this time) -> save bad page (last time) info to
> > eeprom -> gpu reset -> reserve vram for bad page (this time) ->
> > poweroff/system reset (and bad page info (this time) is lost)
> >
> > Tao
> >
> >> -----Original Message-----
> >> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
> >> Andrey Grodzovsky
> >> Sent: 2019年11月14日 6:45
> >> To: amd-gfx at lists.freedesktop.org
> >> Cc: alexdeucher at gmail.com; Grodzovsky, Andrey
> >> <Andrey.Grodzovsky at amd.com>; Chen, Guchun
> <Guchun.Chen at amd.com>;
> >> Zhang, Hawking <Hawking.Zhang at amd.com>
> >> Subject: [PATCH 1/2] drm/amdgpu/ras: Extract
> >> amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages
> >>
> >> We want to be be able to call them independently from one another so
> >> that before GPU reset just amdgpu_ras_save_bad_pages is called.
> >>
> >> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
> >> ---
> >> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 ++++----------
> >> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4 +++-
> >> 2 files changed, 7 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> index 4044834..600a86d 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> @@ -1421,7 +1421,7 @@ int amdgpu_ras_add_bad_pages(struct
> >> amdgpu_device *adev,
> >> * write error record array to eeprom, the function should be
> >> * protected by recovery_lock
> >> */
> >> -static int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
> >> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
> >> {
> >> struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> >> struct ras_err_handler_data *data; @@ -1520,7 +1520,7 @@ int
> >> amdgpu_ras_reserve_bad_pages(struct
> >> amdgpu_device *adev)
> >> struct ras_err_handler_data *data;
> >> uint64_t bp;
> >> struct amdgpu_bo *bo = NULL;
> >> - int i, ret = 0;
> >> + int i;
> >>
> >> if (!con || !con->eh_data)
> >> return 0;
> >> @@ -1548,12 +1548,9 @@ int amdgpu_ras_reserve_bad_pages(struct
> >> amdgpu_device *adev)
> >> data->last_reserved = i + 1;
> >> bo = NULL;
> >> }
> >> -
> >> - /* continue to save bad pages to eeprom even reesrve_vram fails */
> >> - ret = amdgpu_ras_save_bad_pages(adev);
> >> out:
> >> mutex_unlock(&con->recovery_lock);
> >> - return ret;
> >> + return 0;
> >> }
> >>
> >> /* called when driver unload */
> >> @@ -1615,14 +1612,11 @@ int amdgpu_ras_recovery_init(struct
> >> amdgpu_device *adev)
> >> ret = amdgpu_ras_load_bad_pages(adev);
> >> if (ret)
> >> goto free;
> >> - ret = amdgpu_ras_reserve_bad_pages(adev);
> >> - if (ret)
> >> - goto release;
> >> + amdgpu_ras_reserve_bad_pages(adev);
> >> }
> >>
> >> return 0;
> >>
> >> -release:
> >> amdgpu_ras_release_bad_pages(adev);
> >> free:
> >> kfree((*data)->bps);
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >> index f80fd34..12b0797 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >> @@ -492,6 +492,8 @@ unsigned long
> >> amdgpu_ras_query_error_count(struct amdgpu_device *adev, int
> >> amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
> >> struct eeprom_table_record *bps, int pages);
> >>
> >> +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
> >> +
> >> int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev);
> >>
> >> static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev,
> >> @@ -
> >> 503,7 +505,7 @@ static inline int amdgpu_ras_reset_gpu(struct
> >> amdgpu_device *adev,
> >> * i2c may be unstable in gpu reset
> >> */
> >> if (in_task())
> >> - amdgpu_ras_reserve_bad_pages(adev);
> >> + amdgpu_ras_save_bad_pages(adev);
> >>
> >> if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
> >> schedule_work(&ras->recovery_work);
> >> --
> >> 2.7.4
> >>
> >> _______________________________________________
> >> amd-gfx mailing list
> >> amd-gfx at lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
More information about the amd-gfx
mailing list