[PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page
Zhang, Hawking
Hawking.Zhang at amd.com
Tue Feb 14 10:03:26 UTC 2023
[AMD Official Use Only - General]
-EINVAL looks better than EEXIST, but it still doesn't fit the case? Alternatively, Can we compare the count before and after amdgpu_ras_add_bad_pages to decide whether reset ue_count to 0 or not. That could be straightforward than struggling for returning which error code in this specific case.
Regards,
Hawking
-----Original Message-----
From: Yang, Stanley <Stanley.Yang at amd.com>
Sent: Tuesday, February 14, 2023 10:42
To: Zhou1, Tao <Tao.Zhou1 at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; amd-gfx at lists.freedesktop.org; Chai, Thomas <YiPeng.Chai at amd.com>; Li, Candice <Candice.Li at amd.com>
Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page
[AMD Official Use Only - General]
> -----Original Message-----
> From: Zhou1, Tao <Tao.Zhou1 at amd.com>
> Sent: Monday, February 13, 2023 4:25 PM
> To: Zhang, Hawking <Hawking.Zhang at amd.com>; amd-
> gfx at lists.freedesktop.org; Yang, Stanley <Stanley.Yang at amd.com>; Chai,
> Thomas <YiPeng.Chai at amd.com>; Li, Candice <Candice.Li at amd.com>
> Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no
> new bad page
>
> [AMD Official Use Only - General]
>
> > -----Original Message-----
> > From: Zhang, Hawking <Hawking.Zhang at amd.com>
> > Sent: Friday, February 10, 2023 11:02 PM
> > To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org;
> > Yang, Stanley <Stanley.Yang at amd.com>; Chai, Thomas
> > <YiPeng.Chai at amd.com>; Li, Candice <Candice.Li at amd.com>
> > Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if
> > no new bad page
> >
> > [AMD Official Use Only - General]
> >
> > + /* if no new bad page is found, no need to
> > + increase ue count
> */
> > + if (ret == -EEXIST)
> > + err_data->ue_count = 0;
> >
> > Returning EEXIST in such case is not reasonable. Might consider
> > return a bool for
> > amdgpu_ras_add_bad_pages: true means it does add some new bad page;
> > false means it doesn't change anything.
> >
> > Regards,
> > Hawking
>
> [Tao] but it can returns -ENOMEM, amdgpu_ras_load_bad_pages and
> amdgpu_ras_recovery_init also need to check the return value. I'd like
> to keep the type of return value unchanged.
> How about -EINVAL?
Stanley: How about return -EALREADY?
Regards,
Stanley
>
> >
> > -----Original Message-----
> > From: Zhou1, Tao <Tao.Zhou1 at amd.com>
> > Sent: Friday, February 10, 2023 16:45
> > To: amd-gfx at lists.freedesktop.org; Zhang, Hawking
> > <Hawking.Zhang at amd.com>; Yang, Stanley <Stanley.Yang at amd.com>;
> Chai,
> > Thomas <YiPeng.Chai at amd.com>; Li, Candice <Candice.Li at amd.com>
> > Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
> > Subject: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no
> new
> > bad page
> >
> > If a UMC bad page is reserved but not freed by an application, the
> > application may trigger uncorrectable error repeatly by accessing the page.
> >
> > Signed-off-by: Tao Zhou <tao.zhou1 at amd.com>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++++++++-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +++++-
> > 2 files changed, 13 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index e85c4689ce2c..eafe01a24349 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -2049,7 +2049,7 @@ int amdgpu_ras_add_bad_pages(struct
> > amdgpu_device *adev, {
> > struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> > struct ras_err_handler_data *data;
> > - int ret = 0;
> > + int ret = 0, old_cnt;
> > uint32_t i;
> >
> > if (!con || !con->eh_data || !bps || pages <= 0) @@ -2060,6
> > +2060,8 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device
> *adev,
> > if (!data)
> > goto out;
> >
> > + old_cnt = data->count;
> > +
> > for (i = 0; i < pages; i++) {
> > if (amdgpu_ras_check_bad_page_unlock(con,
> > bps[i].retired_page <<
> > AMDGPU_GPU_PAGE_SHIFT)) @@ -2079,6
> > +2081,11 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device
> *adev,
> > data->count++;
> > data->space_left--;
> > }
> > +
> > + /* all pages have been reserved before, no new bad page */
> > + if (old_cnt == data->count)
> > + ret = -EEXIST;
> > +
> > out:
> > mutex_unlock(&con->recovery_lock);
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > index 1c7fcb4f2380..772c431e4065 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > @@ -145,8 +145,12 @@ static int
> amdgpu_umc_do_page_retirement(struct
> > amdgpu_device *adev,
> >
> > if ((amdgpu_bad_page_threshold != 0) &&
> > err_data->err_addr_cnt) {
> > - amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
> > + ret = amdgpu_ras_add_bad_pages(adev,
> > + err_data->err_addr,
> >
> > err_data->err_addr_cnt);
> > + /* if no new bad page is found, no need to
> > + increase ue count
> */
> > + if (ret == -EEXIST)
> > + err_data->ue_count = 0;
> > +
> > amdgpu_ras_save_bad_pages(adev);
> >
> > amdgpu_dpm_send_hbm_bad_pages_num(adev, con-
> > >eeprom_control.ras_num_recs);
> > --
> > 2.35.1
> >
>
More information about the amd-gfx
mailing list