[PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter

Alex Deucher alexdeucher at gmail.com
Thu Jul 23 13:39:35 UTC 2020


Also note that module parameters are global.  If you change the
parameter, it changes it for all GPUs in the system.  That may not be
what the customer wants.
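
To make the point concrete, here is a minimal userspace sketch, not driver code. The variable name follows the patch; struct toy_gpu, TOY_DEFAULT_THRESHOLD (an invented value), and both helpers are hypothetical and exist only for illustration. It assumes a GPU is retired once its bad page count exceeds the threshold, as the patch description says.

```c
#include <stdbool.h>

/* Stand-in for the module parameter in the patch: one value for the
 * whole module, no matter how many GPUs it drives. -1 means "auto". */
static int amdgpu_bad_page_threshold = -1;

/* Invented placeholder for the "typical" value used when the
 * parameter is -1; the real default is not stated in this thread. */
#define TOY_DEFAULT_THRESHOLD 100

/* Hypothetical per-device state, not the real struct amdgpu_device. */
struct toy_gpu {
	const char *name;
	int bad_pages;	/* faulty pages reported by ECC so far */
};

/* Resolve the -1 "auto" setting to the placeholder default. */
static int effective_threshold(void)
{
	if (amdgpu_bad_page_threshold == -1)
		return TOY_DEFAULT_THRESHOLD;
	return amdgpu_bad_page_threshold;
}

/* Every device is checked against the same global value. */
static bool should_retire(const struct toy_gpu *gpu)
{
	return gpu->bad_pages > effective_threshold();
}
```

With two toy devices at 5 and 50 bad pages, setting the single value to 10 retires only the second one, and setting it to 3 retires both. There is no way in this model to give each device its own limit.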

Alex

On Thu, Jul 23, 2020 at 9:10 AM Christian König
<ckoenig.leichtzumerken at gmail.com> wrote:
>
> I agree with Guchun as well.
>
> When you have a dynamic module parameter and change the bad page
> threshold the GPU might just stop working suddenly.
>
> That is not a good idea as far as I can see.
>
> Regards,
> Christian.
>
> On 23.07.20 at 05:47, Chen, Guchun wrote:
> > [AMD Public Use]
> >
> > Hi Dennis,
> >
> > To be honest, I considered your suggestion when I started the design. My thinking is that, in the real world, the bad page threshold is a static configuration that should be set once at probe time.
> > So a module parameter is an ideal choice for this.
> >
> > Regards,
> > Guchun
> >
> > -----Original Message-----
> > From: Li, Dennis <Dennis.Li at amd.com>
> > Sent: Thursday, July 23, 2020 8:32 AM
> > To: Chen, Guchun <Guchun.Chen at amd.com>; amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; Yang, Stanley <Stanley.Yang at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Clements, John <John.Clements at amd.com>
> > Subject: RE: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter
> >
> > [AMD Official Use Only - Internal Distribution Only]
> >
> > Hi, Guchun,
> >        It would be better to let users change amdgpu_bad_page_threshold through sysfs, so that they do not need to reboot the system when they want to change their strategy.
> >
> > Best Regards
> > Dennis Li
> > -----Original Message-----
> > From: Chen, Guchun <Guchun.Chen at amd.com>
> > Sent: Wednesday, July 22, 2020 11:14 AM
> > To: amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; Li, Dennis <Dennis.Li at amd.com>; Yang, Stanley <Stanley.Yang at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Clements, John <John.Clements at amd.com>
> > Cc: Chen, Guchun <Guchun.Chen at amd.com>
> > Subject: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter
> >
> > bad_page_threshold can be specified to detect and retire a bad GPU if the number of faulty pages exceeds it.
> >
> > When it's -1, RAS will use the typical bad page failure value.
> >
> > Signed-off-by: Guchun Chen <guchun.chen at amd.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 11 +++++++++++
> >   2 files changed, 12 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index 06bfb8658dec..bb83ffb5e26a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -181,6 +181,7 @@ extern uint amdgpu_dm_abm_level;
> >   extern struct amdgpu_mgpu_info mgpu_info;
> >   extern int amdgpu_ras_enable;
> >   extern uint amdgpu_ras_mask;
> > +extern int amdgpu_bad_page_threshold;
> >   extern int amdgpu_async_gfx_ring;
> >   extern int amdgpu_mcbp;
> >   extern int amdgpu_discovery;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index d28b95f721c4..f99671101746 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -161,6 +161,7 @@ struct amdgpu_mgpu_info mgpu_info = {
> >   };
> >   int amdgpu_ras_enable = -1;
> >   uint amdgpu_ras_mask = 0xffffffff;
> > +int amdgpu_bad_page_threshold = -1;
> >
> >   /**
> >    * DOC: vramlimit (int)
> > @@ -801,6 +802,16 @@ module_param_named(tmz, amdgpu_tmz, int, 0444);
> >   MODULE_PARM_DESC(reset_method, "GPU reset method (-1 = auto (default), 0 = legacy, 1 = mode0, 2 = mode1, 3 = mode2, 4 = baco)");
> >   module_param_named(reset_method, amdgpu_reset_method, int, 0444);
> >
> > +/**
> > + * DOC: bad_page_threshold (int)
> > + * Bad page threshold configuration is driven by RMA(Return Merchandise
> > + * Authorization) policy, which is to specify the threshold value of faulty
> > + * pages detected by ECC, which may result in GPU's retirement if total
> > + * faulty pages by ECC exceed threshold value.
> > + */
> > +MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default typical value))");
> > +module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
> > +
> >   static const struct pci_device_id pciidlist[] = {
> >   #ifdef  CONFIG_DRM_AMDGPU_SI
> >       {0x1002, 0x6780, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_TAHITI},
> > --
> > 2.17.1
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx at lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>

