[PATCH 3/4] drm/amdgpu: Add kernel parameter for ignoring bad page threshold

Russell, Kent Kent.Russell at amd.com
Tue Oct 19 18:23:18 UTC 2021


[AMD Official Use Only]



> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling at amd.com>
> Sent: Tuesday, October 19, 2021 2:13 PM
> To: Russell, Kent <Kent.Russell at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Tuikov, Luben <Luben.Tuikov at amd.com>; Joshi, Mukul <Mukul.Joshi at amd.com>
> Subject: Re: [PATCH 3/4] drm/amdgpu: Add kernel parameter for ignoring bad page threshold
> 
> 
> Am 2021-10-19 um 1:50 p.m. schrieb Kent Russell:
> > When a GPU hits the bad_page_threshold, it will not be initialized by
> > the amdgpu driver. This means that the table cannot be cleared, nor can
> > information gathering be performed (getting serial number, BDF, etc).
> > Add an override called ignore_bad_page_threshold that can be set to true
> > to still initialize the GPU, even when the bad page threshold has been
> > reached.
> Do you really need a new parameter for this? Wouldn't it be enough to
> set bad_page_threshold to the VRAM size? You could use a new special
> value (e.g. bad_page_threshold=-2) for that.

Ah interesting. That could definitely work here. I hadn't thought about co-opting another variable. We already check -1, so why not -2? Great insight. Thanks!
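
To make the idea concrete, here is a minimal sketch of how a -2 value could be folded into the existing threshold handling. It is not driver code: the helper names, the 4 KiB page size, and the "one bad page per 100 MiB of VRAM" auto rule are assumptions for illustration, as is treating 0 (retirement disabled) as "never block initialization".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BAD_PAGE_THRESHOLD_AUTO   (-1)  /* existing special value */
#define BAD_PAGE_THRESHOLD_IGNORE (-2)  /* proposed special value (hypothetical name) */

#define SKETCH_PAGE_SIZE               4096ULL                /* assumed page size */
#define SKETCH_AUTO_BYTES_PER_BAD_PAGE (100ULL * 1024 * 1024) /* assumed auto rule */

/* Translate the module parameter into a bad page count threshold. */
static uint64_t sketch_effective_threshold(int param, uint64_t vram_bytes)
{
	assert(param >= BAD_PAGE_THRESHOLD_IGNORE);

	if (param == BAD_PAGE_THRESHOLD_IGNORE)
		/* Threshold equals the number of pages in VRAM. */
		return vram_bytes / SKETCH_PAGE_SIZE;
	if (param == BAD_PAGE_THRESHOLD_AUTO)
		return vram_bytes / SKETCH_AUTO_BYTES_PER_BAD_PAGE;
	return (uint64_t)param;
}

/* Decide whether the GPU should still be initialized. */
static bool sketch_should_init(int param, uint64_t vram_bytes, uint64_t bad_pages)
{
	/* 0 disables bad page retirement; assumed here to skip the check. */
	if (param == 0 || param == BAD_PAGE_THRESHOLD_IGNORE)
		return true;
	return bad_pages < sketch_effective_threshold(param, vram_bytes);
}
```

With something along these lines, booting with amdgpu.bad_page_threshold=-2 would always get as far as initialization, so the table can be cleared and the serial number, BDF, etc. can still be read.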

 Kent

> 
> Regards,
>   Felix
> 
> 
> >
> > Cc: Luben Tuikov <luben.tuikov at amd.com>
> > Cc: Mukul Joshi <Mukul.Joshi at amd.com>
> > Signed-off-by: Kent Russell <kent.russell at amd.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 13 +++++++++++++
> >  2 files changed, 14 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index d58e37fd01f4..b85b67a88a3d 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
> >  extern int amdgpu_ras_enable;
> >  extern uint amdgpu_ras_mask;
> >  extern int amdgpu_bad_page_threshold;
> > +extern bool amdgpu_ignore_bad_page_threshold;
> >  extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
> >  extern int amdgpu_async_gfx_ring;
> >  extern int amdgpu_mcbp;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index 96bd63aeeddd..3e9a7b072888 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -189,6 +189,7 @@ struct amdgpu_mgpu_info mgpu_info = {
> >  int amdgpu_ras_enable = -1;
> >  uint amdgpu_ras_mask = 0xffffffff;
> >  int amdgpu_bad_page_threshold = -1;
> > +bool amdgpu_ignore_bad_page_threshold;
> >  struct amdgpu_watchdog_timer amdgpu_watchdog_timer = {
> >  	.timeout_fatal_disable = false,
> >  	.period = 0x0, /* default to 0x0 (timeout disable) */
> > @@ -880,6 +881,18 @@ module_param_named(reset_method, amdgpu_reset_method, int, 0444);
> >  MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default value), 0 = disable bad page retirement)");
> >  module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
> >
> > +/**
> > + * DOC: ignore_bad_page_threshold (bool) Bad page threshold specifies
> > + * the threshold value of faulty pages detected by RAS ECC. Once the
> > + * threshold is hit, the GPU will not be initialized. Use this parameter
> > + * to ignore the bad page threshold so that information gathering can
> > + * still be performed. This also allows for booting the GPU to clear
> > + * the RAS EEPROM table.
> > + */
> > +
> > +MODULE_PARM_DESC(ignore_bad_page_threshold, "Ignore bad page threshold (false = respect bad page threshold (default value))");
> > +module_param_named(ignore_bad_page_threshold, amdgpu_ignore_bad_page_threshold, bool, 0644);
> > +
> >  MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to setup (8 if set to greater than 8 or less than 0, only affect gfx 8+)");
> >  module_param_named(num_kcq, amdgpu_num_kcq, int, 0444);
> >

