[PATCH 3/3] drm/amdgpu: Add kernel parameter to control use of ECC/EDC.

Panariti, David David.Panariti at amd.com
Mon Jun 26 16:06:27 UTC 2017


>> > I'd suggest setting amdgpu_ecc_flags to AMD_ECC_SUPPORT_BEST by 
>> > default.  That can be our asic specific default setting.  In the 
>> > case of CZ, that will be disabled until we decide to enable EDC by default.
>> [davep] I'm confused.  ECC...BEST will cause EDC to be enabled.
>> I used ECC as the generic term for ECC and EDC, since ECC seems more 
>> basic (EDC is built on top of ECC).
>> If I understand you, we can't do what you want with the current setup.

>I'm saying we make ECC_BEST the default config (feel free to re-name if ECC_DEFAULT).  Each asic can have a different default depending 
>on what features are ready.  So for CZ, we'd make ECC_BEST equivalent to disabling ECC for now.  If a user wants to force it on, they can 
>set ECC_EDC.  Once EDC is stable on CZ, we can make ECC_BEST be equivalent to ECC_EDC.  The way the default (ECC_BEST) always maps 
>to the best available combination in that version of the driver.

That's not how I meant it to work WRT BEST.
Each asic will have a DEFAULT, but that isn't what BEST means.
CZ is a good example (when fully implemented).  DEFAULT for CZ is everything except HALT, since, IMO opinion, most people do not want to hang or reboot.
BEST for CZ would be everything a person most interested in reliability would want, which IMO, includes HALT/reboot.
Similar is if something like performance degradation is really bad, DEFAULT would be OFF. BEST would be ON, e.g., if the user's app doesn't trigger the performance problem.
The BEST bit is in a fixed position, so that customers don't need to worry what bits are needed for the most reliable performance (in our opinion) on a given asic.
And if a customer (or developer) wants some arbitrary set of features, they can set bits as they want.

I think DEFAULT will make most people happy.
BEST allows people who are interested in everything they can get, regardless of any issues that brings with it. It is requested simply by using a fixed param value (0x01) for any asic.
This probably should not include features that have any kind of fatal flaw such as the Vega10 HBM ECC issue.  When fixed, it can be added to DEFAULT.
And allowing per-feature control allows anyone to do precisely what they want.
"Effort" increases as the number of interested users decreases.

Using defines in the init code will be a problem if there is more than one kind of asic involved or a single asic that the user wants to use with different parameters.  However, this doesn't seem to be a high priority.
If we do want to worry about it, then we'll need to put the values into the amdgpu_gfx struct.

regards,
davep

> -----Original Message-----
> From: Deucher, Alexander
> Sent: Tuesday, June 06, 2017 6:16 PM
> To: Panariti, David <David.Panariti at amd.com>; amd-
> gfx at lists.freedesktop.org
> Subject: RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control use
> of ECC/EDC.
> 
> > -----Original Message-----
> > From: Panariti, David
> > Sent: Tuesday, June 06, 2017 5:50 PM
> > To: Deucher, Alexander; amd-gfx at lists.freedesktop.org
> > Subject: RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control
> > use of ECC/EDC.
> >
> > Rather than inlining this in a number of places, Re verbosity:
> > I've worked in embedded environments and when dealing with
> > intermittent problems it's nice to have all of the information ASAP
> > rather than waiting for a problem to reoccur, especially if it's very
> intermittent.
> > I would've preferred more.
> > Since it only shows up happens on CZ, it adds little to the output.
> > I like to show the reasons why EDC didn't happen, hence the backwards
> > looking messages.
> > In this particular case,  without the "... not requested..." we can't
> > tell if it was the flags or the ring being unready that made us bail earlier.
> 
> I'm fine with a message about EDC either being enabled or disabled, but a
> bunch of random debug statements along the way are too much.  They tend
> to just cause confusion and clutter up the logs.
> 
> >
> > > -----Original Message-----
> > > From: Deucher, Alexander
> > > Sent: Tuesday, June 06, 2017 5:22 PM
> > > To: Panariti, David <David.Panariti at amd.com>; amd-
> > > gfx at lists.freedesktop.org
> > > Cc: Panariti, David <David.Panariti at amd.com>
> > > Subject: RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control
> > use
> > > of ECC/EDC.
> > >
> > > > -----Original Message-----
> > > > From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On
> > Behalf
> > > > Of David Panariti
> > > > Sent: Tuesday, June 06, 2017 4:33 PM
> > > > To: amd-gfx at lists.freedesktop.org
> > > > Cc: Panariti, David
> > > > Subject: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control
> > > > use of ECC/EDC.
> > > >
> > > > Allow various kinds of memory integrity methods (e.g. ECC/EDC) to
> > > > be enabled or disabled.  By default, all features are disabled.
> > > >
> > > > EDC is Error Detection and Correction.  It can detect ECC errors
> > > > and do 0 or more of: count SEC (single error corrected) and DED
> > > > (double error detected, i.e. uncorrected ECC error), halt the
> > > > affected block,
> > > interrupt the CPU.
> > > > Currently, only counting errors is supported.
> > > >
> > > > Fixed a whitespace error.
> > > >
> > > > Signed-off-by: David Panariti <David.Panariti at amd.com>
> > > > ---
> > > >  drivers/gpu/drm/amd/amdgpu/amdgpu.h      |  1 +
> > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c  |  4 ++++
> > > >  drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c    | 28
> > > > ++++++++++++++++++++++++++--
> > > >  drivers/gpu/drm/amd/include/amd_shared.h | 14 ++++++++++++++
> > > >  4 files changed, 45 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > index a6f51eb..3e930ee 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > @@ -116,6 +116,7 @@ extern int amdgpu_cntl_sb_buf_per_se;  extern
> > int
> > > > amdgpu_param_buf_per_se;  extern int amdgpu_job_hang_limit;
> > extern
> > > > int amdgpu_lbpw;
> > > > +extern unsigned amdgpu_ecc_flags;
> > > >
> > > >  #define AMDGPU_DEFAULT_GTT_SIZE_MB		3072ULL /*
> > 3GB by
> > > > default */
> > > >  #define AMDGPU_WAIT_IDLE_TIMEOUT_IN_MS	        3000
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > index c4825ff..972660d 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > @@ -115,6 +115,7 @@ int amdgpu_cntl_sb_buf_per_se = 0;  int
> > > > amdgpu_param_buf_per_se = 0;  int amdgpu_job_hang_limit = 0;  int
> > > > amdgpu_lbpw = -1;
> > > > +unsigned amdgpu_ecc_flags = 0;
> > >
> > > I'd suggest setting amdgpu_ecc_flags to AMD_ECC_SUPPORT_BEST by
> > > default.  That can be our asic specific default setting.  In the
> > > case of CZ, that will be disabled until we decide to enable EDC by default.
> > [davep] I'm confused.  ECC...BEST will cause EDC to be enabled.
> > I used ECC as the generic term for ECC and EDC, since ECC seems more
> > basic (EDC is built on top of ECC).
> > If I understand you, we can't do what you want with the current setup.
> 
> I'm saying we make ECC_BEST the default config (feel free to re-name if
> ECC_DEFAULT).  Each asic can have a different default depending on what
> features are ready.  So for CZ, we'd make ECC_BEST equivalent to disabling
> ECC for now.  If a user wants to force it on, they can set ECC_EDC.  Once EDC
> is stable on CZ, we can make ECC_BEST be equivalent to ECC_EDC.  The way
> the default (ECC_BEST) always maps to the best available combination in that
> version of the driver.
> 
> E.g., in the current gfx8 code:
> 
> if ((amdgpu_ecc_flags & ECC_EDC) && edc_fuse_enabled) {
>     // enable EDC
>     goto enable_edc;
> } else {
>     // disable EDC
>     return 0;
> }
> 
> Then if we want to enable EDC by default, we'd change the code like so:
> 
> if (((amdgpu_ecc_flags == ECC_EDC) || (amdgpu_ecc_flags & ECC_EDC)) &&
> edc_fuse_enabled) {
>     // enable EDC
>     goto enable_edc;
> } else {
>     // disable EDC
>     return 0;
> }
> 
> That way we can have different ECC features enabled by default for each asic
> family without needing to specify command line options.
> 
> Alex
> 
> >
> > >
> > > >
> > > >  MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in
> > > > megabytes");  module_param_named(vramlimit, amdgpu_vram_limit,
> > int,
> > > > 0600); @@ -795,6 +796,9 @@ static struct pci_driver
> > > > amdgpu_kms_pci_driver = {
> > > >  	.driver.pm = &amdgpu_pm_ops,
> > > >  };
> > > >
> > > > +MODULE_PARM_DESC(ecc_flags, "ECC/EDC enable flags (0 = disable
> > > > ECC/EDC (default))");
> > > > +module_param_named(ecc_flags, amdgpu_ecc_flags, uint, 0444);
> > > > +
> > > >
> > > >
> > > >  static int __init amdgpu_init(void) diff --git
> > > > a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
> > > > b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
> > > > index 46e766e..3b5685c 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
> > > > @@ -1824,7 +1824,17 @@ static int
> > > > gfx_v8_0_do_edc_gpr_workarounds(struct amdgpu_device *adev)
> > > >  	if (dis_bit) {
> > > >  		/* On Carrizo, EDC may be permanently disabled by a fuse. */
> > > >  		DRM_INFO("CZ EDC hardware is disabled, GC_EDC_CONFIG:
> > > > 0x%08x.\n",
> > > > -			tmp);
> > > > +			 tmp);
> > > > +		return 0;
> > > > +	}
> > > > +
> > > > +	/*
> > > > +	 * Check if EDC has been requested by a kernel parameter.
> > > > +	 * For Carrizo, EDC is the best/safest mode WRT error handling.
> > > > +	 */
> > > > +	if (!(amdgpu_ecc_flags
> > > > +	      & (AMD_ECC_SUPPORT_BEST | AMD_ECC_SUPPORT_EDC))) {
> > > > +		DRM_INFO("EDC support has not been requested.\n");
> > >
> > > Can drop this debugging statement or make it DRM_DEBUG.
> > [davep] See top post.
> > >
> > > >  		return 0;
> > > >  	}
> > > >
> > > > @@ -1962,6 +1972,20 @@ static int
> > > > gfx_v8_0_do_edc_gpr_workarounds(struct amdgpu_device *adev)
> > > >  		goto fail;
> > > >  	}
> > > >
> > > > +	/* 00 - GB_EDC_DED_MODE_LOG: Count DED errors but do not halt
> > > > */
> > > > +	tmp = REG_SET_FIELD(tmp, GB_EDC_MODE, DED_MODE, 0);
> > > > +	/* Do not propagate the errors to the next block. */
> > > > +	tmp = REG_SET_FIELD(tmp, GB_EDC_MODE, PROP_FED, 0);
> > > > +	WREG32(mmGB_EDC_MODE, tmp);
> > > > +
> > > > +	tmp = RREG32(mmCC_GC_EDC_CONFIG);
> > > > +
> > > > +	/*
> > > > +	 * Clear EDC_DISABLE bit so the counters are available.
> > > > +	 */
> > > > +	tmp = REG_SET_FIELD(tmp, CC_GC_EDC_CONFIG, DIS_EDC, 0);
> > > > +	WREG32(mmCC_GC_EDC_CONFIG, tmp);
> > > > +
> > > >  	gfx_v8_0_edc_clear_counters(adev);
> > > >
> > > >  fail:
> > > > @@ -5548,7 +5572,7 @@ static int gfx_v8_0_pre_soft_reset(void
> > *handle)
> > > >  		gfx_v8_0_cp_compute_enable(adev, false);
> > > >  	}
> > > >
> > > > -       return 0;
> > > > +	return 0;
> > > >  }
> > > >
> > > >  static int gfx_v8_0_soft_reset(void *handle) diff --git
> > > > a/drivers/gpu/drm/amd/include/amd_shared.h
> > > > b/drivers/gpu/drm/amd/include/amd_shared.h
> > > > index 0f58e95..1cf30aa 100644
> > > > --- a/drivers/gpu/drm/amd/include/amd_shared.h
> > > > +++ b/drivers/gpu/drm/amd/include/amd_shared.h
> > > > @@ -187,6 +187,20 @@ enum amd_fan_ctrl_mode {
> > > >  #define AMD_PG_SUPPORT_GFX_QUICK_MG		(1 << 11)
> > > >  #define AMD_PG_SUPPORT_GFX_PIPELINE		(1 << 12)
> > > >
> > > > +/*
> > > > + * ECC flags
> > > > + * Allows the user to choose what kind of error
> > > > +detection/correction is
> > > > used.
> > > > + * Currently, EDC is supported on Carrizo.
> > > > + *
> > > > + * The AMD_ECC_SUPPORT_BEST bit is used to allow a user to have
> > > > + the
> > > > driver
> > > > + * set what it thinks is best/safest mode.  This may not be the
> > > > +same as the
> > > > + * default, depending on the GPU and the application.
> > > > + * Using a single bit makes it easy to request the best support
> > > > +without
> > > > + * needing to know all currently supported modes.
> > > > + */
> > > > +#define AMD_ECC_SUPPORT_BEST			(1 << 0)
> > > > +#define AMD_ECC_SUPPORT_EDC			(1 << 1)
> > > > +
> > > >  enum amd_pm_state_type {
> > > >  	/* not used for dpm */
> > > >  	POWER_STATE_TYPE_DEFAULT,
> > > > --
> > > > 2.7.4
> > > >
> > > > _______________________________________________
> > > > amd-gfx mailing list
> > > > amd-gfx at lists.freedesktop.org
> > > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx


More information about the amd-gfx mailing list