[PATCH 3/3] drm/amdgpu: Add kernel parameter to control use of ECC/EDC.

Bridgman, John John.Bridgman at amd.com
Mon Jun 26 19:12:15 UTC 2017


Agreed... one person's "best" is another person's "OMG I didn't want that". IMO we should have bits correspond to specific options as much as possible, modulo HW capabilities.

>-----Original Message-----
>From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf Of
>Xie, AlexBin
>Sent: Monday, June 26, 2017 2:12 PM
>To: Panariti, David; Deucher, Alexander; amd-gfx at lists.freedesktop.org
>Subject: RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control use
>of ECC/EDC.
>
>Hi,
>
>I have not checked the background of this discussion very closely yet. And you
>might have known the following.
>
>Customers may not want the default setting to change meaning. This is like an
>API.
>Example: The application and its environment is already set up and tested.
>Then if customer updates driver, suddenly driver has some new behavior?
>Certain serious application definitely does not accept this.
>
>IMHO, it is better to avoid vague concepts like "best". It will become rather
>difficult to define what is best when there are multiple customers with
>different requirements. Driver is to provide a feature or mechanism. "Best"
>sounds like a policy or a preference from driver side.
>
>In my pass work, I generally use default for two cases:
>1. The default is the most conservative option, which must work. Then the
>application can choose advanced features by choosing other parameter
>value/option.
>2. The default parameter is the compatible behavior before introducing this
>parameter/option.
>
>Regards,
>Alex Bin
>
>
>-----Original Message-----
>From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf Of
>Panariti, David
>Sent: Monday, June 26, 2017 12:06 PM
>To: Deucher, Alexander <Alexander.Deucher at amd.com>; amd-
>gfx at lists.freedesktop.org
>Subject: RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control use
>of ECC/EDC.
>
>>> > I'd suggest setting amdgpu_ecc_flags to AMD_ECC_SUPPORT_BEST by
>>> > default.  That can be our asic specific default setting.  In the
>>> > case of CZ, that will be disabled until we decide to enable EDC by default.
>>> [davep] I'm confused.  ECC...BEST will cause EDC to be enabled.
>>> I used ECC as the generic term for ECC and EDC, since ECC seems more
>>> basic (EDC is built on top of ECC).
>>> If I understand you, we can't do what you want with the current setup.
>
>>I'm saying we make ECC_BEST the default config (feel free to re-name if
>>ECC_DEFAULT).  Each asic can have a different default depending on what
>>features are ready.  So for CZ, we'd make ECC_BEST equivalent to
>>disabling ECC for now.  If a user wants to force it on, they can set ECC_EDC.
>Once EDC is stable on CZ, we can make ECC_BEST be equivalent to ECC_EDC.
>The way the default (ECC_BEST) always maps to the best available
>combination in that version of the driver.
>
>That's not how I meant it to work WRT BEST.
>Each asic will have a DEFAULT, but that isn't what BEST means.
>CZ is a good example (when fully implemented).  DEFAULT for CZ is everything
>except HALT, since, IMO opinion, most people do not want to hang or reboot.
>BEST for CZ would be everything a person most interested in reliability would
>want, which IMO, includes HALT/reboot.
>Similar is if something like performance degradation is really bad, DEFAULT
>would be OFF. BEST would be ON, e.g., if the user's app doesn't trigger the
>performance problem.
>The BEST bit is in a fixed position, so that customers don't need to worry what
>bits are needed for the most reliable performance (in our opinion) on a given
>asic.
>And if a customer (or developer) wants some arbitrary set of features, they
>can set bits as they want.
>
>I think DEFAULT will make most people happy.
>BEST allows people who are interested in everything they can get, regardless
>of any issues that brings with it. It is requested simply by using a fixed param
>value (0x01) for any asic.
>This probably should not include features that have any kind of fatal flaw such
>as the Vega10 HBM ECC issue.  When fixed, it can be added to DEFAULT.
>And allowing per-feature control allows anyone to do precisely what they
>want.
>"Effort" increases as the number of interested users decreases.
>
>Using defines in the init code will be a problem if there is more than one kind
>of asic involved or a single asic that the user wants to use with different
>parameters.  However, this doesn't seem to be a high priority.
>If we do want to worry about it, then we'll need to put the values into the
>amdgpu_gfx struct.
>
>regards,
>davep
>
>> -----Original Message-----
>> From: Deucher, Alexander
>> Sent: Tuesday, June 06, 2017 6:16 PM
>> To: Panariti, David <David.Panariti at amd.com>; amd-
>> gfx at lists.freedesktop.org
>> Subject: RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control
>> use of ECC/EDC.
>>
>> > -----Original Message-----
>> > From: Panariti, David
>> > Sent: Tuesday, June 06, 2017 5:50 PM
>> > To: Deucher, Alexander; amd-gfx at lists.freedesktop.org
>> > Subject: RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control
>> > use of ECC/EDC.
>> >
>> > Rather than inlining this in a number of places, Re verbosity:
>> > I've worked in embedded environments and when dealing with
>> > intermittent problems it's nice to have all of the information ASAP
>> > rather than waiting for a problem to reoccur, especially if it's
>> > very
>> intermittent.
>> > I would've preferred more.
>> > Since it only shows up happens on CZ, it adds little to the output.
>> > I like to show the reasons why EDC didn't happen, hence the
>> > backwards looking messages.
>> > In this particular case,  without the "... not requested..." we
>> > can't tell if it was the flags or the ring being unready that made us bail
>earlier.
>>
>> I'm fine with a message about EDC either being enabled or disabled,
>> but a bunch of random debug statements along the way are too much.
>> They tend to just cause confusion and clutter up the logs.
>>
>> >
>> > > -----Original Message-----
>> > > From: Deucher, Alexander
>> > > Sent: Tuesday, June 06, 2017 5:22 PM
>> > > To: Panariti, David <David.Panariti at amd.com>; amd-
>> > > gfx at lists.freedesktop.org
>> > > Cc: Panariti, David <David.Panariti at amd.com>
>> > > Subject: RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to
>> > > control
>> > use
>> > > of ECC/EDC.
>> > >
>> > > > -----Original Message-----
>> > > > From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On
>> > Behalf
>> > > > Of David Panariti
>> > > > Sent: Tuesday, June 06, 2017 4:33 PM
>> > > > To: amd-gfx at lists.freedesktop.org
>> > > > Cc: Panariti, David
>> > > > Subject: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control
>> > > > use of ECC/EDC.
>> > > >
>> > > > Allow various kinds of memory integrity methods (e.g. ECC/EDC)
>> > > > to be enabled or disabled.  By default, all features are disabled.
>> > > >
>> > > > EDC is Error Detection and Correction.  It can detect ECC errors
>> > > > and do 0 or more of: count SEC (single error corrected) and DED
>> > > > (double error detected, i.e. uncorrected ECC error), halt the
>> > > > affected block,
>> > > interrupt the CPU.
>> > > > Currently, only counting errors is supported.
>> > > >
>> > > > Fixed a whitespace error.
>> > > >
>> > > > Signed-off-by: David Panariti <David.Panariti at amd.com>
>> > > > ---
>> > > >  drivers/gpu/drm/amd/amdgpu/amdgpu.h      |  1 +
>> > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c  |  4 ++++
>> > > >  drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c    | 28
>> > > > ++++++++++++++++++++++++++--
>> > > >  drivers/gpu/drm/amd/include/amd_shared.h | 14 ++++++++++++++
>> > > >  4 files changed, 45 insertions(+), 2 deletions(-)
>> > > >
>> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> > > > index a6f51eb..3e930ee 100644
>> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> > > > @@ -116,6 +116,7 @@ extern int amdgpu_cntl_sb_buf_per_se;
>> > > > extern
>> > int
>> > > > amdgpu_param_buf_per_se;  extern int amdgpu_job_hang_limit;
>> > extern
>> > > > int amdgpu_lbpw;
>> > > > +extern unsigned amdgpu_ecc_flags;
>> > > >
>> > > >  #define AMDGPU_DEFAULT_GTT_SIZE_MB		3072ULL /*
>> > 3GB by
>> > > > default */
>> > > >  #define AMDGPU_WAIT_IDLE_TIMEOUT_IN_MS	        3000
>> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> > > > index c4825ff..972660d 100644
>> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> > > > @@ -115,6 +115,7 @@ int amdgpu_cntl_sb_buf_per_se = 0;  int
>> > > > amdgpu_param_buf_per_se = 0;  int amdgpu_job_hang_limit = 0;
>> > > > int amdgpu_lbpw = -1;
>> > > > +unsigned amdgpu_ecc_flags = 0;
>> > >
>> > > I'd suggest setting amdgpu_ecc_flags to AMD_ECC_SUPPORT_BEST by
>> > > default.  That can be our asic specific default setting.  In the
>> > > case of CZ, that will be disabled until we decide to enable EDC by default.
>> > [davep] I'm confused.  ECC...BEST will cause EDC to be enabled.
>> > I used ECC as the generic term for ECC and EDC, since ECC seems more
>> > basic (EDC is built on top of ECC).
>> > If I understand you, we can't do what you want with the current setup.
>>
>> I'm saying we make ECC_BEST the default config (feel free to re-name
>> if ECC_DEFAULT).  Each asic can have a different default depending on
>> what features are ready.  So for CZ, we'd make ECC_BEST equivalent to
>> disabling ECC for now.  If a user wants to force it on, they can set
>> ECC_EDC.  Once EDC is stable on CZ, we can make ECC_BEST be equivalent
>> to ECC_EDC.  The way the default (ECC_BEST) always maps to the best
>> available combination in that version of the driver.
>>
>> E.g., in the current gfx8 code:
>>
>> if ((amdgpu_ecc_flags & ECC_EDC) && edc_fuse_enabled) {
>>     // enable EDC
>>     goto enable_edc;
>> } else {
>>     // disable EDC
>>     return 0;
>> }
>>
>> Then if we want to enable EDC by default, we'd change the code like so:
>>
>> if (((amdgpu_ecc_flags == ECC_EDC) || (amdgpu_ecc_flags & ECC_EDC)) &&
>> edc_fuse_enabled) {
>>     // enable EDC
>>     goto enable_edc;
>> } else {
>>     // disable EDC
>>     return 0;
>> }
>>
>> That way we can have different ECC features enabled by default for
>> each asic family without needing to specify command line options.
>>
>> Alex
>>
>> >
>> > >
>> > > >
>> > > >  MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in
>> > > > megabytes");  module_param_named(vramlimit, amdgpu_vram_limit,
>> > int,
>> > > > 0600); @@ -795,6 +796,9 @@ static struct pci_driver
>> > > > amdgpu_kms_pci_driver = {
>> > > >  	.driver.pm = &amdgpu_pm_ops,
>> > > >  };
>> > > >
>> > > > +MODULE_PARM_DESC(ecc_flags, "ECC/EDC enable flags (0 = disable
>> > > > ECC/EDC (default))");
>> > > > +module_param_named(ecc_flags, amdgpu_ecc_flags, uint, 0444);
>> > > > +
>> > > >
>> > > >
>> > > >  static int __init amdgpu_init(void) diff --git
>> > > > a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
>> > > > b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
>> > > > index 46e766e..3b5685c 100644
>> > > > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
>> > > > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
>> > > > @@ -1824,7 +1824,17 @@ static int
>> > > > gfx_v8_0_do_edc_gpr_workarounds(struct amdgpu_device *adev)
>> > > >  	if (dis_bit) {
>> > > >  		/* On Carrizo, EDC may be permanently disabled by a
>fuse. */
>> > > >  		DRM_INFO("CZ EDC hardware is disabled,
>GC_EDC_CONFIG:
>> > > > 0x%08x.\n",
>> > > > -			tmp);
>> > > > +			 tmp);
>> > > > +		return 0;
>> > > > +	}
>> > > > +
>> > > > +	/*
>> > > > +	 * Check if EDC has been requested by a kernel parameter.
>> > > > +	 * For Carrizo, EDC is the best/safest mode WRT error
>handling.
>> > > > +	 */
>> > > > +	if (!(amdgpu_ecc_flags
>> > > > +	      & (AMD_ECC_SUPPORT_BEST |
>AMD_ECC_SUPPORT_EDC))) {
>> > > > +		DRM_INFO("EDC support has not been
>requested.\n");
>> > >
>> > > Can drop this debugging statement or make it DRM_DEBUG.
>> > [davep] See top post.
>> > >
>> > > >  		return 0;
>> > > >  	}
>> > > >
>> > > > @@ -1962,6 +1972,20 @@ static int
>> > > > gfx_v8_0_do_edc_gpr_workarounds(struct amdgpu_device *adev)
>> > > >  		goto fail;
>> > > >  	}
>> > > >
>> > > > +	/* 00 - GB_EDC_DED_MODE_LOG: Count DED errors but do
>not halt
>> > > > */
>> > > > +	tmp = REG_SET_FIELD(tmp, GB_EDC_MODE, DED_MODE, 0);
>> > > > +	/* Do not propagate the errors to the next block. */
>> > > > +	tmp = REG_SET_FIELD(tmp, GB_EDC_MODE, PROP_FED, 0);
>> > > > +	WREG32(mmGB_EDC_MODE, tmp);
>> > > > +
>> > > > +	tmp = RREG32(mmCC_GC_EDC_CONFIG);
>> > > > +
>> > > > +	/*
>> > > > +	 * Clear EDC_DISABLE bit so the counters are available.
>> > > > +	 */
>> > > > +	tmp = REG_SET_FIELD(tmp, CC_GC_EDC_CONFIG, DIS_EDC,
>0);
>> > > > +	WREG32(mmCC_GC_EDC_CONFIG, tmp);
>> > > > +
>> > > >  	gfx_v8_0_edc_clear_counters(adev);
>> > > >
>> > > >  fail:
>> > > > @@ -5548,7 +5572,7 @@ static int gfx_v8_0_pre_soft_reset(void
>> > *handle)
>> > > >  		gfx_v8_0_cp_compute_enable(adev, false);
>> > > >  	}
>> > > >
>> > > > -       return 0;
>> > > > +	return 0;
>> > > >  }
>> > > >
>> > > >  static int gfx_v8_0_soft_reset(void *handle) diff --git
>> > > > a/drivers/gpu/drm/amd/include/amd_shared.h
>> > > > b/drivers/gpu/drm/amd/include/amd_shared.h
>> > > > index 0f58e95..1cf30aa 100644
>> > > > --- a/drivers/gpu/drm/amd/include/amd_shared.h
>> > > > +++ b/drivers/gpu/drm/amd/include/amd_shared.h
>> > > > @@ -187,6 +187,20 @@ enum amd_fan_ctrl_mode {
>> > > >  #define AMD_PG_SUPPORT_GFX_QUICK_MG		(1 << 11)
>> > > >  #define AMD_PG_SUPPORT_GFX_PIPELINE		(1 << 12)
>> > > >
>> > > > +/*
>> > > > + * ECC flags
>> > > > + * Allows the user to choose what kind of error
>> > > > +detection/correction is
>> > > > used.
>> > > > + * Currently, EDC is supported on Carrizo.
>> > > > + *
>> > > > + * The AMD_ECC_SUPPORT_BEST bit is used to allow a user to have
>> > > > + the
>> > > > driver
>> > > > + * set what it thinks is best/safest mode.  This may not be the
>> > > > +same as the
>> > > > + * default, depending on the GPU and the application.
>> > > > + * Using a single bit makes it easy to request the best support
>> > > > +without
>> > > > + * needing to know all currently supported modes.
>> > > > + */
>> > > > +#define AMD_ECC_SUPPORT_BEST			(1 << 0)
>> > > > +#define AMD_ECC_SUPPORT_EDC			(1 << 1)
>> > > > +
>> > > >  enum amd_pm_state_type {
>> > > >  	/* not used for dpm */
>> > > >  	POWER_STATE_TYPE_DEFAULT,
>> > > > --
>> > > > 2.7.4
>> > > >
>> > > > _______________________________________________
>> > > > amd-gfx mailing list
>> > > > amd-gfx at lists.freedesktop.org
>> > > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>_______________________________________________
>amd-gfx mailing list
>amd-gfx at lists.freedesktop.org
>https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>_______________________________________________
>amd-gfx mailing list
>amd-gfx at lists.freedesktop.org
>https://lists.freedesktop.org/mailman/listinfo/amd-gfx


More information about the amd-gfx mailing list