[PATCH] drm/amdgpu: Add kernel parameter to manage memory error handling.

Panariti, David David.Panariti at amd.com
Thu Apr 13 15:38:43 UTC 2017


+ Vilas

> -----Original Message-----
> From: Deucher, Alexander
> Sent: Wednesday, April 12, 2017 9:29 PM
> To: 'Michel Dänzer' <michel at daenzer.net>; Panariti, David
> <David.Panariti at amd.com>
> Cc: amd-gfx at lists.freedesktop.org
> Subject: RE: [PATCH] drm/amdgpu: Add kernel parameter to manage
> memory error handling.
> 
> > -----Original Message-----
> > From: Michel Dänzer [mailto:michel at daenzer.net]
> > Sent: Wednesday, April 12, 2017 9:17 PM
> > To: Panariti, David
> > Cc: Deucher, Alexander; amd-gfx at lists.freedesktop.org
> > Subject: Re: [PATCH] drm/amdgpu: Add kernel parameter to manage
> memory
> > error handling.
> >
> > On 13/04/17 02:38 AM, Panariti, David wrote:
> > >> From: Michel Dänzer [mailto:michel at daenzer.net]
> > >>
> > >>> @@ -212,6 +213,9 @@ module_param_named(cg_mask,
> > >> amdgpu_cg_mask, uint,
> > >>> 0444);  MODULE_PARM_DESC(pg_mask, "Powergating flags mask (0 =
> > >> disable
> > >>> power gating)");  module_param_named(pg_mask,
> amdgpu_pg_mask,
> > >> uint,
> > >>> 0444);
> > >>>
> > >>> +MODULE_PARM_DESC(ecc_mask, "ECC/EDC flags mask (0 = disable
> > >>> +ECC/EDC)");
> > >>
> > >> "0 = disable ECC/EDC" implies that they're enabled by default? Was
> > >> that already the case before this patch?
> > >
> > > [davep] Yes it was, and there was actually a problem in some cases
> > > where the CZ would hang which is why I added the param. I was
> > > wondering if it would be better to default to them being off, but I
> > > wasn't sure how important maintaining original behavior is
> > > considered. Actually, there are some bugs in the workaround function
> > > as it is, so it really should default to off.
> >
> > I agree. There have been some bug reports about Carrizo hangs, I
> > wonder if any of those might be related to this.
> 
> Only the embedded SKUs support EDC.  If they are embedded parts, it could
> be related.

[davep]  Sorry for the length, but I wanted all of the details out there for the most informed decision.

Another thing is that they can go from not hanging to hanging for no discernable reason.  
The KIQ changes, however, have seemed to have fixed it.  For one chip and a few tens of reboots.

There is also the issue of improperly initialized *gpr registers.
 From the doc:
"Due to a hardware condition whereby some shader instructions utilize uninitialized SGPRs and/or VGPRs, the S/VPGR memories must be initialized prior to EDC operation, as, not doing so will cause erroneous counts to show up in the EDC counters."
I seem to recall Vilas saying it is the poison that isn't reset properly.  But I'm not sure about the actual register contents.
Vilas?

I suggest, at a minimum,  checking for cz *and* the EDC fuse.  If so, explicitly disable EDC, run the shaders to zero the *gprs, leave EDC disabled, and merge in the existing new code to zero all of the counters.
All of the code exists, in one place or another.

The parameter to enable/disable probably won't be needed until EDC is fully implemented.  
However, EDC can be enabled in a way that simply allows it to count errors.  This has never caused a hang for me.
The counts are useful for reliability research.  This was one of the goals of the original EDC task.
A umr script could be written (by interested parties) to read the counters and in fact enable the EDC counters.
I think this should be done if anyone is interested in the numbers.  
Vilas?  Any R&R work left in this area?  Do you think customers would be interested in doing this on their own?

davep
> 
> Alex



More information about the amd-gfx mailing list