[PATCH] drm/amdgpu: Add kernel parameter to manage memory error handling.

Thu Apr 13 16:38:10 UTC 2017

On 13/04/17 11:38 AM, Panariti, David wrote:
> + Vilas
> 
>> -----Original Message-----
>> From: Deucher, Alexander
>> Sent: Wednesday, April 12, 2017 9:29 PM
>> To: 'Michel Dänzer' <michel at daenzer.net>; Panariti, David
>> <David.Panariti at amd.com>
>> Cc: amd-gfx at lists.freedesktop.org
>> Subject: RE: [PATCH] drm/amdgpu: Add kernel parameter to manage
>> memory error handling.
>>
>>> -----Original Message-----
>>> From: Michel Dänzer [mailto:michel at daenzer.net]
>>> Sent: Wednesday, April 12, 2017 9:17 PM
>>> To: Panariti, David
>>> Cc: Deucher, Alexander; amd-gfx at lists.freedesktop.org
>>> Subject: Re: [PATCH] drm/amdgpu: Add kernel parameter to manage
>> memory
>>> error handling.
>>>
>>> On 13/04/17 02:38 AM, Panariti, David wrote:
>>>>> From: Michel Dänzer [mailto:michel at daenzer.net]
>>>>>
>>>>>> @@ -212,6 +213,9 @@ module_param_named(cg_mask,
>>>>> amdgpu_cg_mask, uint,
>>>>>> 0444);  MODULE_PARM_DESC(pg_mask, "Powergating flags mask (0 =
>>>>> disable
>>>>>> power gating)");  module_param_named(pg_mask,
>> amdgpu_pg_mask,
>>>>> uint,
>>>>>> 0444);
>>>>>>
>>>>>> +MODULE_PARM_DESC(ecc_mask, "ECC/EDC flags mask (0 = disable
>>>>>> +ECC/EDC)");
>>>>>
>>>>> "0 = disable ECC/EDC" implies that they're enabled by default? Was
>>>>> that already the case before this patch?
>>>>
>>>> [davep] Yes it was, and there was actually a problem in some cases
>>>> where the CZ would hang which is why I added the param. I was
>>>> wondering if it would be better to default to them being off, but I
>>>> wasn't sure how important maintaining original behavior is
>>>> considered. Actually, there are some bugs in the workaround function
>>>> as it is, so it really should default to off.
>>>
>>> I agree. There have been some bug reports about Carrizo hangs, I
>>> wonder if any of those might be related to this.
>>
>> Only the embedded SKUs support EDC.  If they are embedded parts, it could
>> be related.
> 
> [davep]  Sorry for the length, but I wanted all of the details out there for the most informed decision.
> 
> Another thing is that they can go from not hanging to hanging for no discernable reason.
> The KIQ changes, however, have seemed to have fixed it.  For one chip and a few tens of reboots.
> 
> There is also the issue of improperly initialized *gpr registers.
>   From the doc:
> "Due to a hardware condition whereby some shader instructions utilize uninitialized SGPRs and/or VGPRs, the S/VPGR memories must be initialized prior to EDC operation, as, not doing so will cause erroneous counts to show up in the EDC counters."
> I seem to recall Vilas saying it is the poison that isn't reset properly.  But I'm not sure about the actual register contents.
> Vilas?
> 
> I suggest, at a minimum,  checking for cz *and* the EDC fuse.  If so, explicitly disable EDC, run the shaders to zero the *gprs, leave EDC disabled, and merge in the existing new code to zero all of the counters.
> All of the code exists, in one place or another.
> 
> The parameter to enable/disable probably won't be needed until EDC is fully implemented.
> However, EDC can be enabled in a way that simply allows it to count errors.  This has never caused a hang for me.
> The counts are useful for reliability research.  This was one of the goals of the original EDC task.
> A umr script could be written (by interested parties) to read the counters and in fact enable the EDC counters.
> I think this should be done if anyone is interested in the numbers.
> Vilas?  Any R&R work left in this area?  Do you think customers would be interested in doing this on their own?

For the special keeners we could add EDC counters to umr's --top and 
then in theory it'll be included in the log output.

If you can send me info on how to enable/read the counters I can take a 
look at it.

Cheers,
Tom