[PATCH] drm/amdgpu: Register bad page handler for Aldebaran

Thu May 13 03:20:36 UTC 2021

[AMD Official Use Only - Internal Distribution Only]

> -----Original Message-----
> From: Borislav Petkov <bp at alien8.de>
> Sent: Wednesday, May 12, 2021 5:06 PM
> To: Joshi, Mukul <Mukul.Joshi at amd.com>
> Cc: amd-gfx at lists.freedesktop.org; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan at amd.com>; x86-ml <x86 at kernel.org>;
> lkml <linux-kernel at vger.kernel.org>
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
> 
> On Wed, May 12, 2021 at 07:00:58PM +0000, Joshi, Mukul wrote:
> > SMCA UMCv2 corresponds to GPU's UMC MCA bank and the GPU driver is
> > only interested in errors on GPU UMC.
> 
> So that thing should be called SMCA_GPU_UMC not SMCA_UMC_V2.
> 
> > We cannot know this without is_smca_umc_v2.
> 
> You don't need it - just export smca_get_bank_type() and test the bank type at
> the call site.

Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is defined.
When it is not defined, I would need to wrap the call in #ifdef CONFIG_X86_MCE_AMD
for the amdgpu driver to compile.
I can avoid all that by using is_smca_umc_v2(), which I think would be cleaner.
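
As a rough illustration of the trade-off described above, the sketch below
shows the usual header-stub pattern: the helper compiles to a constant "false"
when the MCE support is configured out, so the caller needs no #ifdef. All
names here (SKETCH_HAVE_MCE_AMD, struct mce_sketch, SKETCH_GPU_UMC_BANK and the
*_sketch functions) are hypothetical stand-ins, not the actual kernel or amdgpu
code, and the bank check is a simplification of a real bank-type lookup.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's MCE record; only the bank is modeled. */
struct mce_sketch {
    uint8_t bank;
};

/*
 * SKETCH_HAVE_MCE_AMD is an assumed macro that plays the role of
 * CONFIG_X86_MCE_AMD. It is left undefined here, so the stub is used.
 */
#ifdef SKETCH_HAVE_MCE_AMD
/* Assumed bank number for the GPU UMC, for illustration only. */
#define SKETCH_GPU_UMC_BANK 4

static inline bool is_smca_umc_v2_sketch(const struct mce_sketch *m)
{
    return m->bank == SKETCH_GPU_UMC_BANK;
}
#else
/* Stub: without MCE support, no error can belong to the GPU UMC. */
static inline bool is_smca_umc_v2_sketch(const struct mce_sketch *m)
{
    (void)m;
    return false;
}
#endif

/*
 * Driver-side caller: identical source in both configurations.
 * Returns 1 if the error was claimed by the GPU driver, 0 otherwise.
 */
static int gpu_handle_mce_sketch(const struct mce_sketch *m)
{
    if (!is_smca_umc_v2_sketch(m))
        return 0; /* not a GPU UMC error, let other handlers see it */

    /* ... look up the faulting address and retire the page ... */
    return 1;
}
```

With the stub in place, the driver builds unchanged whether or not the MCE
support is configured in; the cost is one extra exported helper.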

> 
> > Maybe. I hope it's not too much of a concern if it stays the way it is.
> 
> That was just a suggestion anyway - it is not code I maintain so not my call.
> 
> > I wasn't really sure if I should use the EDAC priority here or create a new one
> > for Accelerator devices.
> > I thought using EDAC priority might not be accepted by the maintainers
> > as EDAC and GPU (Accelerator) devices are two different classes of devices.
> > That is the reason I created a new one.
> > I am OK to use EDAC priority if that is acceptable.
> 
> I don't know what's acceptable because I still am unclear as to what that thing is
> supposed to do. It seems you are interested only in uncorrectable errors.
> 

You can think of the GPU device as an EDAC device here. It is mainly interested in
handling uncorrectable errors.

> How are those errors reported? #MC exception, deferred interrupt, simply
> logged in the bank and we find them by polling?
> 

It is a deferred interrupt that generates an MCE.

> Then, the commit message is talking about some "bad page retirement".
> What does that do? What can the user do when she sees the
> 
> "Uncorrectable error detected in UMC..."
> 
> message?
> 
> It depends on what "retiring" of GPU pages means...
> 
> In any case, dmesg should issue a human-understandable message about the
> recovery action being done and what that means for the user: should she
> replace the GPU, should she ignore, etc, etc.
>

When an uncorrectable error is detected on the GPU UMC, all we do is determine
the physical address where the error occurred and then "retire" the page that
address belongs to.
By retiring, we mean reserving the page so that it is no longer available for
allocation by any application.
We inform the user by storing the information about all retired pages in
EEPROM, which can be accessed through sysfs.
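
The retirement step described above can be sketched roughly as follows. This
is a minimal model only, assuming 4 KiB pages and a small in-memory table that
stands in for the EEPROM record; struct bad_page_table, retire_page() and the
SKETCH_* constants are hypothetical names, not the amdgpu implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT    12 /* assumed 4 KiB pages */
#define SKETCH_MAX_BAD_PAGES 8  /* arbitrary capacity for this sketch */

/* In-memory stand-in for the persistent (EEPROM) bad page record. */
struct bad_page_table {
    uint64_t pfn[SKETCH_MAX_BAD_PAGES];
    size_t count;
};

/* True if the page frame has already been retired. */
static bool page_is_retired(const struct bad_page_table *t, uint64_t pfn)
{
    for (size_t i = 0; i < t->count; i++) {
        if (t->pfn[i] == pfn)
            return true;
    }
    return false;
}

/*
 * Retire the page containing err_addr.
 * Returns 0 if newly retired, 1 if it was already retired,
 * and -1 if the table is full.
 */
static int retire_page(struct bad_page_table *t, uint64_t err_addr)
{
    uint64_t pfn = err_addr >> SKETCH_PAGE_SHIFT;

    if (page_is_retired(t, pfn))
        return 1;
    if (t->count == SKETCH_MAX_BAD_PAGES)
        return -1;

    t->pfn[t->count++] = pfn;
    return 0;
}

/* An allocator would consult the table and skip retired pages. */
static bool page_is_allocatable(const struct bad_page_table *t, uint64_t pfn)
{
    return !page_is_retired(t, pfn);
}
```

For example, after retire_page(&t, 0x12345678), page frame 0x12345 is no
longer allocatable, and a second error anywhere in the same page is reported
as already retired rather than recorded twice.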

Hope this clarifies what "bad page retirement" achieves.

Thanks,
Mukul

> > A system can have multiple GPUs and we only want a single notifier
> > registered. I will change the comment to explicitly state this.
> 
> Actually, the notifier registration should be able to return a different retval to
> state that a callback has already been registered but that warns only currently so
> I'm guessing we're stuck with such ugly "workarounds" for their shortcomings.
> 
> I'm gonna take a look whether they can be fixed though so that you don't have
> to do this notifier_registered thing.
> 
> --
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette