[PATCH] drm/amdgpu: Register bad page handler for Aldebaran

Thu May 13 23:10:34 UTC 2021

> -----Original Message-----
> From: Borislav Petkov <bp at alien8.de>
> Sent: Thursday, May 13, 2021 5:53 AM
> To: Joshi, Mukul <Mukul.Joshi at amd.com>
> Cc: amd-gfx at lists.freedesktop.org; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan at amd.com>; x86-ml <x86 at kernel.org>;
> lkml <linux-kernel at vger.kernel.org>
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
> 
> On Thu, May 13, 2021 at 03:20:36AM +0000, Joshi, Mukul wrote:
> > Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is
> > defined.
> > I would need to put #ifdef CONFIG_X86_MCE_AMD in my code to compile
> > the amdgpu driver when CONFIG_X86_MCE_AMD is not defined.
> > I can avoid all that by using is_smca_umc_v2().
> > I think it would be cleaner with using is_smca_umc_v2().
> 
> See how smca_get_long_name() is exported and export that function the same
> way.
> 

That's probably not the best example to look at.
smca_get_long_name() is used in drivers/edac/mce_amd.c, and that file is not
compiled when CONFIG_X86_MCE_AMD is not defined.

The amdgpu driver, on the other hand, has no dependency on CONFIG_X86_MCE_AMD.

So here is one option that we can try:
1. Export smca_get_bank_type().
2. Wrap all of my code in the GPU driver in #ifdef CONFIG_X86_MCE_AMD.
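
A minimal sketch of what the second step might look like, written as plain
userspace C rather than actual kernel code. The function name
amdgpu_bad_page_handler_register() is hypothetical and not taken from the
patch, and the stub in the #else branch is an assumption on my part: it is one
common way to let callers compile unchanged when the config option is off.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Userspace sketch only. In a kernel build CONFIG_X86_MCE_AMD would come
 * from Kconfig; here it stays undefined unless passed with -D.
 */

#ifdef CONFIG_X86_MCE_AMD
/*
 * Hypothetical handler registration. The real code would register an MCE
 * notifier and use the exported smca_get_bank_type() to pick out UMC errors.
 */
static bool amdgpu_bad_page_handler_register(void)
{
	puts("bad page handler registered");
	return true;
}
#else
/* Stub so that callers build without any #ifdef of their own. */
static inline bool amdgpu_bad_page_handler_register(void)
{
	return false;
}
#endif
```

With the option undefined, the call compiles to the stub and reports that no
handler was registered.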

Will that work for you?

Thanks,
Mukul

> To save you some energy: is_smca_umc_v2() is not going to happen.

> 
> > You can think of the GPU device as an EDAC device here. It is mainly
> > interested in handling uncorrectable errors.
> 
> An EDAC "device", as you call it, is not interested in handling UEs. If anything, it
> counts them.
> 
> > It is a deferred interrupt that generates an MCE.
> 
> Is that the same deferred interrupt which calls amd_deferred_error_interrupt() ?
> 
> > When an uncorrectable error is detected on the GPU UMC, all we are
> > doing is determining the physical address where the error occurred and
> > then "retiring" the page that address belongs to.
> 
> What page is that? Normal DRAM page or a page in some special GPU memory?
> 
> > By retiring, we mean we reserve the page so that it is not available
> > for allocations to any applications.
> 
> We do that for normal DRAM memory pages by poisoning them. I hope you
> don't mean that.
> 
> Looking at
> 
> amdgpu_ras_add_bad_pages
> |-> amdgpu_vram_mgr_reserve_range
> 
> that's some VRAM thing so I'm guessing special memory on the GPU.
> 
> If so, what happens with all those "retired" pages when you reboot?
> They're getting used again and potentially trigger the same UEs and the same
> retiring happens?
> 
> > We are providing information to the user by storing all the
> > information about the retired pages in EEPROM. This can be accessed
> > through sysfs.
> 
> Ok, I'm a user and I can access that information through sysfs. What can I do
> with it?
> 
> > Hope it clears what "bad page retirement" is achieving.
> 
> It is getting there.
> 
> Thx.
> 
> --
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette