[PATCHv2 2/2] drm/amdgpu: Register MCE notifier for Aldebaran RAS

Wed Sep 22 19:43:58 UTC 2021

[AMD Official Use Only]

> -----Original Message-----
> From: Borislav Petkov <bp at alien8.de>
> Sent: Wednesday, September 22, 2021 7:41 AM
> To: Joshi, Mukul <Mukul.Joshi at amd.com>
> Cc: linux-edac at vger.kernel.org; x86 at kernel.org; linux-kernel at vger.kernel.org;
> mingo at redhat.com; mchehab at kernel.org; Ghannam, Yazen
> <Yazen.Ghannam at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCHv2 2/2] drm/amdgpu: Register MCE notifier for Aldebaran
> RAS
> 
> [CAUTION: External Email]
> 
> On Sun, Sep 12, 2021 at 10:13:11PM -0400, Mukul Joshi wrote:
> > On Aldebaran, GPU driver will handle bad page retirement even though
> > UMC is host managed. As a result, register a bad page retirement
> > handler on the mce notifier chain to retire bad pages on Aldebaran.
> >
> > v1->v2:
> > - Use smca_get_bank_type() to determine MCA bank.
> > - Envelope the changes under #ifdef CONFIG_X86_MCE_AMD.
> > - Use MCE_PRIORITY_UC instead of MCE_PRIO_ACCEL as we are
> >   only handling uncorrectable errors.
> > - Use macros to determine UMC instance and channel instance
> >   where the uncorrectable error occured.
> > - Update the headline.
> 
> Same note as for the previous patch.
> 
Acked.

> > +static int amdgpu_bad_page_notifier(struct notifier_block *nb,
> > +                                 unsigned long val, void *data) {
> > +     struct mce *m = (struct mce *)data;
> > +     struct amdgpu_device *adev = NULL;
> > +     uint32_t gpu_id = 0;
> > +     uint32_t umc_inst = 0;
> > +     uint32_t ch_inst, channel_index = 0;
> > +     struct ras_err_data err_data = {0, 0, 0, NULL};
> > +     struct eeprom_table_record err_rec;
> > +     uint64_t retired_page;
> > +
> > +     /*
> > +      * If the error was generated in UMC_V2, which belongs to GPU UMCs,
> > +      * and error occurred in DramECC (Extended error code = 0) then only
> > +      * process the error, else bail out.
> > +      */
> > +     if (!m || !((smca_get_bank_type(m->bank) == SMCA_UMC_V2) &&
> > +                 (XEC(m->status, 0x1f) == 0x0)))
> > +             return NOTIFY_DONE;
> > +
> > +     /*
> > +      * GPU Id is offset by GPU_ID_OFFSET in MCA_IPID_UMC register.
> > +      */
> > +     gpu_id = GET_MCA_IPID_GPUID(m->ipid) - GPU_ID_OFFSET;
> > +
> > +     adev = find_adev(gpu_id);
> > +     if (!adev) {
> > +             dev_warn(adev->dev, "%s: Unable to find adev for gpu_id: %d\n",
> > +                                  __func__, gpu_id);
> > +             return NOTIFY_DONE;
> > +     }
> > +
> > +     /*
> > +      * If it is correctable error, return.
> > +      */
> > +     if (mce_is_correctable(m)) {
> > +             return NOTIFY_OK;
> > +     }
> 
> This can run before you find_adev().
> 
Acked.

> > +static void amdgpu_register_bad_pages_mca_notifier(void)
> > +{
> > +     /*
> > +      * Register the x86 notifier only once
> > +      * with MCE subsystem.
> > +      */
> > +     if (notifier_registered == false) {
> > +             mce_register_decode_chain(&amdgpu_bad_page_nb);
> > +             notifier_registered = true;
> > +     }
> 
> I have a patchset which will get rid of the need to do this silliness - only if I had
> some time to actually prepare it for submission... :-\
> 
:-\

Thank you.

Regards,
Mukul
> --
> Regards/Gruss,
>     Boris.
> 
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.
> kernel.org%2Ftglx%2Fnotes-about-
> netiquette&data=04%7C01%7Cmukul.joshi%40amd.com%7C7ae9c87153f7
> 4572712908d97dbdcc7a%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0
> %7C637679076423976867%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAw
> MDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata
> =M5UDrSea0lnEi5%2BhU4ck0zk1dZD9kX4DUoXt95J6dJ4%3D&reserved=0