<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> <style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style> </head> <body dir="ltr"> <p style="font-family:Arial;font-size:10pt;color:#0000FF;margin:5pt;" align="Left"> [AMD Official Use Only]<br> </p> <br> <div> <div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)"> The patch looks good to me, but it would be better to add a comment in amdgpu_register_bad_pages_mca_notifier explaining why we need to keep our own GPU list instead of using the mgpu_info list. With this addressed, the patch is:</div> <div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)"> <br> </div> <div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)"> <span style="margin:0px;font-family:Calibri, sans-serif;background-color:white">Reviewed-by: Tao Zhou <</span><a href="mailto:tao.zhou1@amd.com" target="_blank" rel="noopener noreferrer" data-auth="NotApplicable" data-linkindex="0" style="margin:0px;font-family:Calibri, sans-serif;background-color:white">tao.zhou1@amd.com</a><span style="margin:0px;font-family:Calibri, sans-serif;background-color:white">></span><br> </div> <div> <div id="appendonsend"></div> <div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)"> <br> </div> <hr tabindex="-1" style="display:inline-block; width:98%"> <div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" color="#000000" style="font-size:11pt"><b>From:</b> Joshi, Mukul <Mukul.Joshi@amd.com><br> <b>Sent:</b> Tuesday, October 12, 2021 10:33 AM<br> <b>To:</b> amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org><br> <b>Cc:</b> Zhou1, Tao <Tao.Zhou1@amd.com>; Clements, John <John.Clements@amd.com>; Joshi, Mukul <Mukul.Joshi@amd.com><br> <b>Subject:</b> [PATCH 2/2] drm/amdgpu: Fix RAS page retirement with mode2 reset on Aldebaran</font> <div> </div> 
</div> <div class="BodyFragment"><font size="2"><span style="font-size:11pt"> <div class="PlainText">During mode2 reset, the GPU is temporarily removed from the<br> mgpu_info list. As a result, page retirement fails because it<br> cannot find the GPU in the GPU list.<br> To fix this, create our own list of GPUs that support MCE notifier<br> based page retirement and use that list to check if the UMC error<br> occurred on a GPU that supports MCE notifier based page retirement.<br> <br> Signed-off-by: Mukul Joshi <mukul.joshi@amd.com><br> ---<br> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 24 ++++++++++++------------<br> 1 file changed, 12 insertions(+), 12 deletions(-)<br> <br> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c<br> index e8875351967e..e8d88c77eb46 100644<br> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c<br> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c<br> @@ -112,7 +112,12 @@ static bool amdgpu_ras_check_bad_page_unlock(struct amdgpu_ras *con,<br> static bool amdgpu_ras_check_bad_page(struct amdgpu_device *adev,<br> uint64_t addr);<br> #ifdef CONFIG_X86_MCE_AMD<br> -static void amdgpu_register_bad_pages_mca_notifier(void);<br> +static void amdgpu_register_bad_pages_mca_notifier(struct amdgpu_device *adev);<br> +struct mce_notifier_adev_list {<br> + struct amdgpu_device *devs[MAX_GPU_INSTANCE];<br> + int num_gpu;<br> +};<br> +static struct mce_notifier_adev_list mce_adev_list;<br> #endif<br> <br> void amdgpu_ras_set_error_query_ready(struct amdgpu_device *adev, bool ready)<br> @@ -2108,7 +2113,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)<br> #ifdef CONFIG_X86_MCE_AMD<br> if ((adev->asic_type == CHIP_ALDEBARAN) &&<br> (adev->gmc.xgmi.connected_to_cpu))<br> - amdgpu_register_bad_pages_mca_notifier();<br> + amdgpu_register_bad_pages_mca_notifier(adev);<br> #endif<br> return 0;<br> <br> @@ -2605,24 +2610,18 @@ void amdgpu_release_ras_context(struct amdgpu_device *adev)<br> #ifdef 
CONFIG_X86_MCE_AMD<br> static struct amdgpu_device *find_adev(uint32_t node_id)<br> {<br> - struct amdgpu_gpu_instance *gpu_instance;<br> int i;<br> struct amdgpu_device *adev = NULL;<br> <br> - mutex_lock(&mgpu_info.mutex);<br> -<br> - for (i = 0; i < mgpu_info.num_gpu; i++) {<br> - gpu_instance = &(mgpu_info.gpu_ins[i]);<br> - adev = gpu_instance->adev;<br> + for (i = 0; i < mce_adev_list.num_gpu; i++) {<br> + adev = mce_adev_list.devs[i];<br> <br> - if (adev->gmc.xgmi.connected_to_cpu &&<br> + if (adev && adev->gmc.xgmi.connected_to_cpu &&<br> adev->gmc.xgmi.physical_node_id == node_id)<br> break;<br> adev = NULL;<br> }<br> <br> - mutex_unlock(&mgpu_info.mutex);<br> -<br> return adev;<br> }<br> <br> @@ -2718,8 +2717,9 @@ static struct notifier_block amdgpu_bad_page_nb = {<br> .priority = MCE_PRIO_UC,<br> };<br> <br> -static void amdgpu_register_bad_pages_mca_notifier(void)<br> +static void amdgpu_register_bad_pages_mca_notifier(struct amdgpu_device *adev)<br> {<br> + mce_adev_list.devs[mce_adev_list.num_gpu++] = adev;<br> /*<br> * Register the x86 notifier only once<br> * with MCE subsystem.<br> -- <br> 2.33.0<br> <br> </div> </span></font></div> </div> </div> </body> </html>