<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
        {font-family:SimSun;
        panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:DengXian;
        panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:"\@DengXian";
        panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
        {font-family:"\@SimSun";
        panose-1:2 1 6 0 3 1 1 1 1 1;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        font-size:10.0pt;
        font-family:"Calibri",sans-serif;}
span.EmailStyle19
        {mso-style-type:personal-reply;
        font-family:"Calibri",sans-serif;
        color:windowtext;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-size:10.0pt;}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
--></style>
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div>
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt">+static void amdgpu_device_gpu_reset(struct amdgpu_device *adev)<br>
+{<br>
+       struct amdgpu_reset_context reset_context;<br>
+<br>
+       memset(&reset_context, 0, sizeof(reset_context));<br>
+       reset_context.method = AMD_RESET_METHOD_NONE;<br>
+       reset_context.reset_req_dev = adev;<br>
+       set_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);<br>
+       set_bit(AMDGPU_RESET_FOR_DEVICE_REMOVE, &reset_context.flags);<br>
+<br>
+       amdgpu_device_gpu_recover(adev, NULL, &reset_context);<br>
+}<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">This wrapper is confusing because it adds a second entry point for reset. Let's keep amdgpu_device_gpu_recover as the only entry point for recovery handling. If possible, please drop this wrapper, initialize reset_context in the caller, and call amdgpu_device_gpu_recover directly.<o:p></o:p></span></p>
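<p class="MsoNormal"><span style="font-size:11.0pt"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">As a rough, untested sketch of that suggestion, the setup could move into amdgpu_pci_remove at the point where the wrapper is called today. It only reuses the fields and flags this patch already introduces; the exact placement is an assumption on my side:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">/* sketch only: body of amdgpu_device_gpu_reset inlined into amdgpu_pci_remove */<br>
if (need_to_reset_gpu) {<br>
        struct amdgpu_reset_context reset_context;<br>
<br>
        memset(&reset_context, 0, sizeof(reset_context));<br>
        reset_context.method = AMD_RESET_METHOD_NONE;<br>
        reset_context.reset_req_dev = adev;<br>
        set_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);<br>
        set_bit(AMDGPU_RESET_FOR_DEVICE_REMOVE, &reset_context.flags);<br>
<br>
        /* single entry point for recovery handling */<br>
        amdgpu_device_gpu_recover(adev, NULL, &reset_context);<br>
}<o:p></o:p></span></p>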
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">+               /* If in_remove is true, psp_hw_fini should be executed after<br>
+                *  psp_suspend to free psp shared buffers.<br>
+                */<br>
+               if (adev->in_remove && (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_PSP))<br>
+                       continue;<br>
<br>
</span><span style="font-size:11.0pt;font-family:SimSun"><o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Can you please share more details to help me understand the sequence adjustment here?
<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Regards,<br>
Hawking</span><span style="font-size:11.0pt"><o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal" style="margin-bottom:12.0pt"><b><span style="font-size:12.0pt;color:black">From:
</span></b><span style="font-size:12.0pt;color:black">Chai, Thomas <YiPeng.Chai@amd.com><br>
<b>Date: </b>Tuesday, September 6, 2022 at 15:48<br>
<b>To: </b>amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org><br>
<b>Cc: </b>Chai, Thomas <YiPeng.Chai@amd.com>, Zhang, Hawking <Hawking.Zhang@amd.com>, Zhou1, Tao <Tao.Zhou1@amd.com>, Wang, Yang(Kevin) <KevinYang.Wang@amd.com>, Chai, Thomas <YiPeng.Chai@amd.com><br>
<b>Subject: </b>[PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal" style="margin-bottom:12.0pt"><span style="font-size:11.0pt">Adjust removal control flow for smu v13_0_2:<br>
   During amdgpu uninstallation, when removing the first<br>
device, the kernel needs to first send a mode1reset message<br>
to all gpu devices. Otherwise, smu initialization will fail<br>
the next time amdgpu is installed.<br>
<br>
V2:<br>
1. Update commit comments.<br>
2. Remove the global variable amdgpu_device_remove_cnt<br>
   and add a variable to the structure amdgpu_hive_info.<br>
3. Use hive to detect the first removed device instead of<br>
   a global variable.<br>
<br>
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com><br>
---<br>
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  3 ++<br>
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 +++++++++++++++++++++-<br>
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 35 +++++++++++++++++++<br>
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    | 16 ++++++++-<br>
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  1 +<br>
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  1 +<br>
 drivers/gpu/drm/amd/pm/amdgpu_pm.c         |  6 +++-<br>
 7 files changed, 99 insertions(+), 3 deletions(-)<br>
<br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h<br>
index 79bb6fd83094..465295318830 100644<br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h<br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h<br>
@@ -997,6 +997,9 @@ struct amdgpu_device {<br>
         bool                            in_s4;<br>
         bool                            in_s0ix;<br>
 <br>
+       /* true while the device is being removed */<br>
+       bool                            in_remove;<br>
+<br>
         enum pp_mp1_state               mp1_state;<br>
         struct amdgpu_doorbell_index doorbell_index;<br>
 <br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
index 62b26f0e37b0..1402717673f7 100644<br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
@@ -2999,6 +2999,13 @@ static int amdgpu_device_ip_suspend_phase2(struct amdgpu_device *adev)<br>
                         DRM_ERROR("suspend of IP block <%s> failed %d\n",<br>
                                   adev->ip_blocks[i].version->funcs->name, r);<br>
                 }<br>
+<br>
+               /* If in_remove is true, psp_hw_fini should be executed after<br>
+                *  psp_suspend to free psp shared buffers.<br>
+                */<br>
+               if (adev->in_remove && (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_PSP))<br>
+                       continue;<br>
+<br>
                 adev->ip_blocks[i].status.hw = false;<br>
                 /* handle putting the SMC in the appropriate state */<br>
                 if(!amdgpu_sriov_vf(adev)){<br>
@@ -4739,6 +4746,7 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,<br>
         struct amdgpu_device *tmp_adev = NULL;<br>
         bool need_full_reset, skip_hw_reset, vram_lost = false;<br>
         int r = 0;<br>
+       bool gpu_reset_for_dev_remove = false;<br>
 <br>
         /* Try reset handler method first */<br>
         tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,<br>
@@ -4758,6 +4766,10 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,<br>
                 test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);<br>
         skip_hw_reset = test_bit(AMDGPU_SKIP_HW_RESET, &reset_context->flags);<br>
 <br>
+       gpu_reset_for_dev_remove =<br>
+               test_bit(AMDGPU_RESET_FOR_DEVICE_REMOVE, &reset_context->flags) &&<br>
+                       test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);<br>
+<br>
         /*<br>
          * ASIC reset has to be done on all XGMI hive nodes ASAP<br>
          * to allow proper links negotiation in FW (within 1 sec)<br>
@@ -4802,6 +4814,16 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,<br>
                 amdgpu_ras_intr_cleared();<br>
         }<br>
 <br>
+       /* Avoid BIOS signature errors and a psp bootloader failure to load<br>
+        * kdb on the next amdgpu install.<br>
+        */<br>
+       if (gpu_reset_for_dev_remove) {<br>
+               list_for_each_entry(tmp_adev, device_list_handle, reset_list)<br>
+                       amdgpu_device_ip_resume_phase1(tmp_adev);<br>
+<br>
+               goto end;<br>
+       }<br>
+<br>
         list_for_each_entry(tmp_adev, device_list_handle, reset_list) {<br>
                 if (need_full_reset) {<br>
                         /* post card */<br>
@@ -5124,6 +5146,11 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,<br>
         bool need_emergency_restart = false;<br>
         bool audio_suspended = false;<br>
         int tmp_vram_lost_counter;<br>
+       bool gpu_reset_for_dev_remove = false;<br>
+<br>
+       gpu_reset_for_dev_remove =<br>
+                       test_bit(AMDGPU_RESET_FOR_DEVICE_REMOVE, &reset_context->flags) &&<br>
+                               test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);<br>
 <br>
         /*<br>
          * Special case: RAS triggered and full reset isn't supported<br>
@@ -5159,8 +5186,11 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,<br>
          */<br>
         INIT_LIST_HEAD(&device_list);<br>
         if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1)) {<br>
-               list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head)<br>
+               list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {<br>
                         list_add_tail(&tmp_adev->reset_list, &device_list);<br>
+                       if (adev->in_remove)<br>
+                               tmp_adev->in_remove = true;<br>
+               }<br>
                 if (!list_is_first(&adev->reset_list, &device_list))<br>
                         list_rotate_to_front(&adev->reset_list, &device_list);<br>
                 device_list_handle = &device_list;<br>
@@ -5243,6 +5273,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,<br>
 <br>
 retry:  /* Rest of adevs pre asic reset from XGMI hive. */<br>
         list_for_each_entry(tmp_adev, device_list_handle, reset_list) {<br>
+               if (gpu_reset_for_dev_remove) {<br>
+                       /* Workaround for ASICs that need the SMC disabled first */<br>
+                       amdgpu_device_smu_fini_early(tmp_adev);<br>
+               }<br>
                 r = amdgpu_device_pre_asic_reset(tmp_adev, reset_context);<br>
                 /*TODO Should we stop ?*/<br>
                 if (r) {<br>
@@ -5276,6 +5310,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,<br>
                         adev->asic_reset_res = 0;<br>
                         goto retry;<br>
                 }<br>
+<br>
+               if (!r && gpu_reset_for_dev_remove)<br>
+                       goto recover_end;<br>
         }<br>
 <br>
 skip_hw_reset:<br>
@@ -5349,6 +5386,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,<br>
                 amdgpu_device_unset_mp1_state(tmp_adev);<br>
         }<br>
 <br>
+recover_end:<br>
         tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,<br>
                                             reset_list);<br>
         amdgpu_device_unlock_reset_domain(tmp_adev->reset_domain);<br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c<br>
index 728a0933ea6f..9271f219d8fa 100644<br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c<br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c<br>
@@ -2175,6 +2175,19 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,<br>
         return ret;<br>
 }<br>
 <br>
+static void amdgpu_device_gpu_reset(struct amdgpu_device *adev)<br>
+{<br>
+       struct amdgpu_reset_context reset_context;<br>
+<br>
+       memset(&reset_context, 0, sizeof(reset_context));<br>
+       reset_context.method = AMD_RESET_METHOD_NONE;<br>
+       reset_context.reset_req_dev = adev;<br>
+       set_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);<br>
+       set_bit(AMDGPU_RESET_FOR_DEVICE_REMOVE, &reset_context.flags);<br>
+<br>
+       amdgpu_device_gpu_recover(adev, NULL, &reset_context);<br>
+}<br>
+<br>
 static void<br>
 amdgpu_pci_remove(struct pci_dev *pdev)<br>
 {<br>
@@ -2186,6 +2199,28 @@ amdgpu_pci_remove(struct pci_dev *pdev)<br>
                 pm_runtime_forbid(dev->dev);<br>
         }<br>
 <br>
+       if (adev->asic_type == CHIP_ALDEBARAN) {<br>
+               bool need_to_reset_gpu = false;<br>
+<br>
+               adev->in_remove = true;<br>
+               if (adev->gmc.xgmi.num_physical_nodes > 1) {<br>
+                       struct amdgpu_hive_info *hive;<br>
+<br>
+                       hive = amdgpu_get_xgmi_hive(adev);<br>
+                       if (hive->device_remove_count == 0)<br>
+                               need_to_reset_gpu = true;<br>
+                       hive->device_remove_count++;<br>
+                       amdgpu_put_xgmi_hive(hive);<br>
+               } else<br>
+                       need_to_reset_gpu = true;<br>
+<br>
+               /* Workaround for ASICs that need an SMU reset.<br>
+                * Called only when the first device is removed.<br>
+                */<br>
+               if (need_to_reset_gpu)<br>
+                       amdgpu_device_gpu_reset(adev);<br>
+       }<br>
+<br>
         amdgpu_driver_unload_kms(dev);<br>
 <br>
         drm_dev_unplug(dev);<br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c<br>
index 28ca0a94b8a5..1f19f9fa4396 100644<br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c<br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c<br>
@@ -2647,7 +2647,15 @@ static int psp_hw_fini(void *handle)<br>
         psp_asd_terminate(psp);<br>
         psp_tmr_terminate(psp);<br>
 <br>
-       psp_ring_destroy(psp, PSP_RING_TYPE__KM);<br>
+       /* If in_remove is true, psp_suspend is called before<br>
+        *  psp_hw_fini. psp ring has been stopped in psp_suspend.<br>
+        */<br>
+       if (adev->in_remove && psp->km_ring.ring_mem)<br>
+               amdgpu_bo_free_kernel(&adev->firmware.rbuf,<br>
+                               &psp->km_ring.ring_mem_mc_addr,<br>
+                               (void **)&psp->km_ring.ring_mem);<br>
+       else<br>
+               psp_ring_destroy(psp, PSP_RING_TYPE__KM);<br>
 <br>
         psp_free_shared_bufs(psp);<br>
 <br>
@@ -2715,6 +2723,12 @@ static int psp_suspend(void *handle)<br>
         }<br>
 <br>
 out:<br>
+       /* If in_remove is true, psp_hw_fini will be called after<br>
+        * psp_suspend. Psp shared buffer will be freed in psp_hw_fini.<br>
+        */<br>
+       if (adev->in_remove)<br>
+               return ret;<br>
+<br>
         psp_free_shared_bufs(psp);<br>
 <br>
         return ret;<br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h<br>
index f71b83c42590..dc43fcb93eac 100644<br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h<br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h<br>
@@ -31,6 +31,7 @@ enum AMDGPU_RESET_FLAGS {<br>
         AMDGPU_NEED_FULL_RESET = 0,<br>
         AMDGPU_SKIP_HW_RESET = 1,<br>
         AMDGPU_SKIP_MODE2_RESET = 2,<br>
+       AMDGPU_RESET_FOR_DEVICE_REMOVE = 3,<br>
 };<br>
 <br>
 struct amdgpu_reset_context {<br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h<br>
index 552e6fb55aa8..30dcc1681b4e 100644<br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h<br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h<br>
@@ -43,6 +43,7 @@ struct amdgpu_hive_info {<br>
         } pstate;<br>
 <br>
         struct amdgpu_reset_domain *reset_domain;<br>
+       uint32_t device_remove_count;<br>
 };<br>
 <br>
 struct amdgpu_pcs_ras_field {<br>
diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c b/drivers/gpu/drm/amd/pm/amdgpu_pm.c<br>
index 5e318b3f6c0f..6be90076c9f3 100644<br>
--- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c<br>
+++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c<br>
@@ -3405,7 +3405,11 @@ int amdgpu_pm_sysfs_init(struct amdgpu_device *adev)<br>
 <br>
 void amdgpu_pm_sysfs_fini(struct amdgpu_device *adev)<br>
 {<br>
-       if (adev->pm.dpm_enabled == 0)<br>
+       /* If in_remove is true, the check for pm.dpm_enabled<br>
+        * needs to be skipped, since smu_suspend is called before<br>
+        * amdgpu_pm_sysfs_fini in the device removal path.<br>
+        */<br>
+       if ((adev->pm.dpm_enabled == 0) && !adev->in_remove)<br>
                 return;<br>
 <br>
         if (adev->pm.int_hwmon_dev)<br>
-- <br>
2.25.1<o:p></o:p></span></p>
</div>
</div>
</div>
</body>
</html>