<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On Jan 4, 2025, at 01:33, Chen, Xiaogang <<a href="mailto:xiaogang.chen@amd.com" class="">xiaogang.chen@amd.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class="">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" class="">
  
  <div class=""><p class=""><br class="">
    </p>
    <div class="moz-cite-prefix">On 1/3/2025 1:05 AM, Gerry Liu wrote:<br class="">
    </div>
    <blockquote type="cite" cite="mid:F7876602-343C-44AE-AE5A-A0D69BE8B8A8@linux.alibaba.com" class="">
      
      <br class="">
      <div class=""><br class="">
        <blockquote type="cite" class="">
          <div class="">2025年1月3日 14:19,Chen, Xiaogang <<a href="mailto:xiaogang.chen@amd.com" class="moz-txt-link-freetext" moz-do-not-send="true">xiaogang.chen@amd.com</a>>
            写道:</div>
          <br class="Apple-interchange-newline">
          <div class="">
            <div class=""><p class=""><br class="">
              </p>
              <div class="moz-cite-prefix">On 1/2/2025 11:55 PM, Gerry
                Liu wrote:<br class="">
              </div>
              <blockquote type="cite" cite="mid:DFEBAA6C-D1D8-42BD-8934-58011EBDBFF4@linux.alibaba.com" class=""> <br class="">
                <div class=""><br class="">
                  <blockquote type="cite" class="">
                    <div class="">2025年1月3日 13:44,Chen, Xiaogang <<a href="mailto:xiaogang.chen@amd.com" class="moz-txt-link-freetext" moz-do-not-send="true">xiaogang.chen@amd.com</a>>
                      写道:</div>
                    <br class="Apple-interchange-newline">
                    <div class="">
                      <div class=""><p class=""><br class="">
                        </p>
                        <div class="moz-cite-prefix">On 1/2/2025 8:22
                          PM, Gerry Liu wrote:<br class="">
                        </div>
                        <blockquote type="cite" cite="mid:3CAD4155-244E-44EC-9EC4-D441E17DBEA2@linux.alibaba.com" class=""> <br class="">
                          <div class=""><br class="">
                            <blockquote type="cite" class="">
                              <div class="">2025年1月3日 07:08,Chen,
                                Xiaogang <<a href="mailto:xiaogang.chen@amd.com" class="moz-txt-link-freetext" moz-do-not-send="true">xiaogang.chen@amd.com</a>>
                                写道:</div>
                              <br class="Apple-interchange-newline">
                              <div class="">
                                <div class=""><p class=""><br class="">
                                  </p>
                                  <div class="moz-cite-prefix">On
                                    1/1/2025 11:36 PM, Jiang Liu wrote:<br class="">
                                  </div>
                                  <blockquote type="cite" cite="mid:7aace7d239b729340e311ad6e08a14f60b87a361.1735795671.git.gerry@linux.alibaba.com" class="">
                                    <pre wrap="" class="moz-quote-pre">On error recover path during device probe, it may trigger invalid
memory access as below:
024-12-25 12:00:53 [ 2703.773040] general protection fault, probably for non-canonical address 0x52445f4749464e4f: 0000 [#1] SMP NOPTI
2024-12-25 12:00:53 [ 2703.785199] CPU: 157 PID: 151951 Comm: rmmod Kdump: loaded Tainted: G        W  OE     5.10.134-17.2.al8.x86_64 #1
2024-12-25 12:00:53 [ 2703.797515] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
2024-12-25 12:00:53 [ 2703.809188] RIP: 0010:kgd2kfd_device_exit+0x6/0x60 [amdgpu]
2024-12-25 12:00:53 [ 2703.816136] Code: ff 48 c7 83 38 01 00 00 80 45 e4 c0 c7 83 40 01 00 00 08 0f 00 00 e9 cd fa ff ff 66 0f 1f 84 00 00 00 00 00 0f
1f 44 00 00 55 <80> bf 28 01 00 00 00 48 89 fd 75 09 48 89 ef 5d e9 65 df 9d f4 8b
2024-12-25 12:00:54 [ 2703.838622] RSP: 0018:ffffb5313df07e10 EFLAGS: 00010202
2024-12-25 12:00:54 [ 2703.845216] RAX: 0000000000000000 RBX: ffff97ad689a3ff0 RCX: 0000000080400014
2024-12-25 12:00:54 [ 2703.853935] RDX: 0000000080400015 RSI: ffff97ad627e93d8 RDI: 52445f4749464e4f
2024-12-25 12:00:54 [ 2703.862652] RBP: ffff97ad689a3ff0 R08: 0000000000000000 R09: ffffffffb5814c00
2024-12-25 12:00:54 [ 2703.871368] R10: ffff97ad627e9280 R11: 0000000000000001 R12: ffffb5313df07e98
2024-12-25 12:00:54 [ 2703.880068] R13: ffff97ad689a1810 R14: 0000000000000001 R15: 0000000000000000
2024-12-25 12:00:54 [ 2703.888768] FS:  00007fa4db81e740(0000) GS:ffff98a93ec80000(0000) knlGS:0000000000000000
2024-12-25 12:00:54 [ 2703.898547] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-12-25 12:00:54 [ 2703.905684] CR2: 00007f4502deca00 CR3: 000001008fc50001 CR4: 0000000002770ee0
2024-12-25 12:00:54 [ 2703.914382] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2024-12-25 12:00:54 [ 2703.923066] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
2024-12-25 12:00:54 [ 2703.931746] PKRU: 55555554
2024-12-25 12:00:54 [ 2703.935444] Call Trace:
2024-12-25 12:00:54 [ 2703.938962]  amdgpu_amdkfd_device_fini_sw+0x1a/0x40 [amdgpu]
2024-12-25 12:00:54 [ 2703.946080]  amdgpu_device_ip_fini.isra.0+0x3d/0x1b0 [amdgpu]
2024-12-25 12:00:54 [ 2703.953278]  amdgpu_device_fini_sw+0x2a/0x240 [amdgpu]
2024-12-25 12:00:54 [ 2703.959789]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
2024-12-25 12:00:54 [ 2703.966501]  devm_drm_dev_init_release+0x42/0x70 [drm]
2024-12-25 12:00:54 [ 2703.972891]  release_nodes+0x6e/0xb0
2024-12-25 12:00:54 [ 2703.977522]  amdgpu_xcp_drv_release+0x38/0x80 [amdxcp]
2024-12-25 12:00:54 [ 2703.983906]  __do_sys_delete_module.constprop.0+0x138/0x2a0
2024-12-25 12:00:54 [ 2703.990775]  ? exit_to_user_mode_loop+0xd6/0x120
2024-12-25 12:00:54 [ 2703.996563]  do_syscall_64+0x2e/0x50
2024-12-25 12:00:54 [ 2704.001166]  entry_SYSCALL_64_after_hwframe+0x62/0xc7
2024-12-25 12:00:54 [ 2704.007432] RIP: 0033:0x7fa4db2620cb
2024-12-25 12:00:54 [ 2704.012025] Code: 73 01 c3 48 8b 0d a5 6d 19 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0
00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 6d 19 00 f7 d8 64 89 01 48

Signed-off-by: Jiang Liu <a class="moz-txt-link-rfc2396E" href="mailto:gerry@linux.alibaba.com" moz-do-not-send="true"><gerry@linux.alibaba.com></a>
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index b6c5ffd4630b..58c1b5fcf785 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -663,6 +663,8 @@ static void kfd_cleanup_nodes(struct kfd_dev *kfd, unsigned int num_nodes)
 
        for (i = 0; i < num_nodes; i++) {
                knode = kfd->nodes[i];
+               if (!knode)
+                       continue;</pre>
                                  </blockquote><p class="">I wonder how knode can be
                                    null here? <span style="white-space: pre-wrap" class="">kfd_cleanup_nodes</span> is
                                    called by kgd2kfd_device_exit under
                                    condition if
                                    (kfd->init_complete).
                                    kfd->init_complete is true only
                                    after all kfd node got initialized
                                    at kgd2kfd_device_init. If kfd
                                    driver init fail 
                                    kfd->init_complete would be
                                    false, then <span style="white-space: pre-wrap" class="">kfd_cleanup_node would not get called.</span></p>
                                </div>
                              </div>
                            </blockquote>
                            <div class="">Hi Xiaogang,</div>
                            <div class=""><span class="Apple-tab-span" style="white-space:pre">       </span>It may get
                              triggered on error handling path of
                              ‘kid2kfd_device_init()`, when it jump to
                              label `node_alloc_error` and </div>
                            <div class="">then call
                              `kfd_cleanup_nodes()`.</div>
                            <div class=""><br class="">
                            </div>
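                            <div class="">The pattern under discussion, a cleanup loop that tolerates slots that were never filled, can be sketched in isolation. This is a minimal user-space model, not the driver code: `demo_node`, `demo_dev`, `demo_init_nodes()` and `demo_cleanup_nodes()` are hypothetical names, and the choice to pass the full requested count to the cleanup routine is an assumption made only to exercise the NULL check.</div>
                            <pre wrap="" class="moz-quote-pre">
```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_NODES 8

/* Hypothetical stand-ins for the per-node and per-device structures. */
struct demo_node { int id; };

struct demo_dev {
    struct demo_node *nodes[MAX_NODES];
    unsigned int num_nodes;
};

/* Release the first `count` slots, skipping slots that were never filled.
 * Returns the number of nodes actually released. */
unsigned int demo_cleanup_nodes(struct demo_dev *dev, unsigned int count)
{
    unsigned int i, freed = 0;

    for (i = 0; i < count && i < MAX_NODES; i++) {
        struct demo_node *node = dev->nodes[i];

        if (!node)
            continue;          /* never allocated, nothing to tear down */
        free(node);
        dev->nodes[i] = NULL;  /* a repeated cleanup pass becomes a no-op */
        freed++;
    }
    return freed;
}

/* Allocate `want` nodes. `fail_at` simulates an allocation failure at that
 * index (pass a value >= want for no failure). Returns 0 or -1. */
int demo_init_nodes(struct demo_dev *dev, unsigned int want,
                    unsigned int fail_at)
{
    unsigned int i;

    if (want > MAX_NODES)
        return -1;

    for (i = 0; i < want; i++) {
        struct demo_node *node =
            (i == fail_at) ? NULL : calloc(1, sizeof(*node));

        if (!node)
            goto node_alloc_error;
        node->id = (int)i;
        dev->nodes[i] = node;
    }
    dev->num_nodes = want;
    return 0;

node_alloc_error:
    /* Assumption for illustration: clean up the full requested range,
     * so some slots in that range are still NULL. */
    demo_cleanup_nodes(dev, want);
    return -1;
}
```
</pre>
                            <div class="">In this model, a failure at index 2 of 4 leaves slots 2 and 3 empty, and the cleanup loop skips them instead of dereferencing a NULL pointer. Whether the real error path can ever see an empty slot is exactly the question raised in this thread.</div>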
                          </div>
                        </blockquote><p class="">If it was the case kzalloc failed.
                          That means there is no memory available even
                          to allocate struct kfd_node. From the
                          backtrace the <span style="white-space: pre-wrap" class="">general protection fault happened at </span></p>
                        <pre wrap="" class="moz-quote-pre">RIP: 0010:kgd2kfd_device_exit+0x6/0x60 [amdgpu]

It happened during driver module got released, not init time. I do not see how the patch is related to the issue you talked.
</pre>
                      </div>
                    </div>
                  </blockquote>
                  <div class="">Aha, it’s caused by another bug which
                    got fixed by "<font face="PingFang SC" class=""><span style="" class="">[PATCH 4/6] amdgpu: fix use
                        after free bug related to
                        amdgpu_driver_release_kms()</span><span style="caret-color: rgba(0, 0, 0, 0.85);" class="">”</span><span style="" class="">.</span></font></div>
                  <div class=""><span style="font-family: "PingFang SC";" class="">Without the fix in "[PATCH 4/6] amdgpu:
                      fix use after free bug related to
                      amdgpu_driver_release_kms()</span><span style="font-family: "PingFang SC"; caret-color: rgba(0, 0, 0, 0.85);" class="">”</span><span style="" class=""><font face="PingFang SC" class="">,
                        kgd2kfd_device_exit() will got called</font></span></div>
                  <div class=""><font face="PingFang SC" class=""><span style="caret-color: rgba(0, 0, 0, 0.85);" class="">twice</span><span style="" class=""> in
                        case of device probe failure. I caused me some
                        time to figure out the double free issue.</span></font></div>
                  <div class=""><span style="" class=""><font face="PingFang SC" class="">Actually we should
                        reset kfd->init_completed to false in
                        function kgd2kfd_device_exit().</font></span></div>
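                  <div class="">As a rough illustration of that idea, here is a minimal user-space sketch of an exit routine that clears its init flag so that a second call does nothing. `demo_kfd`, `demo_device_init()` and `demo_device_exit()` are hypothetical names, and `teardown_count` exists only to make the behaviour observable. Unlike the function in the quoted diff, which ends with kfree(kfd), this model keeps the device structure alive after exit, so it says nothing about a second call on an already freed structure.</div>
                  <pre wrap="" class="moz-quote-pre">
```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the device structure; only the fields
 * needed for the illustration are modelled. */
struct demo_kfd {
    bool init_complete;
    void *gtt_mem;
    int teardown_count;   /* counts real teardowns, for illustration */
};

/* Returns 0 on success, -1 if the allocation fails. */
int demo_device_init(struct demo_kfd *kfd)
{
    kfd->gtt_mem = malloc(64);
    if (!kfd->gtt_mem)
        return -1;
    kfd->init_complete = true;
    return 0;
}

/* Safe to call more than once: the flag is cleared after the first
 * teardown, so later calls return early. */
void demo_device_exit(struct demo_kfd *kfd)
{
    if (!kfd->init_complete)
        return;

    free(kfd->gtt_mem);
    kfd->gtt_mem = NULL;        /* mirrors the NULL reset in the patch */
    kfd->teardown_count++;
    kfd->init_complete = false; /* makes a repeated exit a no-op */
}
```
</pre>
                  <div class="">Calling demo_device_exit() twice on the same structure tears down the resources once. This only hides a double call; it does not explain why the exit path runs twice in the first place.</div>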
                </div>
              </blockquote><p class=""><font face="PingFang SC" class="">We can set </font><span style="" class=""><font face="PingFang SC" class="">
                    kfd->init_completed = false at end of </font></span><span style="" class=""><font face="PingFang SC" class="">kgd2kfd_device_exit,
                    but how </font></span><span style="" class=""><font face="PingFang SC" class="">kgd2kfd_device_exit was
                    called two times? is there another bug caused that?</font></span></p>
            </div>
          </div>
        </blockquote>
        <div class="">I guess it caused by another bug related to the way amdgpu
          cooperates with the amdgpu_xcp driver. It would be better to
          enhance amdgpu_xcp driver either.</div>
      </div>
    </blockquote><p class="">kfd driver has considered which kfd nodes got initialized and
      release them accordingly. From what saw here seems you may mix
      different issues or not target the real issue. Let's have
      backtrace match the changes.</p></div></div></blockquote>Sure, I will rework this patch with log message to address the possible resource leakage.<br class=""><blockquote type="cite" class=""><div class=""><div class=""><p class="">Regards</p><p class="">Xiaogang<br class="">
    </p>
    <blockquote type="cite" cite="mid:F7876602-343C-44AE-AE5A-A0D69BE8B8A8@linux.alibaba.com" class="">
      <div class=""><br class="">
        <blockquote type="cite" class="">
          <div class="">
            <div class=""><p class=""><span style="" class=""><font face="PingFang SC" class="">Regards</font></span></p><p class=""><span style="" class=""><font face="PingFang SC" class="">Xiaogang<br class="">
                  </font></span></p>
              <div class=""><span style="" class=""></span><br class="webkit-block-placeholder">
              </div>
              <blockquote type="cite" cite="mid:DFEBAA6C-D1D8-42BD-8934-58011EBDBFF4@linux.alibaba.com" class="">
                <div class="">
                  <div class=""><br class="">
                  </div>
                  <blockquote type="cite" class="">
                    <div class="">
                      <div class="">
                        <pre wrap="" class="moz-quote-pre">Regards
Xiaogang


</pre>
                        <div class=""><br class="webkit-block-placeholder">
                        </div>
                        <blockquote type="cite" cite="mid:3CAD4155-244E-44EC-9EC4-D441E17DBEA2@linux.alibaba.com" class="">
                          <div class="">
                            <div class="">Thanks,</div>
                            <div class="">Gerry</div>
                            <br class="">
                            <blockquote type="cite" class="">
                              <div class="">
                                <div class="">
                                  <div class=""><br class="webkit-block-placeholder">
                                  </div><p class=""><span style="white-space: pre-wrap" class="">Regards</span></p><p class=""><span style="white-space: pre-wrap" class="">Xiaogang
</span></p>
                                  <blockquote type="cite" cite="mid:7aace7d239b729340e311ad6e08a14f60b87a361.1735795671.git.gerry@linux.alibaba.com" class="">
                                    <pre wrap="" class="moz-quote-pre">               device_queue_manager_uninit(knode->dqm);
                kfd_interrupt_exit(knode);
                kfd_topology_remove_device(knode);
@@ -954,7 +956,10 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd)
                kfd_doorbell_fini(kfd);
                ida_destroy(&kfd->doorbell_ida);
                kfd_gtt_sa_fini(kfd);
-               amdgpu_amdkfd_free_gtt_mem(kfd->adev, &kfd->gtt_mem);
+               if (kfd->gtt_mem) {
+                       amdgpu_amdkfd_free_gtt_mem(kfd->adev, &kfd->gtt_mem);
+                       kfd->gtt_mem = NULL;
+               }
        }
 
        kfree(kfd);
</pre>
                                  </blockquote>
                                </div>
                              </div>
                            </blockquote>
                          </div>
                          <br class="">
                        </blockquote>
                      </div>
                    </div>
                  </blockquote>
                </div>
                <br class="">
              </blockquote>
            </div>
          </div>
        </blockquote>
      </div>
      <br class="">
    </blockquote>
  </div>

</div></blockquote></div><br class=""></body></html>