<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><br class=""><div><br class=""><blockquote type="cite" class=""><div class="">2025年1月3日 14:19,Chen, Xiaogang <<a href="mailto:xiaogang.chen@amd.com" class="">xiaogang.chen@amd.com</a>> 写道:</div><br class="Apple-interchange-newline"><div class="">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" class="">
  
  <div class=""><p class=""><br class="">
    </p>
    <div class="moz-cite-prefix">On 1/2/2025 11:55 PM, Gerry Liu wrote:<br class="">
    </div>
    <blockquote type="cite" cite="mid:DFEBAA6C-D1D8-42BD-8934-58011EBDBFF4@linux.alibaba.com" class="">
      
      <br class="">
      <div class=""><br class="">
        <blockquote type="cite" class="">
          <div class="">2025年1月3日 13:44,Chen, Xiaogang <<a href="mailto:xiaogang.chen@amd.com" class="moz-txt-link-freetext" moz-do-not-send="true">xiaogang.chen@amd.com</a>>
            写道:</div>
          <br class="Apple-interchange-newline">
          <div class="">
            <div class=""><p class=""><br class="">
              </p>
              <div class="moz-cite-prefix">On 1/2/2025 8:22 PM, Gerry
                Liu wrote:<br class="">
              </div>
              <blockquote type="cite" cite="mid:3CAD4155-244E-44EC-9EC4-D441E17DBEA2@linux.alibaba.com" class=""> <br class="">
                <div class=""><br class="">
                  <blockquote type="cite" class="">
                    <div class="">2025年1月3日 07:08,Chen, Xiaogang <<a href="mailto:xiaogang.chen@amd.com" class="moz-txt-link-freetext" moz-do-not-send="true">xiaogang.chen@amd.com</a>>
                      写道:</div>
                    <br class="Apple-interchange-newline">
                    <div class="">
                      <div class=""><p class=""><br class="">
                        </p>
                        <div class="moz-cite-prefix">On 1/1/2025 11:36
                          PM, Jiang Liu wrote:<br class="">
                        </div>
                        <blockquote type="cite" cite="mid:7aace7d239b729340e311ad6e08a14f60b87a361.1735795671.git.gerry@linux.alibaba.com" class="">
                          <pre wrap="" class="moz-quote-pre">On error recover path during device probe, it may trigger invalid
memory access as below:
024-12-25 12:00:53 [ 2703.773040] general protection fault, probably for non-canonical address 0x52445f4749464e4f: 0000 [#1] SMP NOPTI
2024-12-25 12:00:53 [ 2703.785199] CPU: 157 PID: 151951 Comm: rmmod Kdump: loaded Tainted: G        W  OE     5.10.134-17.2.al8.x86_64 #1
2024-12-25 12:00:53 [ 2703.797515] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
2024-12-25 12:00:53 [ 2703.809188] RIP: 0010:kgd2kfd_device_exit+0x6/0x60 [amdgpu]
2024-12-25 12:00:53 [ 2703.816136] Code: ff 48 c7 83 38 01 00 00 80 45 e4 c0 c7 83 40 01 00 00 08 0f 00 00 e9 cd fa ff ff 66 0f 1f 84 00 00 00 00 00 0f
1f 44 00 00 55 <80> bf 28 01 00 00 00 48 89 fd 75 09 48 89 ef 5d e9 65 df 9d f4 8b
2024-12-25 12:00:54 [ 2703.838622] RSP: 0018:ffffb5313df07e10 EFLAGS: 00010202
2024-12-25 12:00:54 [ 2703.845216] RAX: 0000000000000000 RBX: ffff97ad689a3ff0 RCX: 0000000080400014
2024-12-25 12:00:54 [ 2703.853935] RDX: 0000000080400015 RSI: ffff97ad627e93d8 RDI: 52445f4749464e4f
2024-12-25 12:00:54 [ 2703.862652] RBP: ffff97ad689a3ff0 R08: 0000000000000000 R09: ffffffffb5814c00
2024-12-25 12:00:54 [ 2703.871368] R10: ffff97ad627e9280 R11: 0000000000000001 R12: ffffb5313df07e98
2024-12-25 12:00:54 [ 2703.880068] R13: ffff97ad689a1810 R14: 0000000000000001 R15: 0000000000000000
2024-12-25 12:00:54 [ 2703.888768] FS:  00007fa4db81e740(0000) GS:ffff98a93ec80000(0000) knlGS:0000000000000000
2024-12-25 12:00:54 [ 2703.898547] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-12-25 12:00:54 [ 2703.905684] CR2: 00007f4502deca00 CR3: 000001008fc50001 CR4: 0000000002770ee0
2024-12-25 12:00:54 [ 2703.914382] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2024-12-25 12:00:54 [ 2703.923066] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
2024-12-25 12:00:54 [ 2703.931746] PKRU: 55555554
2024-12-25 12:00:54 [ 2703.935444] Call Trace:
2024-12-25 12:00:54 [ 2703.938962]  amdgpu_amdkfd_device_fini_sw+0x1a/0x40 [amdgpu]
2024-12-25 12:00:54 [ 2703.946080]  amdgpu_device_ip_fini.isra.0+0x3d/0x1b0 [amdgpu]
2024-12-25 12:00:54 [ 2703.953278]  amdgpu_device_fini_sw+0x2a/0x240 [amdgpu]
2024-12-25 12:00:54 [ 2703.959789]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
2024-12-25 12:00:54 [ 2703.966501]  devm_drm_dev_init_release+0x42/0x70 [drm]
2024-12-25 12:00:54 [ 2703.972891]  release_nodes+0x6e/0xb0
2024-12-25 12:00:54 [ 2703.977522]  amdgpu_xcp_drv_release+0x38/0x80 [amdxcp]
2024-12-25 12:00:54 [ 2703.983906]  __do_sys_delete_module.constprop.0+0x138/0x2a0
2024-12-25 12:00:54 [ 2703.990775]  ? exit_to_user_mode_loop+0xd6/0x120
2024-12-25 12:00:54 [ 2703.996563]  do_syscall_64+0x2e/0x50
2024-12-25 12:00:54 [ 2704.001166]  entry_SYSCALL_64_after_hwframe+0x62/0xc7
2024-12-25 12:00:54 [ 2704.007432] RIP: 0033:0x7fa4db2620cb
2024-12-25 12:00:54 [ 2704.012025] Code: 73 01 c3 48 8b 0d a5 6d 19 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0
00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 6d 19 00 f7 d8 64 89 01 48

Signed-off-by: Jiang Liu <a class="moz-txt-link-rfc2396E" href="mailto:gerry@linux.alibaba.com" moz-do-not-send="true"><gerry@linux.alibaba.com></a>
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index b6c5ffd4630b..58c1b5fcf785 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -663,6 +663,8 @@ static void kfd_cleanup_nodes(struct kfd_dev *kfd, unsigned int num_nodes)
 
        for (i = 0; i < num_nodes; i++) {
                knode = kfd->nodes[i];
+               if (!knode)
+                       continue;</pre>
                        </blockquote><p class="">I wonder how knode can be null here?
                          <span style="white-space: pre-wrap" class="">kfd_cleanup_nodes</span>
                          is called by kgd2kfd_device_exit under
                          condition if (kfd->init_complete).
                          kfd->init_complete is true only after all
                          kfd node got initialized at
                          kgd2kfd_device_init. If kfd driver init fail 
                          kfd->init_complete would be false, then <span style="white-space: pre-wrap" class="">kfd_cleanup_node would not get called.</span></p>
                      </div>
                    </div>
                  </blockquote>
                  <div class="">Hi Xiaogang,</div>
                  <div class=""><span class="Apple-tab-span" style="white-space:pre"> </span>It may get triggered
                    on error handling path of ‘kid2kfd_device_init()`,
                    when it jump to label `node_alloc_error` and </div>
                  <div class="">then call `kfd_cleanup_nodes()`.</div>
                  <div class=""><br class="">
                  </div>
                </div>
              </blockquote><p class="">If it was the case kzalloc failed. That means
                there is no memory available even to allocate struct
                kfd_node. From the backtrace the <span style="white-space: pre-wrap" class="">general protection fault happened at </span></p>
              <pre wrap="" class="moz-quote-pre">RIP: 0010:kgd2kfd_device_exit+0x6/0x60 [amdgpu]

It happened during driver module got released, not init time. I do not see how the patch is related to the issue you talked.
</pre>
            </div>
          </div>
        </blockquote>
        <div class="">Aha, it’s caused by another bug which got fixed by "<font face="PingFang SC" class=""><span style="" class="">[PATCH 4/6]
              amdgpu: fix use after free bug related to
              amdgpu_driver_release_kms()</span><span style="caret-color: rgba(0, 0, 0, 0.85);" class="">”</span><span style="" class="">.</span></font></div>
        <div class=""><span style="font-family: "PingFang SC";" class="">Without the fix in "[PATCH 4/6] amdgpu: fix use
            after free bug related to amdgpu_driver_release_kms()</span><span style="font-family: "PingFang SC"; caret-color: rgba(0, 0, 0, 0.85);" class="">”</span><span style="" class=""><font face="PingFang SC" class="">,
              kgd2kfd_device_exit() will got called</font></span></div>
        <div class=""><font face="PingFang SC" class=""><span style="caret-color: rgba(0, 0, 0, 0.85);" class="">twice</span><span style="" class=""> in case of
              device probe failure. I caused me some time to figure out
              the double free issue.</span></font></div>
        <div class=""><span style="" class=""><font face="PingFang SC" class="">Actually we should reset
              kfd->init_completed to false in function
              kgd2kfd_device_exit().</font></span></div>
      </div>
    </blockquote><p class=""><font face="PingFang SC" class="">We can set </font><span style="" class=""><font face="PingFang SC" class=""> kfd->init_completed = false at
          end of </font></span><span style="" class=""><font face="PingFang SC" class="">kgd2kfd_device_exit, but how </font></span><span style="" class=""><font face="PingFang SC" class="">kgd2kfd_device_exit was called two
          times? is there another bug caused that?</font></span></p></div></div></blockquote><div>I guess it caused by another bug related to the way amdgpu cooperates with the amdgpu_xcp driver. It would be better to enhance amdgpu_xcp driver either.</div><br class=""><blockquote type="cite" class=""><div class=""><div class=""><p class=""><span style="" class=""><font face="PingFang SC" class="">Regards</font></span></p><p class=""><span style="" class=""><font face="PingFang SC" class="">Xiaogang<br class="">
        </font></span></p><div class=""><span style="" class=""></span><br class="webkit-block-placeholder"></div>
    <blockquote type="cite" cite="mid:DFEBAA6C-D1D8-42BD-8934-58011EBDBFF4@linux.alibaba.com" class="">
      <div class="">
        <div class=""><br class="">
        </div>
        <blockquote type="cite" class="">
          <div class="">
            <div class="">
              <pre wrap="" class="moz-quote-pre">Regards
Xiaogang


</pre>
              <div class=""><br class="webkit-block-placeholder">
              </div>
              <blockquote type="cite" cite="mid:3CAD4155-244E-44EC-9EC4-D441E17DBEA2@linux.alibaba.com" class="">
                <div class="">
                  <div class="">Thanks,</div>
                  <div class="">Gerry</div>
                  <br class="">
                  <blockquote type="cite" class="">
                    <div class="">
                      <div class="">
                        <div class=""><br class="webkit-block-placeholder">
                        </div><p class=""><span style="white-space: pre-wrap" class="">Regards</span></p><p class=""><span style="white-space: pre-wrap" class="">Xiaogang
</span></p>
                        <blockquote type="cite" cite="mid:7aace7d239b729340e311ad6e08a14f60b87a361.1735795671.git.gerry@linux.alibaba.com" class="">
                          <pre wrap="" class="moz-quote-pre">                 device_queue_manager_uninit(knode->dqm);
                kfd_interrupt_exit(knode);
                kfd_topology_remove_device(knode);
@@ -954,7 +956,10 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd)
                kfd_doorbell_fini(kfd);
                ida_destroy(&kfd->doorbell_ida);
                kfd_gtt_sa_fini(kfd);
-               amdgpu_amdkfd_free_gtt_mem(kfd->adev, &kfd->gtt_mem);
+               if (kfd->gtt_mem) {
+                       amdgpu_amdkfd_free_gtt_mem(kfd->adev, &kfd->gtt_mem);
+                       kfd->gtt_mem = NULL;
+               }
        }
 
        kfree(kfd);
</pre>
                        </blockquote>
                      </div>
                    </div>
                  </blockquote>
                </div>
                <br class="">
              </blockquote>
            </div>
          </div>
        </blockquote>
      </div>
      <br class="">
    </blockquote>
  </div>

</div></blockquote></div><br class=""></body></html>