<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On Jan 8, 2025, at 06:55, Chen, Xiaogang <<a href="mailto:xiaogang.chen@amd.com" class="">xiaogang.chen@amd.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class="">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" class="">
<div class=""><p class=""><br class="">
</p>
<div class="moz-cite-prefix">On 1/4/2025 8:45 PM, Jiang Liu wrote:<br class="">
</div>
<blockquote type="cite" cite="mid:d097f02c25ff44fced59b10ac72587f304a6637f.1736044362.git.gerry@linux.alibaba.com" class="">
<pre wrap="" class="moz-quote-pre">If a GPU device fails to probe, `rmmod amdgpu` triggers a use-after-free
bug related to amdgpu_driver_release_kms(), as shown below:
2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer dereference, address: 0000000000000000
2024-12-26 16:17:45 [16002.093792] #PF: supervisor read access in kernel mode
2024-12-26 16:17:45 [16002.099993] #PF: error_code(0x0000) - not-present page
2024-12-26 16:17:45 [16002.106188] PGD 0 P4D 0
2024-12-26 16:17:45 [16002.109464] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
2024-12-26 16:17:45 [16002.115372] CPU: 2 PID: 14375 Comm: rmmod Kdump: loaded Tainted: G W E 6.10.0+ #2
2024-12-26 16:17:45 [16002.125577] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
2024-12-26 16:17:45 [16002.136858] RIP: 0010:drm_sched_fini+0x3f/0xe0 [gpu_sched]
2024-12-26 16:17:45 [16002.143463] Code: 60 c6 87 be 00 00 00 01 e8 ce e0 90 d8 48 8d bb 80 00 00 00 e8 c2 e0 90 d8 8b 43 20 85 c0 74 51 45 31 e4 48 8b
43 28 4d 63 ec <4a> 8b 2c e8 48 89 ef e8 f5 0e 59 d9 48 8b 45 10 48 8d 55 10 48 39
2024-12-26 16:17:45 [16002.164992] RSP: 0018:ffffb570dbbb7da8 EFLAGS: 00010246
2024-12-26 16:17:45 [16002.171316] RAX: 0000000000000000 RBX: ffff96b0fdadc878 RCX: 0000000000000000
2024-12-26 16:17:46 [16002.179784] RDX: 000fffffffe00000 RSI: 0000000000000000 RDI: ffff96b0fdadc8f8
2024-12-26 16:17:46 [16002.188252] RBP: ffff96b0fdadc800 R08: ffff97abbd035040 R09: ffffffff9ac52540
2024-12-26 16:17:46 [16002.196722] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
2024-12-26 16:17:46 [16002.205179] R13: 0000000000000000 R14: ffff96b0fdadfc00 R15: 0000000000000000
2024-12-26 16:17:46 [16002.213648] FS: 00007f2737000740(0000) GS:ffff97abbd100000(0000) knlGS:0000000000000000
2024-12-26 16:17:46 [16002.223189] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-12-26 16:17:46 [16002.230103] CR2: 0000000000000000 CR3: 000000011be3a005 CR4: 0000000000f70ef0
2024-12-26 16:17:46 [16002.238581] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2024-12-26 16:17:46 [16002.247053] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
e024se+0x3c/0x90 [amdxcp]
2024-12-26 16:17:46 [16002.337645] __do_sys_delete_module.constprop.0+0x176/0x310
2024-12-26 16:17:46 [16002.344324] do_syscall_64+0x5d/0x170
2024-12-26 16:17:46 [16002.348858] entry_SYSCALL_64_after_hwframe+0x76/0x7e
2024-12-26 16:17:46 [16002.354956] RIP: 0033:0x7f2736a620cb
Fix it by unplugging the xcp drm devices when a GPU device fails to probe.
Signed-off-by: Jiang Liu <a class="moz-txt-link-rfc2396E" href="mailto:gerry@linux.alibaba.com"><gerry@linux.alibaba.com></a>
Tested-by: Shuo Liu <a class="moz-txt-link-rfc2396E" href="mailto:shuox.liu@linux.alibaba.com"><shuox.liu@linux.alibaba.com></a>
Reviewed-by: Lijo Lazar <a class="moz-txt-link-rfc2396E" href="mailto:lijo.lazar@amd.com"><lijo.lazar@amd.com></a>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index d2a046736edd..9ebc0d47d1cb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -165,8 +165,10 @@ int amdgpu_driver_load_kms(struct amdgpu_device *adev, unsigned long flags)
DRM_WARN("smart shift update failed\n");
out:
- if (r)
+ if (r) {
+ amdgpu_xcp_dev_unplug(adev);</pre>
</blockquote><p class="">You have added <span style="white-space: pre-wrap" class="">amdgpu_xcp_drm_dev_free, so why still use </span><span style="white-space: pre-wrap" class="">amdgpu_xcp_dev_unplug</span> here? I
think you want to undo <span style="white-space: pre-wrap" class="">amdgpu_xcp_drm_dev_alloc in the error path. Why involve adev device unplug? It is a different scenario.</span></p></div></div></blockquote>Hi Xiaogang,</div><div>I don’t quite follow your point yet. The current call flow is:</div><div>amdgpu_pci_probe()</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>-> amdgpu_driver_load_kms()</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>-> amdgpu_device_init()</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>-> amdgpu_discovery_set_ip_blocks()</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>-> amdgpu_xcp_mgr_init()</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>-> amdgpu_xcp_alloc_dev()</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>-> amdgpu_xcp_dev_register()</div><div>amdgpu_xcp_dev_unplug() undoes the work done by amdgpu_xcp_dev_register() and part of the work done by amdgpu_xcp_mgr_init().</div><div>How about splitting amdgpu_xcp_dev_unplug() into amdgpu_xcp_dev_deregister() and amdgpu_xcp_mgr_fini()? Then things should be much clearer.</div><div>It also seems that the xcp_mgr object is currently leaked, and that there are resource leaks on the error handling path of amdgpu_pci_probe().</div><div>The above proposal may also help fix those leaks.</div><div>Thanks,</div><div>Gerry</div><div><blockquote type="cite" class=""><div class=""><div class=""><p class=""><span style="white-space: pre-wrap" class="">Regards</span></p><p class=""><span style="white-space: pre-wrap" class="">Xiaogang
</span></p>
<blockquote type="cite" cite="mid:d097f02c25ff44fced59b10ac72587f304a6637f.1736044362.git.gerry@linux.alibaba.com" class="">
<pre wrap="" class="moz-quote-pre"> amdgpu_driver_unload_kms(dev);
+ }
return r;
}
</pre>
</blockquote>
</div>
</div></blockquote></div><br class=""></body></html>