<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><br class=""><div><br class=""><blockquote type="cite" class=""><div class="">2025年1月4日 01:34,Chen, Xiaogang <<a href="mailto:xiaogang.chen@amd.com" class="">xiaogang.chen@amd.com</a>> 写道:</div><br class="Apple-interchange-newline"><div class=""><meta charset="UTF-8" class=""><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""><span style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; float: none; display: inline !important;" class="">On 1/3/2025 1:43 AM, Shuo Liu wrote:</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""><blockquote type="cite" style="font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; text-decoration: none;" class="">On Fri 3.Jan'25 at 15:02:38 +0800, Gerry Liu wrote:<br class=""><blockquote type="cite" class=""><br class=""><br class=""><blockquote type="cite" class="">2025年1月3日 13:58,Chen, Xiaogang <<a href="mailto:xiaogang.chen@amd.com" class="">xiaogang.chen@amd.com</a>> 写道:<br class=""><br class=""><br class=""><br class="">On 1/1/2025 11:36 PM, Jiang Liu wrote:<br class=""><blockquote type="cite" class="">If some GPU device failed to probe, `rmmod amdgpu` will trigger a use<br class="">after free bug related to amdgpu_driver_release_kms() as:<br class="">2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer dereference, address: 0000000000000000<br class="">2024-12-26 16:17:45 [16002.093792] #PF: supervisor read access in kernel mode<br class="">2024-12-26 16:17:45 [16002.099993] #PF: error_code(0x0000) - not-present page<br class="">2024-12-26 16:17:45 [16002.106188] PGD 0 P4D 0<br class="">2024-12-26 16:17:45 [16002.109464] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI<br class="">2024-12-26 16:17:45 [16002.115372] CPU: 2 PID: 14375 Comm: rmmod Kdump: loaded Tainted: G W E 6.10.0+ #2<br class="">2024-12-26 16:17:45 [16002.125577] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024<br class="">2024-12-26 16:17:45 [16002.136858] RIP: 0010:drm_sched_fini+0x3f/0xe0 [gpu_sched]<br class="">2024-12-26 16:17:45 [16002.143463] Code: 60 c6 87 be 00 00 00 01 e8 ce e0 90 d8 48 8d bb 80 00 00 00 e8 c2 e0 90 d8 8b 43 20 85 c0 74 51 45 31 e4 48 8b<br class="">43 28 4d 63 ec <4a> 8b 2c e8 48 89 ef e8 f5 0e 59 d9 48 8b 45 10 48 8d 55 10 48 39<br class="">2024-12-26 16:17:45 [16002.164992] RSP: 0018:ffffb570dbbb7da8 EFLAGS: 00010246<br class="">2024-12-26 16:17:45 [16002.171316] RAX: 0000000000000000 RBX: ffff96b0fdadc878 RCX: 0000000000000000<br class="">2024-12-26 16:17:46 [16002.179784] RDX: 000fffffffe00000 RSI: 0000000000000000 RDI: ffff96b0fdadc8f8<br class="">2024-12-26 16:17:46 [16002.188252] RBP: ffff96b0fdadc800 R08: ffff97abbd035040 R09: ffffffff9ac52540<br class="">2024-12-26 16:17:46 [16002.196722] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000<br class="">2024-12-26 16:17:46 [16002.205179] R13: 0000000000000000 R14: ffff96b0fdadfc00 R15: 0000000000000000<br class="">2024-12-26 16:17:46 [16002.213648] FS: 00007f2737000740(0000) GS:ffff97abbd100000(0000) knlGS:0000000000000000<br class="">2024-12-26 16:17:46 [16002.223189] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033<br class="">2024-12-26 16:17:46 [16002.230103] CR2: 0000000000000000 CR3: 000000011be3a005 CR4: 0000000000f70ef0<br class="">2024-12-26 16:17:46 [16002.238581] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000<br class="">2024-12-26 16:17:46 [16002.247053] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400<br class="">e024se+0x3c/0x90 [amdxcp]<br class="">2024-12-26 16:17:46 [16002.337645] __do_sys_delete_module.constprop.0+0x176/0x310<br class="">2024-12-26 16:17:46 [16002.344324] do_syscall_64+0x5d/0x170<br class="">2024-12-26 16:17:46 [16002.348858] entry_SYSCALL_64_after_hwframe+0x76/0x7e<br class="">2024-12-26 16:17:46 [16002.354956] RIP: 0033:0x7f2736a620cb-12-26<br class=""><br class="">Fix it by unplugging xcp drm devices when failed to probe GPU devices.<br class=""><br class="">Signed-off-by: Jiang Liu <<a href="mailto:gerry@linux.alibaba.com" class="">gerry@linux.alibaba.com</a>> <<a href="mailto:gerry@linux.alibaba.com" class="">mailto:gerry@linux.alibaba.com</a>><br class="">Tested-by: Shuo Liu <<a href="mailto:shuox.liu@linux.alibaba.com" class="">shuox.liu@linux.alibaba.com</a>> <<a href="mailto:shuox.liu@linux.alibaba.com" class="">mailto:shuox.liu@linux.alibaba.com</a>><br class="">---<br class=""> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 4 +++-<br class=""> drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c | 1 +<br class=""> 2 files changed, 4 insertions(+), 1 deletion(-)<br class=""><br class="">diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c<br class="">index 5ffe1dad9622..e7f35e3a6d2d 100644<br class="">--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c<br class="">+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c<br class="">@@ -164,8 +164,10 @@ int amdgpu_driver_load_kms(struct amdgpu_device *adev, unsigned long flags)<br class=""> <span class="Apple-converted-space"> </span>DRM_WARN("smart shift update failed\n");<br class=""><br class=""> out:<br class="">- if (r)<br class="">+ if (r) {<br class="">+ amdgpu_xcp_dev_unplug(adev);<br class=""> <span class="Apple-converted-space"> </span>amdgpu_driver_unload_kms(dev);<br class="">+ }<br class=""><br class=""></blockquote>I wonder if you can call amdgpu_xcp_dev_unplug here. It will call drm_dev_unplug that unplugs a hotpluggable DRM device. In you case amdgpu_driver_load_kms failed during probe time. We need unwind amdgpu_driver_load_kms. Why need do unplug a DRM device?<br class=""><br class="">The backtrace show:<br class=""><br class="">2024-12-26 16:17:45 [16002.136858] RIP: 0010:drm_sched_fini+0x3f/0xe0 [gpu_sched]<br class="">has:<br class=""><br class="">2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer dereference, address: 0000000000000000<br class="">How this change is related to the invalid access above? Can you explain more?<br class=""><br class=""></blockquote>Per my understanding, `amdgpu_xcp_dev_unplug(adev)` is a workaround for design flaw of the amdgpu_xcp driver.<br class="">Currently the amdgpu_xcp driver assumes everything will go as expected and there’s no failure or GPU hot-lug operations.<br class="">So it only provides an interface `amdgpu_xcp_drm_dev_alloc()` to create xcp devices for a GPU instance, and another interface<br class="">`amdgpu_xcp_drv_release()` to destroy xcp devices for GPU instances. There’s no interface to undo what done by `amdgpu_xcp_drm_dev_alloc()`.<br class="">And I found `amdgpu_xcp_dev_unplug(adev)` can undo work done by `amdgpu_xcp_drm_dev_alloc()`.<br class=""><br class="">So it’s a workaround, the fundamental solution should be to enhance amdgpu_xcp driver to provide interface to unroll work<br class="">done by `amdgpu_xcp_drm_dev_alloc()`.<br class=""></blockquote>Agree. Actually, some ops of amdgpu_partition_driver cannot be reused<br class="">directly by amd_xcp drm device. DRM doesn't use its minor->dev as param of such callbacks, just like .release(). When<br class="">amdgpu_driver_release_kms() use a amd_xcp drm dev to find the `struct<br class="">amdgpu_device *adev`, unexcepted memory accesses.<br class=""></blockquote><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""><span style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; float: none; display: inline !important;" class="">I understand the issue is from xcp driver, but I do not see it is a right workaround. The solution should be unwinding amdgpu_xcp_drm_dev_alloc(), not unplug drm device though part of amdgpu_xcp_dev_unplug(adev) may meet what you want.</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""><span style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; float: none; display: inline !important;" class="">still, the bracktrace has</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""><span style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; float: none; display: inline !important;" class="">2024-12-26 16:17:45 [16002.136858] RIP: 0010:drm_sched_fini+0x3f/0xe0 [gpu_sched]</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""><span style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; float: none; display: inline !important;" class="">How the root cause of that is from xcp driver?</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""></div></blockquote><div>After a second thought, I will try to add a new patch to fix the root cause.</div><div>With the new patch applied, current workaround becomes the right fix then:)</div><br class=""><blockquote type="cite" class=""><div class=""><span style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; float: none; display: inline !important;" class="">Regards</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""><span style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; float: none; display: inline !important;" class="">Xiaogang</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""><blockquote type="cite" style="font-family: Helvetica; font-size: 18px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""><br class="">shuo<br class=""><blockquote type="cite" class=""><br class="">The code paths to trigger the use after free are:<br class="">1) in function amdgpu_xcp_dev_alloc(), we allocate a drm_device by calling amdgpu_xcp_drm_dev_alloc(), and then change the device’s driver to amdgpu_partition_driver.<br class="">2) in function amdgpu_xcp_dev_unplug(), it restores drm_device’s driver to the original device driver.<br class=""><br class="">In function amdgpu_driver_load_kms(), if we don’t call amdgpu_xcp_dev_unplug() on error recover path, the `xcp_dev[idx].driver` will still points to amdgpu_partition_driver.<br class="">Later when amdgpu_xcp_drv_release() gets called, it will call platform_device_unregister() -> amdgpu_partition_driver.release -> amdgpu_driver_release_kms().<br class="">Here when amdgpu_driver_release_kms() gets called, the corresponding amdgpu_device object has already been released on error recovery path in function amdgpu_pci_probe().<br class=""><br class="">So just like amdgpu_pci_remove(), we call amdgpu_xcp_gpu_dev_unplug() before calling amdkcl_drm_dev_release().<br class=""><blockquote type="cite" class="">Regards<br class=""><br class="">Xiaogang<br class=""><br class=""><br class=""><br class=""><br class=""><br class=""><blockquote type="cite" class=""> <span class="Apple-converted-space"> </span>return r;<br class=""> }<br class="">diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c<br class="">index a6d456ec6aeb..ef4eaacf67f6 100644<br class="">--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c<br class="">+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c<br class="">@@ -382,6 +382,7 @@ void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev)<br class=""> <span class="Apple-converted-space"> </span>p_ddev->primary->dev = adev->xcp_mgr->xcp[i].pdev;<br class=""> <span class="Apple-converted-space"> </span>p_ddev->driver = adev->xcp_mgr->xcp[i].driver;<br class=""> <span class="Apple-converted-space"> </span>p_ddev->vma_offset_manager = adev->xcp_mgr->xcp[i].vma_offset_manager;<br class="">+ adev->xcp_mgr->xcp[i].ddev = NULL;<br class=""> <span class="Apple-converted-space"> </span>}<br class=""> }</blockquote></blockquote></blockquote></blockquote></div></blockquote></div><br class=""></body></html>