[PATCH v2 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms()

Sun Jan 5 05:16:26 UTC 2025

On Sun  5.Jan'25 at 10:45:32 +0800, Jiang Liu wrote:
>If some GPU device failed to probe, `rmmod amdgpu` will trigger a use
>after free bug related to amdgpu_driver_release_kms() as:
>2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer dereference, address: 0000000000000000
>2024-12-26 16:17:45 [16002.093792] #PF: supervisor read access in kernel mode
>2024-12-26 16:17:45 [16002.099993] #PF: error_code(0x0000) - not-present page
>2024-12-26 16:17:45 [16002.106188] PGD 0 P4D 0
>2024-12-26 16:17:45 [16002.109464] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
>2024-12-26 16:17:45 [16002.115372] CPU: 2 PID: 14375 Comm: rmmod Kdump: loaded Tainted: G        W   E      6.10.0+ #2
>2024-12-26 16:17:45 [16002.125577] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
>2024-12-26 16:17:45 [16002.136858] RIP: 0010:drm_sched_fini+0x3f/0xe0 [gpu_sched]
>2024-12-26 16:17:45 [16002.143463] Code: 60 c6 87 be 00 00 00 01 e8 ce e0 90 d8 48 8d bb 80 00 00 00 e8 c2 e0 90 d8 8b 43 20 85 c0 74 51 45 31 e4 48 8b
>43 28 4d 63 ec <4a> 8b 2c e8 48 89 ef e8 f5 0e 59 d9 48 8b 45 10 48 8d 55 10 48 39
>2024-12-26 16:17:45 [16002.164992] RSP: 0018:ffffb570dbbb7da8 EFLAGS: 00010246
>2024-12-26 16:17:45 [16002.171316] RAX: 0000000000000000 RBX: ffff96b0fdadc878 RCX: 0000000000000000
>2024-12-26 16:17:46 [16002.179784] RDX: 000fffffffe00000 RSI: 0000000000000000 RDI: ffff96b0fdadc8f8
>2024-12-26 16:17:46 [16002.188252] RBP: ffff96b0fdadc800 R08: ffff97abbd035040 R09: ffffffff9ac52540
>2024-12-26 16:17:46 [16002.196722] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>2024-12-26 16:17:46 [16002.205179] R13: 0000000000000000 R14: ffff96b0fdadfc00 R15: 0000000000000000
>2024-12-26 16:17:46 [16002.213648] FS:  00007f2737000740(0000) GS:ffff97abbd100000(0000) knlGS:0000000000000000
>2024-12-26 16:17:46 [16002.223189] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>2024-12-26 16:17:46 [16002.230103] CR2: 0000000000000000 CR3: 000000011be3a005 CR4: 0000000000f70ef0
>2024-12-26 16:17:46 [16002.238581] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>2024-12-26 16:17:46 [16002.247053] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>e024se+0x3c/0x90 [amdxcp]
This calltrace seems corrupted.
You can use this one from my test.

BUG: unable to handle page fault for address: 000000010000008f
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 139 PID: 10172 Comm: rmmod Kdump: loaded Tainted: G            E      6.10.0 #5
RIP: 0010:amdgpu_fence_driver_sw_fini+0x3b/0xd0 [amdgpu]
Code: 00 41 55 49 89 fd 41 54 49 81 c5 78 07 01 00 41 bc ff ff ff ff 55 53 eb 09 49 83 c6 08 4d 39 ee 74 7d 49 8b 2e 48 85 ed 74 ef <80> 7d 28 00 74 e9 48 83 7d 78 00 74 09 48 8d 7d 78 e8 ff 65 e1 ff
RSP: 0018:ffffb8e77badfde0 EFLAGS: 00010202
RAX: ffffffffc09e13b0 RBX: ffffa1461a4287f0 RCX: 00000000000000c0
RDX: 00000000000000c0 RSI: 00000000000000c0 RDI: ffffa1461a4287f0
RBP: 0000000100000067 R08: 00000000000000c0 R09: ffffa146112a6c48
R10: 0000000000040000 R11: 0000000000000007 R12: 00000000ffffffff
R13: ffffa1461a438f68 R14: ffffa1461a438b88 R15: 0000000000000000
FS:  00007f98ed908740(0000) GS:ffffa141ffd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000010000008f CR3: 0000000127d40003 CR4: 0000000000f70ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
  <TASK>
  ? __die+0x24/0x70
  ? page_fault_oops+0x65/0x140
  ? exc_page_fault+0x68/0x140
  ? asm_exc_page_fault+0x26/0x30
  ? __pfx_amdgpu_driver_release_kms+0x10/0x10 [amdgpu]
  ? amdgpu_fence_driver_sw_fini+0x3b/0xd0 [amdgpu]
  amdgpu_device_fini_sw+0x16/0x220 [amdgpu]
  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
  devm_drm_dev_init_release+0x4e/0x70 [drm]
  release_nodes+0x35/0xb0
  devres_release_group+0xa0/0xf0
  amdgpu_xcp_drv_release+0x3c/0x90 [amdxcp]

shuo
>2024-12-26 16:17:46 [16002.337645]  __do_sys_delete_module.constprop.0+0x176/0x310
>2024-12-26 16:17:46 [16002.344324]  do_syscall_64+0x5d/0x170
>2024-12-26 16:17:46 [16002.348858]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>2024-12-26 16:17:46 [16002.354956] RIP: 0033:0x7f2736a620cb-12-26

>
>Fix it by unplugging xcp drm devices when failed to probe GPU devices.
>
>Signed-off-by: Jiang Liu <gerry at linux.alibaba.com>
>Tested-by: Shuo Liu <shuox.liu at linux.alibaba.com>
>Reviewed-by: Lijo Lazar <lijo.lazar at amd.com>
>---
> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>index d2a046736edd..9ebc0d47d1cb 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>@@ -165,8 +165,10 @@ int amdgpu_driver_load_kms(struct amdgpu_device *adev, unsigned long flags)
> 		DRM_WARN("smart shift update failed\n");
>
> out:
>-	if (r)
>+	if (r) {
>+		amdgpu_xcp_dev_unplug(adev);
> 		amdgpu_driver_unload_kms(dev);
>+	}
>
> 	return r;
> }
>-- 
>2.43.5