[PATCH 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms()
Chen, Xiaogang
xiaogang.chen at amd.com
Fri Jan 3 17:34:13 UTC 2025
On 1/3/2025 1:43 AM, Shuo Liu wrote:
> On Fri 3.Jan'25 at 15:02:38 +0800, Gerry Liu wrote:
>>
>>
>>> 2025年1月3日 13:58,Chen, Xiaogang <xiaogang.chen at amd.com> 写道:
>>>
>>>
>>>
>>> On 1/1/2025 11:36 PM, Jiang Liu wrote:
>>>> If some GPU device failed to probe, `rmmod amdgpu` will trigger a use
>>>> after free bug related to amdgpu_driver_release_kms() as:
>>>> 2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer
>>>> dereference, address: 0000000000000000
>>>> 2024-12-26 16:17:45 [16002.093792] #PF: supervisor read access in
>>>> kernel mode
>>>> 2024-12-26 16:17:45 [16002.099993] #PF: error_code(0x0000) -
>>>> not-present page
>>>> 2024-12-26 16:17:45 [16002.106188] PGD 0 P4D 0
>>>> 2024-12-26 16:17:45 [16002.109464] Oops: Oops: 0000 [#1] PREEMPT
>>>> SMP NOPTI
>>>> 2024-12-26 16:17:45 [16002.115372] CPU: 2 PID: 14375 Comm: rmmod
>>>> Kdump: loaded Tainted: G W E 6.10.0+ #2
>>>> 2024-12-26 16:17:45 [16002.125577] Hardware name: Alibaba Alibaba
>>>> Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
>>>> 2024-12-26 16:17:45 [16002.136858] RIP:
>>>> 0010:drm_sched_fini+0x3f/0xe0 [gpu_sched]
>>>> 2024-12-26 16:17:45 [16002.143463] Code: 60 c6 87 be 00 00 00 01 e8
>>>> ce e0 90 d8 48 8d bb 80 00 00 00 e8 c2 e0 90 d8 8b 43 20 85 c0 74
>>>> 51 45 31 e4 48 8b
>>>> 43 28 4d 63 ec <4a> 8b 2c e8 48 89 ef e8 f5 0e 59 d9 48 8b 45 10 48
>>>> 8d 55 10 48 39
>>>> 2024-12-26 16:17:45 [16002.164992] RSP: 0018:ffffb570dbbb7da8
>>>> EFLAGS: 00010246
>>>> 2024-12-26 16:17:45 [16002.171316] RAX: 0000000000000000 RBX:
>>>> ffff96b0fdadc878 RCX: 0000000000000000
>>>> 2024-12-26 16:17:46 [16002.179784] RDX: 000fffffffe00000 RSI:
>>>> 0000000000000000 RDI: ffff96b0fdadc8f8
>>>> 2024-12-26 16:17:46 [16002.188252] RBP: ffff96b0fdadc800 R08:
>>>> ffff97abbd035040 R09: ffffffff9ac52540
>>>> 2024-12-26 16:17:46 [16002.196722] R10: 0000000000000000 R11:
>>>> 0000000000000000 R12: 0000000000000000
>>>> 2024-12-26 16:17:46 [16002.205179] R13: 0000000000000000 R14:
>>>> ffff96b0fdadfc00 R15: 0000000000000000
>>>> 2024-12-26 16:17:46 [16002.213648] FS: 00007f2737000740(0000)
>>>> GS:ffff97abbd100000(0000) knlGS:0000000000000000
>>>> 2024-12-26 16:17:46 [16002.223189] CS: 0010 DS: 0000 ES: 0000 CR0:
>>>> 0000000080050033
>>>> 2024-12-26 16:17:46 [16002.230103] CR2: 0000000000000000 CR3:
>>>> 000000011be3a005 CR4: 0000000000f70ef0
>>>> 2024-12-26 16:17:46 [16002.238581] DR0: 0000000000000000 DR1:
>>>> 0000000000000000 DR2: 0000000000000000
>>>> 2024-12-26 16:17:46 [16002.247053] DR3: 0000000000000000 DR6:
>>>> 00000000fffe07f0 DR7: 0000000000000400
>>>> e024se+0x3c/0x90 [amdxcp]
>>>> 2024-12-26 16:17:46 [16002.337645]
>>>> __do_sys_delete_module.constprop.0+0x176/0x310
>>>> 2024-12-26 16:17:46 [16002.344324] do_syscall_64+0x5d/0x170
>>>> 2024-12-26 16:17:46 [16002.348858]
>>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>> 2024-12-26 16:17:46 [16002.354956] RIP: 0033:0x7f2736a620cb-12-26
>>>>
>>>> Fix it by unplugging xcp drm devices when failed to probe GPU devices.
>>>>
>>>> Signed-off-by: Jiang Liu <gerry at linux.alibaba.com>
>>>> <mailto:gerry at linux.alibaba.com>
>>>> Tested-by: Shuo Liu <shuox.liu at linux.alibaba.com>
>>>> <mailto:shuox.liu at linux.alibaba.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 4 +++-
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c | 1 +
>>>> 2 files changed, 4 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> index 5ffe1dad9622..e7f35e3a6d2d 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> @@ -164,8 +164,10 @@ int amdgpu_driver_load_kms(struct
>>>> amdgpu_device *adev, unsigned long flags)
>>>> DRM_WARN("smart shift update failed\n");
>>>>
>>>> out:
>>>> - if (r)
>>>> + if (r) {
>>>> + amdgpu_xcp_dev_unplug(adev);
>>>> amdgpu_driver_unload_kms(dev);
>>>> + }
>>>>
>>> I wonder if you can call amdgpu_xcp_dev_unplug here. It will call
>>> drm_dev_unplug that unplugs a hotpluggable DRM device. In you case
>>> amdgpu_driver_load_kms failed during probe time. We need unwind
>>> amdgpu_driver_load_kms. Why need do unplug a DRM device?
>>>
>>> The backtrace show:
>>>
>>> 2024-12-26 16:17:45 [16002.136858] RIP:
>>> 0010:drm_sched_fini+0x3f/0xe0 [gpu_sched]
>>> has:
>>>
>>> 2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer
>>> dereference, address: 0000000000000000
>>> How this change is related to the invalid access above? Can you
>>> explain more?
>>>
>> Per my understanding, `amdgpu_xcp_dev_unplug(adev)` is a workaround
>> for design flaw of the amdgpu_xcp driver.
>> Currently the amdgpu_xcp driver assumes everything will go as
>> expected and there’s no failure or GPU hot-lug operations.
>> So it only provides an interface `amdgpu_xcp_drm_dev_alloc()` to
>> create xcp devices for a GPU instance, and another interface
>> `amdgpu_xcp_drv_release()` to destroy xcp devices for GPU instances.
>> There’s no interface to undo what done by `amdgpu_xcp_drm_dev_alloc()`.
>> And I found `amdgpu_xcp_dev_unplug(adev)` can undo work done by
>> `amdgpu_xcp_drm_dev_alloc()`.
>>
>> So it’s a workaround, the fundamental solution should be to enhance
>> amdgpu_xcp driver to provide interface to unroll work
>> done by `amdgpu_xcp_drm_dev_alloc()`.
> Agree. Actually, some ops of amdgpu_partition_driver cannot be reused
> directly by amd_xcp drm device. DRM doesn't use its minor->dev as
> param of such callbacks, just like .release(). When
> amdgpu_driver_release_kms() use a amd_xcp drm dev to find the `struct
> amdgpu_device *adev`, unexcepted memory accesses.
I understand the issue is from xcp driver, but I do not see it is a
right workaround. The solution should be unwinding
amdgpu_xcp_drm_dev_alloc(), not unplug drm device though part of
amdgpu_xcp_dev_unplug(adev) may meet what you want.
still, the bracktrace has
2024-12-26 16:17:45 [16002.136858] RIP: 0010:drm_sched_fini+0x3f/0xe0
[gpu_sched]
How the root cause of that is from xcp driver?
Regards
Xiaogang
>
> shuo
>>
>> The code paths to trigger the use after free are:
>> 1) in function amdgpu_xcp_dev_alloc(), we allocate a drm_device by
>> calling amdgpu_xcp_drm_dev_alloc(), and then change the device’s
>> driver to amdgpu_partition_driver.
>> 2) in function amdgpu_xcp_dev_unplug(), it restores drm_device’s
>> driver to the original device driver.
>>
>> In function amdgpu_driver_load_kms(), if we don’t call
>> amdgpu_xcp_dev_unplug() on error recover path, the
>> `xcp_dev[idx].driver` will still points to amdgpu_partition_driver.
>> Later when amdgpu_xcp_drv_release() gets called, it will call
>> platform_device_unregister() -> amdgpu_partition_driver.release ->
>> amdgpu_driver_release_kms().
>> Here when amdgpu_driver_release_kms() gets called, the corresponding
>> amdgpu_device object has already been released on error recovery path
>> in function amdgpu_pci_probe().
>>
>> So just like amdgpu_pci_remove(), we call amdgpu_xcp_gpu_dev_unplug()
>> before calling amdkcl_drm_dev_release().
>>> Regards
>>>
>>> Xiaogang
>>>
>>>
>>>
>>>
>>>
>>>> return r;
>>>> }
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>>>> index a6d456ec6aeb..ef4eaacf67f6 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>>>> @@ -382,6 +382,7 @@ void amdgpu_xcp_dev_unplug(struct amdgpu_device
>>>> *adev)
>>>> p_ddev->primary->dev = adev->xcp_mgr->xcp[i].pdev;
>>>> p_ddev->driver = adev->xcp_mgr->xcp[i].driver;
>>>> p_ddev->vma_offset_manager =
>>>> adev->xcp_mgr->xcp[i].vma_offset_manager;
>>>> + adev->xcp_mgr->xcp[i].ddev = NULL;
>>>> }
>>>> }
>>>>
>>
More information about the amd-gfx
mailing list