[PATCH 2/6] amdgpu: fix invalid memory access in kfd_cleanup_nodes()

Fri Jan 3 17:33:44 UTC 2025

On 1/3/2025 1:05 AM, Gerry Liu wrote:
>
>
>> 2025年1月3日 14:19，Chen, Xiaogang <xiaogang.chen at amd.com> 写道：
>>
>>
>> On 1/2/2025 11:55 PM, Gerry Liu wrote:
>>>
>>>
>>>> 2025年1月3日 13:44，Chen, Xiaogang <xiaogang.chen at amd.com> 写道：
>>>>
>>>>
>>>> On 1/2/2025 8:22 PM, Gerry Liu wrote:
>>>>>
>>>>>
>>>>>> 2025年1月3日 07:08，Chen, Xiaogang <xiaogang.chen at amd.com> 写道：
>>>>>>
>>>>>>
>>>>>> On 1/1/2025 11:36 PM, Jiang Liu wrote:
>>>>>>> On error recover path during device probe, it may trigger invalid
>>>>>>> memory access as below:
>>>>>>> 024-12-25 12:00:53 [ 2703.773040] general protection fault, probably for non-canonical address 0x52445f4749464e4f: 0000 [#1] SMP NOPTI
>>>>>>> 2024-12-25 12:00:53 [ 2703.785199] CPU: 157 PID: 151951 Comm: rmmod Kdump: loaded Tainted: G        W  OE     5.10.134-17.2.al8.x86_64 #1
>>>>>>> 2024-12-25 12:00:53 [ 2703.797515] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
>>>>>>> 2024-12-25 12:00:53 [ 2703.809188] RIP: 0010:kgd2kfd_device_exit+0x6/0x60 [amdgpu]
>>>>>>> 2024-12-25 12:00:53 [ 2703.816136] Code: ff 48 c7 83 38 01 00 00 80 45 e4 c0 c7 83 40 01 00 00 08 0f 00 00 e9 cd fa ff ff 66 0f 1f 84 00 00 00 00 00 0f
>>>>>>> 1f 44 00 00 55 <80> bf 28 01 00 00 00 48 89 fd 75 09 48 89 ef 5d e9 65 df 9d f4 8b
>>>>>>> 2024-12-25 12:00:54 [ 2703.838622] RSP: 0018:ffffb5313df07e10 EFLAGS: 00010202
>>>>>>> 2024-12-25 12:00:54 [ 2703.845216] RAX: 0000000000000000 RBX: ffff97ad689a3ff0 RCX: 0000000080400014
>>>>>>> 2024-12-25 12:00:54 [ 2703.853935] RDX: 0000000080400015 RSI: ffff97ad627e93d8 RDI: 52445f4749464e4f
>>>>>>> 2024-12-25 12:00:54 [ 2703.862652] RBP: ffff97ad689a3ff0 R08: 0000000000000000 R09: ffffffffb5814c00
>>>>>>> 2024-12-25 12:00:54 [ 2703.871368] R10: ffff97ad627e9280 R11: 0000000000000001 R12: ffffb5313df07e98
>>>>>>> 2024-12-25 12:00:54 [ 2703.880068] R13: ffff97ad689a1810 R14: 0000000000000001 R15: 0000000000000000
>>>>>>> 2024-12-25 12:00:54 [ 2703.888768] FS:  00007fa4db81e740(0000) GS:ffff98a93ec80000(0000) knlGS:0000000000000000
>>>>>>> 2024-12-25 12:00:54 [ 2703.898547] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>> 2024-12-25 12:00:54 [ 2703.905684] CR2: 00007f4502deca00 CR3: 000001008fc50001 CR4: 0000000002770ee0
>>>>>>> 2024-12-25 12:00:54 [ 2703.914382] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>> 2024-12-25 12:00:54 [ 2703.923066] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>>>>>>> 2024-12-25 12:00:54 [ 2703.931746] PKRU: 55555554
>>>>>>> 2024-12-25 12:00:54 [ 2703.935444] Call Trace:
>>>>>>> 2024-12-25 12:00:54 [ 2703.938962]  amdgpu_amdkfd_device_fini_sw+0x1a/0x40 [amdgpu]
>>>>>>> 2024-12-25 12:00:54 [ 2703.946080]  amdgpu_device_ip_fini.isra.0+0x3d/0x1b0 [amdgpu]
>>>>>>> 2024-12-25 12:00:54 [ 2703.953278]  amdgpu_device_fini_sw+0x2a/0x240 [amdgpu]
>>>>>>> 2024-12-25 12:00:54 [ 2703.959789]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
>>>>>>> 2024-12-25 12:00:54 [ 2703.966501]  devm_drm_dev_init_release+0x42/0x70 [drm]
>>>>>>> 2024-12-25 12:00:54 [ 2703.972891]  release_nodes+0x6e/0xb0
>>>>>>> 2024-12-25 12:00:54 [ 2703.977522]  amdgpu_xcp_drv_release+0x38/0x80 [amdxcp]
>>>>>>> 2024-12-25 12:00:54 [ 2703.983906]  __do_sys_delete_module.constprop.0+0x138/0x2a0
>>>>>>> 2024-12-25 12:00:54 [ 2703.990775]  ? exit_to_user_mode_loop+0xd6/0x120
>>>>>>> 2024-12-25 12:00:54 [ 2703.996563]  do_syscall_64+0x2e/0x50
>>>>>>> 2024-12-25 12:00:54 [ 2704.001166]  entry_SYSCALL_64_after_hwframe+0x62/0xc7
>>>>>>> 2024-12-25 12:00:54 [ 2704.007432] RIP: 0033:0x7fa4db2620cb
>>>>>>> 2024-12-25 12:00:54 [ 2704.012025] Code: 73 01 c3 48 8b 0d a5 6d 19 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0
>>>>>>> 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 6d 19 00 f7 d8 64 89 01 48
>>>>>>>
>>>>>>> Signed-off-by: Jiang Liu<gerry at linux.alibaba.com>
>>>>>>> ---
>>>>>>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 7 ++++++-
>>>>>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>> index b6c5ffd4630b..58c1b5fcf785 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>> @@ -663,6 +663,8 @@ static void kfd_cleanup_nodes(struct kfd_dev *kfd, unsigned int num_nodes)
>>>>>>>   
>>>>>>>   	for (i = 0; i < num_nodes; i++) {
>>>>>>>   		knode = kfd->nodes[i];
>>>>>>> +		if (!knode)
>>>>>>> +			continue;
>>>>>>
>>>>>> I wonder how knode can be null here? kfd_cleanup_nodes is called 
>>>>>> by kgd2kfd_device_exit under condition if (kfd->init_complete). 
>>>>>> kfd->init_complete is true only after all kfd node got 
>>>>>> initialized at kgd2kfd_device_init. If kfd driver init fail 
>>>>>> kfd->init_complete would be false, then kfd_cleanup_node would 
>>>>>> not get called.
>>>>>>
>>>>> Hi Xiaogang,
>>>>> It may get triggered on error handling path of 
>>>>> ‘kid2kfd_device_init()`, when it jump to label `node_alloc_error` and
>>>>> then call `kfd_cleanup_nodes()`.
>>>>>
>>>> If it was the case kzalloc failed. That means there is no memory 
>>>> available even to allocate struct kfd_node. From the backtrace the 
>>>> general protection fault happened at
>>>>
>>>> RIP: 0010:kgd2kfd_device_exit+0x6/0x60 [amdgpu]
>>>>
>>>> It happened during driver module got released, not init time. I do not see how the patch is related to the issue you talked.
>>> Aha, it’s caused by another bug which got fixed by "[PATCH 4/6] 
>>> amdgpu: fix use after free bug related to amdgpu_driver_release_kms()”.
>>> Without the fix in "[PATCH 4/6] amdgpu: fix use after free bug 
>>> related to amdgpu_driver_release_kms()”, kgd2kfd_device_exit() will 
>>> got called
>>> twice in case of device probe failure. I caused me some time to 
>>> figure out the double free issue.
>>> Actually we should reset kfd->init_completed to false in function 
>>> kgd2kfd_device_exit().
>>
>> We can set kfd->init_completed = false at end of kgd2kfd_device_exit, 
>> but how kgd2kfd_device_exit was called two times? is there another 
>> bug caused that?
>>
> I guess it caused by another bug related to the way amdgpu cooperates 
> with the amdgpu_xcp driver. It would be better to enhance amdgpu_xcp 
> driver either.

kfd driver has considered which kfd nodes got initialized and release 
them accordingly. From what saw here seems you may mix different issues 
or not target the real issue. Let's have backtrace match the changes.

Regards

Xiaogang

>
>> Regards
>>
>> Xiaogang
>>
>>
>>>
>>>> Regards
>>>> Xiaogang
>>>>
>>>>
>>>>
>>>>> Thanks,
>>>>> Gerry
>>>>>
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Xiaogang
>>>>>>
>>>>>>>   		device_queue_manager_uninit(knode->dqm);
>>>>>>>   		kfd_interrupt_exit(knode);
>>>>>>>   		kfd_topology_remove_device(knode);
>>>>>>> @@ -954,7 +956,10 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd)
>>>>>>>   		kfd_doorbell_fini(kfd);
>>>>>>>   		ida_destroy(&kfd->doorbell_ida);
>>>>>>>   		kfd_gtt_sa_fini(kfd);
>>>>>>> -		amdgpu_amdkfd_free_gtt_mem(kfd->adev, &kfd->gtt_mem);
>>>>>>> +		if (kfd->gtt_mem) {
>>>>>>> +			amdgpu_amdkfd_free_gtt_mem(kfd->adev, &kfd->gtt_mem);
>>>>>>> +			kfd->gtt_mem = NULL;
>>>>>>> +		}
>>>>>>>   	}
>>>>>>>   
>>>>>>>   	kfree(kfd);
>>>>>
>>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20250103/5ee4b370/attachment-0001.htm>