[PATCH] drm/amdgpu: delete function amdgpu_flush
Christian König
christian.koenig at amd.com
Thu Jul 3 07:16:56 UTC 2025
On 02.07.25 18:40, Philip Yang wrote:
>
> On 2025-07-01 03:28, Christian König wrote:
>> Clear NAK to removing this!
>>
>> The amdgpu_flush function is vital for correct operation.
> There is no explicit flush call from libdrm/amdgpu, so amdgpu_flush is only reached via fclose -> filp_flush
>> The intention is to block closing the file handle in child processes and wait for all previous operations to complete.
>
> A child process cannot share an amdgpu vm with its parent; a child process should open the drm file node to create and use a new amdgpu vm. Can you elaborate on the intention: why should a child process closing the inherited drm file handle wait for the parent process's operations to complete?
No, that goes too deep and I really don't have time for that.
Regards,
Christian.
>
> Regards,
>
> Philip
>
>>
>> Regards,
>> Christian.
>>
>> On 01.07.25 07:35, YuanShang Mao (River) wrote:
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>> @Yang, Philip
>>>> I notice KFD has another, different issue with fclose -> amdgpu_flush:
>>>> fork evicts the parent process's queues when the child process closes
>>>> the inherited drm node file handle. amdgpu_flush signals the parent
>>>> process's KFD eviction fence added to the vm root bo resv, which causes
>>>> a performance drop if a python application uses lots of popen.
>>> Yes. Closing the inherited drm node file handle will evict the parent process's queues, since drm shares the vm with kfd.
>>>
>>>> The function amdgpu_ctx_mgr_entity_flush is only called by amdgpu_flush,
>>>> so it can be removed too.
>>> Sure. If we decide to remove amdgpu_flush.
>>>
>>> @Koenig, Christian @Deucher, Alexander, do you have any concern on removal of amdgpu_flush?
>>>
>>> Thanks
>>> River
>>>
>>>
>>> -----Original Message-----
>>> From: Yang, Philip <Philip.Yang at amd.com>
>>> Sent: Friday, June 27, 2025 10:44 PM
>>> To: YuanShang Mao (River) <YuanShang.Mao at amd.com>; amd-gfx at lists.freedesktop.org
>>> Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin at amd.com>; cao, lin <lin.cao at amd.com>; Deng, Emily <Emily.Deng at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: delete function amdgpu_flush
>>>
>>>
>>> On 2025-06-27 01:20, YuanShang Mao (River) wrote:
>>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>>
>>>> Currently, amdgpu_flush is used to prevent new jobs from being submitted in the same context when a file descriptor is closed and to wait for existing jobs to complete. Additionally, if the current process is in an exit state and the latest job of the entity was submitted by this process, the entity is terminated.
>>>>
>>>> There is an issue where, if the drm scheduler is not woken up for some reason, amdgpu_flush will hang, and another process holding this file cannot submit a job to wake the scheduler.
>>> I notice KFD has another, different issue with fclose -> amdgpu_flush:
>>> fork evicts the parent process's queues when the child process closes
>>> the inherited drm node file handle. amdgpu_flush signals the parent
>>> process's KFD eviction fence added to the vm root bo resv, which causes
>>> a performance drop if a python application uses lots of popen.
>>>
>>> [677852.634569] amdkfd_fence_enable_signaling+0x56/0x70 [amdgpu]
>>> [677852.634814] __dma_fence_enable_signaling+0x3e/0xe0
>>> [677852.634820] dma_fence_wait_timeout+0x3a/0x140
>>> [677852.634825] amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
>>> [677852.634831] amdgpu_vm_wait_idle+0x2d/0x60 [amdgpu]
>>> [677852.635026] amdgpu_flush+0x34/0x50 [amdgpu]
>>> [677852.635208] filp_flush+0x38/0x90
>>> [677852.635213] filp_close+0x14/0x30
>>> [677852.635216] do_close_on_exec+0xdd/0x130
>>> [677852.635221] begin_new_exec+0x1da/0x490
>>> [677852.635225] load_elf_binary+0x307/0xea0
>>> [677852.635231] ? srso_alias_return_thunk+0x5/0xfbef5
>>> [677852.635235] ? ima_bprm_check+0xa2/0xd0
>>> [677852.635240] search_binary_handler+0xda/0x260
>>> [677852.635245] exec_binprm+0x58/0x1a0
>>> [677852.635249] bprm_execve.part.0+0x16f/0x210
>>> [677852.635254] bprm_execve+0x45/0x80
>>> [677852.635257] do_execveat_common.isra.0+0x190/0x200
>>>
>>>> The intended purpose of the flush operation in Linux is to flush content written by the current process to the hardware, not to shut down related services on the process's exit, which would prevent other processes from using them. Currently, amdgpu_flush also cannot run concurrently with the command submission ioctl, which leads to further performance degradation.
>>> fclose -> filp_flush -> fput: if fput releases the last reference to the
>>> drm node file handle, it calls amdgpu_driver_postclose_kms ->
>>> amdgpu_ctx_mgr_fini, which flushes the entities, so amdgpu_flush is not needed.
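The flush-vs-release distinction Philip relies on here is generic Linux file semantics, visible from plain user space: the driver's .flush op runs on every close() of a struct file (including a forked child closing an inherited descriptor), while .release runs only when the last reference is dropped. A minimal sketch of that behavior, not amdgpu-specific (using /dev/null rather than a drm node):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 0 on success. Demonstrates that a child's close() of an
 * inherited fd (which triggers the kernel-side filp_flush path) does
 * not invalidate the parent's descriptor: .release only runs when the
 * last reference to the shared struct file is dropped. */
static int demo(void)
{
	int fd = open("/dev/null", O_WRONLY);
	if (fd < 0)
		return -1;

	pid_t pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* The child inherits the same struct file; close() here goes
		 * through filp_close -> filp_flush on that shared file, the
		 * same path that invokes a driver's .flush op. */
		close(fd);
		_exit(0);
	}

	int status;
	if (waitpid(pid, &status, 0) < 0)
		return -1;

	/* The parent's descriptor still works after the child's close. */
	if (write(fd, "x", 1) != 1)
		return -1;
	close(fd);
	return 0;
}
```

This is why dropping .flush is safe for the final-close case: the real teardown already happens in the .release path (drm_release -> amdgpu_driver_postclose_kms) when the last reference goes away.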
>>>
>>> I thought about adding a workaround to skip amdgpu_flush for KFD if
>>> (vm->task_info->tgid != current->group_leader->pid), but this patch
>>> fixes both gfx and KFD, killing two birds with one stone.
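For reference, the narrower workaround Philip describes might have looked roughly like the sketch below, applied to the amdgpu_flush body shown in the patch further down. This is an untested sketch reconstructed from this thread; the task_info field access is an assumption, not verified code.

```c
/* Hypothetical sketch of the KFD-only workaround (NOT the applied fix):
 * skip the flush when the closing task is not the process that owns the
 * VM, e.g. a forked child closing an inherited drm fd. */
static int amdgpu_flush(struct file *f, fl_owner_t id)
{
	struct drm_file *file_priv = f->private_data;
	struct amdgpu_fpriv *fpriv = file_priv->driver_priv;
	long timeout = MAX_WAIT_SCHED_ENTITY_Q_EMPTY;

	/* Assumed check from the thread: a task that merely inherited
	 * the fd should not wait on (or signal) the owner's fences. */
	if (fpriv->vm.task_info &&
	    fpriv->vm.task_info->tgid != current->group_leader->pid)
		return 0;

	timeout = amdgpu_ctx_mgr_entity_flush(&fpriv->ctx_mgr, timeout);
	timeout = amdgpu_vm_wait_idle(&fpriv->vm, timeout);

	return timeout >= 0 ? 0 : timeout;
}
```

The posted patch instead removes amdgpu_flush entirely, which covers both the gfx hang and the KFD eviction case at once.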
>>>
>>> The function amdgpu_ctx_mgr_entity_flush is only called by amdgpu_flush,
>>> so it can be removed too.
>>>
>>> Regards,
>>>
>>> Philip
>>>
>>>> An example of a shared DRM file is when systemd stops the display manager; systemd will close the file descriptor for Xorg that it holds.
>>>>
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu: amdgpu_ctx_get: locked by other task times 8811
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu: owner stack:
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: task:(sd-rmrf) state:D stack:0 pid:3407 tgid:3407 ppid:1 flags:0x00004002
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: Call Trace:
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: <TASK>
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: __schedule+0x279/0x6b0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: schedule+0x29/0xd0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: amddrm_sched_entity_flush+0x13e/0x270 [amd_sched]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu_ctx_mgr_entity_flush+0xd6/0x200 [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu_flush+0x29/0x50 [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: filp_flush+0x38/0x90
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: filp_close+0x14/0x30
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: __close_range+0x1b0/0x230
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: __x64_sys_close_range+0x17/0x30
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: x64_sys_call+0x1e0f/0x25f0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: do_syscall_64+0x7e/0x170
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __count_memcg_events+0x86/0x160
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? count_memcg_events.constprop.0+0x2a/0x50
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? handle_mm_fault+0x1df/0x2d0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? do_user_addr_fault+0x5d5/0x870
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? irqentry_exit_to_user_mode+0x43/0x250
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? irqentry_exit+0x43/0x50
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? exc_page_fault+0x96/0x1c0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RIP: 0033:0x762b6df1677b
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RSP: 002b:00007ffdb20ad718 EFLAGS: 00000246 ORIG_RAX: 00000000000001b4
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000762b6df1677b
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RDX: 0000000000000000 RSI: 000000007fffffff RDI: 0000000000000003
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RBP: 00007ffdb20ad730 R08: 0000000000000000 R09: 0000000000000000
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000007
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: </TASK>
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu: current stack:
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: task:Xorg state:R running task stack:0 pid:2343 tgid:2343 ppid:2341 flags:0x00000008
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: Call Trace:
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: <TASK>
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: sched_show_task+0x122/0x180
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu_ctx_get+0xf6/0x120 [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu_cs_ioctl+0xb6/0x2110 [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? update_cfs_group+0x111/0x120
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? enqueue_entity+0x3a6/0x550
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: drm_ioctl_kernel+0xbc/0x120
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: drm_ioctl+0x2f6/0x5b0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: amdgpu_drm_ioctl+0x4e/0x90 [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: __x64_sys_ioctl+0xa3/0xf0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: x64_sys_call+0x11ad/0x25f0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: do_syscall_64+0x7e/0x170
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? ksys_read+0xe6/0x100
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? idr_find+0xf/0x20
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? drm_syncobj_array_free+0x5a/0x80
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? drm_syncobj_reset_ioctl+0xbd/0xd0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __pfx_drm_syncobj_reset_ioctl+0x10/0x10
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? drm_ioctl_kernel+0xbc/0x120
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __check_object_size.part.0+0x3a/0x150
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? _copy_to_user+0x41/0x60
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? drm_ioctl+0x326/0x5b0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __pfx_drm_syncobj_reset_ioctl+0x10/0x10
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? kvm_clock_get_cycles+0x18/0x40
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __pm_runtime_suspend+0x7b/0xd0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? amdgpu_drm_ioctl+0x70/0x90 [amdgpu]
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? __x64_sys_ioctl+0xbb/0xf0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? syscall_exit_to_user_mode+0x4e/0x250
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? do_syscall_64+0x8a/0x170
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? do_syscall_64+0x8a/0x170
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? syscall_exit_to_user_mode+0x4e/0x250
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? do_syscall_64+0x8a/0x170
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? do_syscall_64+0x8a/0x170
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? srso_return_thunk+0x5/0x5f
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? do_syscall_64+0x8a/0x170
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: ? sysvec_apic_timer_interrupt+0x57/0xc0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RIP: 0033:0x7156c3524ded
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RSP: 002b:00007ffe4afcc410 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RAX: ffffffffffffffda RBX: 0000578954b74cf8 RCX: 00007156c3524ded
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RDX: 00007ffe4afcc4f0 RSI: 00000000c0186444 RDI: 0000000000000012
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: RBP: 00007ffe4afcc460 R08: 00007ffe4afcc7a0 R09: 00007ffe4afcc4b0
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: R10: 0000578954b862f0 R11: 0000000000000246 R12: 00000000c0186444
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: R13: 0000000000000012 R14: 0000000000000060 R15: 0000578954b46380
>>>> Jun 11 16:24:24 ubuntu2404-2 kernel: </TASK>
>>>>
>>>> Signed-off-by: YuanShang <YuanShang.Mao at amd.com>
>>>>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 13 -------------
>>>> 1 file changed, 13 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> index 2bb02fe9c880..ee6b59bfd798 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> @@ -2947,22 +2947,9 @@ static const struct dev_pm_ops amdgpu_pm_ops = {
>>>> .runtime_idle = amdgpu_pmops_runtime_idle,
>>>> };
>>>>
>>>> -static int amdgpu_flush(struct file *f, fl_owner_t id)
>>>> -{
>>>> - struct drm_file *file_priv = f->private_data;
>>>> - struct amdgpu_fpriv *fpriv = file_priv->driver_priv;
>>>> - long timeout = MAX_WAIT_SCHED_ENTITY_Q_EMPTY;
>>>> -
>>>> - timeout = amdgpu_ctx_mgr_entity_flush(&fpriv->ctx_mgr, timeout);
>>>> - timeout = amdgpu_vm_wait_idle(&fpriv->vm, timeout);
>>>> -
>>>> - return timeout >= 0 ? 0 : timeout;
>>>> -}
>>>> -
>>>> static const struct file_operations amdgpu_driver_kms_fops = {
>>>> .owner = THIS_MODULE,
>>>> .open = drm_open,
>>>> - .flush = amdgpu_flush,
>>>> .release = drm_release,
>>>> .unlocked_ioctl = amdgpu_drm_ioctl,
>>>> .mmap = drm_gem_mmap,
>>>> --
>>>> 2.25.1
>>>>