REGRESSION drm/amdgpu: Radeon 7900 XTX hang & gpu_sched "Trying to push to a killed entity" since 1f02f2044bda (6.17-rc)
Alex Deucher
alexdeucher at gmail.com
Mon Aug 18 14:40:49 UTC 2025
On Mon, Aug 18, 2025 at 10:32 AM Mikhail Gavrilov
<mikhail.v.gavrilov at gmail.com> wrote:
>
> Hi Gang,
>
> Between commits 4b290aae788e and 89748acdf226 my Radeon 7900 XTX
> starts hanging when Steam performs shader compilation, with the
> following messages/stack trace:
>
> [ 9254.082549] kworker/u129:2 (15855) used greatest stack depth: 19656 bytes left
> [ 9435.589185] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
> [ 9435.590340] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
> [ 9435.590465] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
> [ 9435.590881] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
> [ 9435.592513] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
> [ 9435.594059] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
> [ 9435.596428] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
> [ 9435.597828] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
> [ 9585.848993] INFO: task kworker/u132:12:18278 blocked for more than 122 seconds.
> [ 9585.849006] Tainted: G L ------ --- 6.17.0-0.rc0.250801g89748acdf226.7.fc43.x86_64+debug #1
> [ 9585.849010] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 9585.849013] task:kworker/u132:12 state:D stack:28600 pid:18278 tgid:18278 ppid:2 task_flags:0x4208060 flags:0x00004000
> [ 9585.849022] Workqueue: ttm ttm_bo_delayed_delete [ttm]
> [ 9585.849032] Call Trace:
> [ 9585.849034] <TASK>
> [ 9585.849037] __schedule+0x8d2/0x1be0
> [ 9585.849044] ? __pfx___schedule+0x10/0x10
> [ 9585.849051] ? __lock_release.isra.0+0x1cb/0x340
> [ 9585.849059] schedule+0xd4/0x260
> [ 9585.849062] schedule_timeout+0x17f/0x260
> [ 9585.849065] ? __pfx_schedule_timeout+0x10/0x10
> [ 9585.849067] ? find_held_lock+0x2b/0x80
> [ 9585.849074] ? lockdep_hardirqs_on_prepare.part.0+0x92/0x170
> [ 9585.849076] ? trace_hardirqs_on+0x18/0x150
> [ 9585.849081] dma_fence_default_wait+0x472/0x700
> [ 9585.849087] ? find_held_lock+0x2b/0x80
> [ 9585.849089] ? __pfx_dma_fence_default_wait+0x10/0x10
> [ 9585.849092] ? __pfx_dma_fence_default_wait_cb+0x10/0x10
> [ 9585.849095] ? mark_held_locks+0x40/0x70
> [ 9585.849098] ? lockdep_hardirqs_on_prepare.part.0+0x92/0x170
> [ 9585.849103] dma_fence_wait_timeout+0x344/0x540
> [ 9585.849107] dma_resv_wait_timeout+0xeb/0x190
> [ 9585.849111] ? __pfx_dma_resv_wait_timeout+0x10/0x10
> [ 9585.849117] ? rcu_is_watching+0x15/0xe0
> [ 9585.849122] ttm_bo_delayed_delete+0x34/0x100 [ttm]
> [ 9585.849128] process_one_work+0x87a/0x14d0
> [ 9585.849140] ? __pfx_process_one_work+0x10/0x10
> [ 9585.849145] ? find_held_lock+0x2b/0x80
> [ 9585.849153] ? assign_work+0x156/0x390
> [ 9585.849161] worker_thread+0x5f2/0xfd0
> [ 9585.849172] ? __pfx_worker_thread+0x10/0x10
> [ 9585.849175] kthread+0x3b0/0x770
> [ 9585.849178] ? local_clock_noinstr+0xf/0x130
> [ 9585.849182] ? __pfx_kthread+0x10/0x10
> [ 9585.849186] ? rcu_is_watching+0x15/0xe0
> [ 9585.849188] ? __pfx_kthread+0x10/0x10
> [ 9585.849192] ret_from_fork+0x3ef/0x510
> [ 9585.849196] ? __pfx_kthread+0x10/0x10
> [ 9585.849198] ? __pfx_kthread+0x10/0x10
> [ 9585.849201] ret_from_fork_asm+0x1a/0x30
> [ 9585.849210] </TASK>
>
> I can also reproduce the same error when starting a campaign in Halo
> Infinite (without relying on Steam’s shader pre-cache UI), which made
> bisecting feasible.
>
> 1f02f2044bda1db1fd995bc35961ab075fa7b5a2 is the first bad commit
> commit 1f02f2044bda1db1fd995bc35961ab075fa7b5a2 (HEAD)
> Author: Gang Ba <Gang.Ba at amd.com>
> Date: Tue Jul 8 14:36:13 2025 -0400
>
> drm/amdgpu: Avoid extra evict-restore process.
>
> If vm belongs to another process, this is fclose after fork,
> wait may enable signaling KFD eviction fence and cause parent
> process queue evicted.
>
> [677852.634569] amdkfd_fence_enable_signaling+0x56/0x70 [amdgpu]
> [677852.634814] __dma_fence_enable_signaling+0x3e/0xe0
> [677852.634820] dma_fence_wait_timeout+0x3a/0x140
> [677852.634825] amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
> [677852.634831] amdgpu_vm_wait_idle+0x2d/0x60 [amdgpu]
> [677852.635026] amdgpu_flush+0x34/0x50 [amdgpu]
> [677852.635208] filp_flush+0x38/0x90
> [677852.635213] filp_close+0x14/0x30
> [677852.635216] do_close_on_exec+0xdd/0x130
> [677852.635221] begin_new_exec+0x1da/0x490
> [677852.635225] load_elf_binary+0x307/0xea0
> [677852.635231] ? srso_alias_return_thunk+0x5/0xfbef5
> [677852.635235] ? ima_bprm_check+0xa2/0xd0
> [677852.635240] search_binary_handler+0xda/0x260
> [677852.635245] exec_binprm+0x58/0x1a0
> [677852.635249] bprm_execve.part.0+0x16f/0x210
> [677852.635254] bprm_execve+0x45/0x80
> [677852.635257] do_execveat_common.isra.0+0x190/0x200
>
> Suggested-by: Christian König <christian.koenig at amd.com>
> Signed-off-by: Gang Ba <Gang.Ba at amd.com>
> Reviewed-by: Christian König <christian.koenig at amd.com>
> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
> Cc: stable at vger.kernel.org
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> Reverting 1f02f2044bda on top of 6.17-rc2 fully eliminates the hang on
> my system.

Should be fixed in:
https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=aa5fc4362fac9351557eb27c745579159a2e4520

Alex

>
> Environment:
> GPU: AMD Radeon 7900 XTX
> Kernel: 6.17-rc2
> Distro: Fedora Rawhide
> Hardware probe: https://linux-hardware.org/?probe=99f5cf44a4
> Kernel config and full dmesg are attached.
>
> Reproducer (two ways):
> 1. Launch Steam and trigger shader compilation (automatic background
>    pre-cache or any action that compiles shaders).
> 2. Alternatively, launch Halo Infinite and start a campaign. Within a
>    short time the GPU hang occurs and gpu_sched prints repeated
>    *ERROR* Trying to push to a killed entity, followed by a blocked
>    ttm_bo_delayed_delete worker as in the trace above.
>
> Impact / notes
> This is a runtime GPU hang on a current RDNA3 card; the offending
> commit is CC’d to stable, so it would be good to fix or revert before
> it propagates.
>
> Thanks for looking into this.
>
> --
> Best Regards,
> Mike Gavrilov.