REGRESSION drm/amdgpu: Radeon 7900 XTX hang & gpu_sched "Trying to push to a killed entity" since 1f02f2044bda (6.17-rc)
Mikhail Gavrilov
mikhail.v.gavrilov at gmail.com
Mon Aug 18 14:32:10 UTC 2025
Hi Gang,
Somewhere between commits 4b290aae788e and 89748acdf226, my Radeon
7900 XTX started hanging when Steam performs shader compilation, with
the following messages and stack trace:
[ 9254.082549] kworker/u129:2 (15855) used greatest stack depth: 19656 bytes left
[ 9435.589185] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
[ 9435.590340] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
[ 9435.590465] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
[ 9435.590881] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
[ 9435.592513] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
[ 9435.594059] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
[ 9435.596428] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
[ 9435.597828] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
[ 9585.848993] INFO: task kworker/u132:12:18278 blocked for more than 122 seconds.
[ 9585.849006] Tainted: G L ------ --- 6.17.0-0.rc0.250801g89748acdf226.7.fc43.x86_64+debug #1
[ 9585.849010] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9585.849013] task:kworker/u132:12 state:D stack:28600 pid:18278 tgid:18278 ppid:2 task_flags:0x4208060 flags:0x00004000
[ 9585.849022] Workqueue: ttm ttm_bo_delayed_delete [ttm]
[ 9585.849032] Call Trace:
[ 9585.849034] <TASK>
[ 9585.849037] __schedule+0x8d2/0x1be0
[ 9585.849044] ? __pfx___schedule+0x10/0x10
[ 9585.849051] ? __lock_release.isra.0+0x1cb/0x340
[ 9585.849059] schedule+0xd4/0x260
[ 9585.849062] schedule_timeout+0x17f/0x260
[ 9585.849065] ? __pfx_schedule_timeout+0x10/0x10
[ 9585.849067] ? find_held_lock+0x2b/0x80
[ 9585.849074] ? lockdep_hardirqs_on_prepare.part.0+0x92/0x170
[ 9585.849076] ? trace_hardirqs_on+0x18/0x150
[ 9585.849081] dma_fence_default_wait+0x472/0x700
[ 9585.849087] ? find_held_lock+0x2b/0x80
[ 9585.849089] ? __pfx_dma_fence_default_wait+0x10/0x10
[ 9585.849092] ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 9585.849095] ? mark_held_locks+0x40/0x70
[ 9585.849098] ? lockdep_hardirqs_on_prepare.part.0+0x92/0x170
[ 9585.849103] dma_fence_wait_timeout+0x344/0x540
[ 9585.849107] dma_resv_wait_timeout+0xeb/0x190
[ 9585.849111] ? __pfx_dma_resv_wait_timeout+0x10/0x10
[ 9585.849117] ? rcu_is_watching+0x15/0xe0
[ 9585.849122] ttm_bo_delayed_delete+0x34/0x100 [ttm]
[ 9585.849128] process_one_work+0x87a/0x14d0
[ 9585.849140] ? __pfx_process_one_work+0x10/0x10
[ 9585.849145] ? find_held_lock+0x2b/0x80
[ 9585.849153] ? assign_work+0x156/0x390
[ 9585.849161] worker_thread+0x5f2/0xfd0
[ 9585.849172] ? __pfx_worker_thread+0x10/0x10
[ 9585.849175] kthread+0x3b0/0x770
[ 9585.849178] ? local_clock_noinstr+0xf/0x130
[ 9585.849182] ? __pfx_kthread+0x10/0x10
[ 9585.849186] ? rcu_is_watching+0x15/0xe0
[ 9585.849188] ? __pfx_kthread+0x10/0x10
[ 9585.849192] ret_from_fork+0x3ef/0x510
[ 9585.849196] ? __pfx_kthread+0x10/0x10
[ 9585.849198] ? __pfx_kthread+0x10/0x10
[ 9585.849201] ret_from_fork_asm+0x1a/0x30
[ 9585.849210] </TASK>
I can also reproduce the same error when starting a campaign in Halo
Infinite (without relying on Steam’s shader pre-cache UI), which made
bisecting feasible.
1f02f2044bda1db1fd995bc35961ab075fa7b5a2 is the first bad commit
commit 1f02f2044bda1db1fd995bc35961ab075fa7b5a2 (HEAD)
Author: Gang Ba <Gang.Ba at amd.com>
Date: Tue Jul 8 14:36:13 2025 -0400
drm/amdgpu: Avoid extra evict-restore process.
If vm belongs to another process, this is fclose after fork,
wait may enable signaling KFD eviction fence and cause parent
process queue evicted.
[677852.634569] amdkfd_fence_enable_signaling+0x56/0x70 [amdgpu]
[677852.634814] __dma_fence_enable_signaling+0x3e/0xe0
[677852.634820] dma_fence_wait_timeout+0x3a/0x140
[677852.634825] amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
[677852.634831] amdgpu_vm_wait_idle+0x2d/0x60 [amdgpu]
[677852.635026] amdgpu_flush+0x34/0x50 [amdgpu]
[677852.635208] filp_flush+0x38/0x90
[677852.635213] filp_close+0x14/0x30
[677852.635216] do_close_on_exec+0xdd/0x130
[677852.635221] begin_new_exec+0x1da/0x490
[677852.635225] load_elf_binary+0x307/0xea0
[677852.635231] ? srso_alias_return_thunk+0x5/0xfbef5
[677852.635235] ? ima_bprm_check+0xa2/0xd0
[677852.635240] search_binary_handler+0xda/0x260
[677852.635245] exec_binprm+0x58/0x1a0
[677852.635249] bprm_execve.part.0+0x16f/0x210
[677852.635254] bprm_execve+0x45/0x80
[677852.635257] do_execveat_common.isra.0+0x190/0x200
Suggested-by: Christian König <christian.koenig at amd.com>
Signed-off-by: Gang Ba <Gang.Ba at amd.com>
Reviewed-by: Christian König <christian.koenig at amd.com>
Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
Cc: stable at vger.kernel.org
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
Reverting 1f02f2044bda on top of 6.17-rc2 fully eliminates the hang on
my system.
Environment:
GPU: AMD Radeon 7900 XTX
Kernel: 6.17-rc2
Distro: Fedora Rawhide
Hardware probe: https://linux-hardware.org/?probe=99f5cf44a4
Kernel config and full dmesg are attached.
Reproducer (two ways):
1. Launch Steam and trigger shader compilation (automatic background
   pre-cache or any action that compiles shaders).
2. Alternatively, launch Halo Infinite and start a campaign.

Within a short time the GPU hang occurs and gpu_sched prints repeated
"*ERROR* Trying to push to a killed entity" messages, followed by a
blocked ttm_bo_delayed_delete worker as in the trace above.
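
For anyone trying to confirm they are hitting the same symptom, the check
described above can be sketched as a small Python helper that scans saved
dmesg text for the two signatures quoted in this report. This is only a
rough sketch: the names summarize, KILLED_ENTITY and HUNG_TASK are made up
for illustration, and the patterns are assumed from the log excerpt above
rather than taken from any kernel interface.

```python
import re

# Signatures copied from the log excerpt in this report (assumed stable
# only for the kernel build shown there).
KILLED_ENTITY = "Trying to push to a killed entity"
HUNG_TASK = re.compile(
    r"INFO: task (\S+):(\d+) blocked for more than\s+(\d+) seconds"
)


def summarize(dmesg_text):
    """Count killed-entity errors and list hung tasks in dmesg text.

    Hypothetical helper for illustration; not part of any kernel tooling.
    """
    # Mail clients and archives may wrap long log lines, so collapse all
    # whitespace runs into single spaces before matching.
    flat = " ".join(dmesg_text.split())
    killed = flat.count(KILLED_ENTITY)
    hung = [
        (name, int(pid), int(secs))
        for name, pid, secs in HUNG_TASK.findall(flat)
    ]
    return {"killed_entity_errors": killed, "hung_tasks": hung}
```

Feeding it the contents of a saved dmesg file gives a count of the
gpu_sched errors and a list of (task name, pid, seconds blocked) tuples,
which should make it easier to compare a run with and without the revert.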
Impact / notes
This is a runtime GPU hang on a current RDNA3 card; the offending
commit is CC’d to stable, so it would be good to fix or revert before
it propagates.
Thanks for looking into this.
--
Best Regards,
Mike Gavrilov.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: .config.zip
Type: application/zip
Size: 70245 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20250818/0ff2ce50/attachment-0002.zip>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dmesg-6.17.0-0.rc0.250801g89748acdf226.7.fc43.x86_64+debug.zip
Type: application/zip
Size: 51673 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20250818/0ff2ce50/attachment-0003.zip>