REGRESSION drm/amdgpu: Radeon 7900 XTX hang & gpu_sched "Trying to push to a killed entity" since 1f02f2044bda (6.17-rc)

Mikhail Gavrilov mikhail.v.gavrilov at gmail.com
Mon Aug 18 14:32:10 UTC 2025


Hi Gang,

Somewhere between commits 4b290aae788e and 89748acdf226, my Radeon
7900 XTX started hanging whenever Steam performs shader compilation,
with the following messages and stack trace:

[ 9254.082549] kworker/u129:2 (15855) used greatest stack depth: 19656 bytes left
[ 9435.589185] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
[ 9435.590340] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
[ 9435.590465] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
[ 9435.590881] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
[ 9435.592513] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
[ 9435.594059] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
[ 9435.596428] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
[ 9435.597828] [drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity
[ 9585.848993] INFO: task kworker/u132:12:18278 blocked for more than 122 seconds.
[ 9585.849006]       Tainted: G             L     ------  --- 6.17.0-0.rc0.250801g89748acdf226.7.fc43.x86_64+debug #1
[ 9585.849010] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9585.849013] task:kworker/u132:12 state:D stack:28600 pid:18278 tgid:18278 ppid:2      task_flags:0x4208060 flags:0x00004000
[ 9585.849022] Workqueue: ttm ttm_bo_delayed_delete [ttm]
[ 9585.849032] Call Trace:
[ 9585.849034]  <TASK>
[ 9585.849037]  __schedule+0x8d2/0x1be0
[ 9585.849044]  ? __pfx___schedule+0x10/0x10
[ 9585.849051]  ? __lock_release.isra.0+0x1cb/0x340
[ 9585.849059]  schedule+0xd4/0x260
[ 9585.849062]  schedule_timeout+0x17f/0x260
[ 9585.849065]  ? __pfx_schedule_timeout+0x10/0x10
[ 9585.849067]  ? find_held_lock+0x2b/0x80
[ 9585.849074]  ? lockdep_hardirqs_on_prepare.part.0+0x92/0x170
[ 9585.849076]  ? trace_hardirqs_on+0x18/0x150
[ 9585.849081]  dma_fence_default_wait+0x472/0x700
[ 9585.849087]  ? find_held_lock+0x2b/0x80
[ 9585.849089]  ? __pfx_dma_fence_default_wait+0x10/0x10
[ 9585.849092]  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 9585.849095]  ? mark_held_locks+0x40/0x70
[ 9585.849098]  ? lockdep_hardirqs_on_prepare.part.0+0x92/0x170
[ 9585.849103]  dma_fence_wait_timeout+0x344/0x540
[ 9585.849107]  dma_resv_wait_timeout+0xeb/0x190
[ 9585.849111]  ? __pfx_dma_resv_wait_timeout+0x10/0x10
[ 9585.849117]  ? rcu_is_watching+0x15/0xe0
[ 9585.849122]  ttm_bo_delayed_delete+0x34/0x100 [ttm]
[ 9585.849128]  process_one_work+0x87a/0x14d0
[ 9585.849140]  ? __pfx_process_one_work+0x10/0x10
[ 9585.849145]  ? find_held_lock+0x2b/0x80
[ 9585.849153]  ? assign_work+0x156/0x390
[ 9585.849161]  worker_thread+0x5f2/0xfd0
[ 9585.849172]  ? __pfx_worker_thread+0x10/0x10
[ 9585.849175]  kthread+0x3b0/0x770
[ 9585.849178]  ? local_clock_noinstr+0xf/0x130
[ 9585.849182]  ? __pfx_kthread+0x10/0x10
[ 9585.849186]  ? rcu_is_watching+0x15/0xe0
[ 9585.849188]  ? __pfx_kthread+0x10/0x10
[ 9585.849192]  ret_from_fork+0x3ef/0x510
[ 9585.849196]  ? __pfx_kthread+0x10/0x10
[ 9585.849198]  ? __pfx_kthread+0x10/0x10
[ 9585.849201]  ret_from_fork_asm+0x1a/0x30
[ 9585.849210]  </TASK>

I can also reproduce the same error when starting a campaign in Halo
Infinite (without relying on Steam’s shader pre-cache UI), which made
bisecting feasible.

1f02f2044bda1db1fd995bc35961ab075fa7b5a2 is the first bad commit
commit 1f02f2044bda1db1fd995bc35961ab075fa7b5a2 (HEAD)
Author: Gang Ba <Gang.Ba at amd.com>
Date:   Tue Jul 8 14:36:13 2025 -0400

    drm/amdgpu: Avoid extra evict-restore process.

    If vm belongs to another process, this is fclose after fork,
    wait may enable signaling KFD eviction fence and cause parent
    process queue evicted.

    [677852.634569]  amdkfd_fence_enable_signaling+0x56/0x70 [amdgpu]
    [677852.634814]  __dma_fence_enable_signaling+0x3e/0xe0
    [677852.634820]  dma_fence_wait_timeout+0x3a/0x140
    [677852.634825]  amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
    [677852.634831]  amdgpu_vm_wait_idle+0x2d/0x60 [amdgpu]
    [677852.635026]  amdgpu_flush+0x34/0x50 [amdgpu]
    [677852.635208]  filp_flush+0x38/0x90
    [677852.635213]  filp_close+0x14/0x30
    [677852.635216]  do_close_on_exec+0xdd/0x130
    [677852.635221]  begin_new_exec+0x1da/0x490
    [677852.635225]  load_elf_binary+0x307/0xea0
    [677852.635231]  ? srso_alias_return_thunk+0x5/0xfbef5
    [677852.635235]  ? ima_bprm_check+0xa2/0xd0
    [677852.635240]  search_binary_handler+0xda/0x260
    [677852.635245]  exec_binprm+0x58/0x1a0
    [677852.635249]  bprm_execve.part.0+0x16f/0x210
    [677852.635254]  bprm_execve+0x45/0x80
    [677852.635257]  do_execveat_common.isra.0+0x190/0x200

    Suggested-by: Christian König <christian.koenig at amd.com>
    Signed-off-by: Gang Ba <Gang.Ba at amd.com>
    Reviewed-by: Christian König <christian.koenig at amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
    Cc: stable at vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

Reverting 1f02f2044bda on top of 6.17-rc2 fully eliminates the hang on
my system.

Environment:
    GPU: AMD Radeon 7900 XTX
    Kernel: 6.17-rc2
    Distro: Fedora Rawhide
    Hardware probe: https://linux-hardware.org/?probe=99f5cf44a4
    Kernel config and full dmesg are attached.

Reproducer (two ways):
    1. Launch Steam and trigger shader compilation (automatic
       background pre-cache or any action that compiles shaders).
    2. Alternatively, launch Halo Infinite and start a campaign. Within
       a short time the GPU hang occurs and gpu_sched prints repeated
       "*ERROR* Trying to push to a killed entity", followed by a
       blocked ttm_bo_delayed_delete worker as in the trace above.
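
For anyone repeating the bisection by hand, the two symptoms above can be
checked mechanically after each test boot. What follows is only a minimal
sketch in Python, assuming the dmesg output is piped in on stdin. The
script name (check_hang.py) and the function names are hypothetical and
not part of any kernel tooling; the two patterns are copied from the log
excerpt in this report.

```python
import re
import sys

# Symptom strings taken verbatim from the log excerpt in this report.
KILLED_ENTITY = re.compile(r"Trying to push to a killed entity")
HUNG_TASK = re.compile(r"INFO: task (\S+) blocked for more than")


def scan_dmesg(text):
    """Return (killed_entity_count, hung_task_names) for a dmesg dump."""
    # Collapse all whitespace first so that log lines wrapped by a mail
    # client or terminal still match the patterns.
    flat = re.sub(r"\s+", " ", text)
    killed = len(KILLED_ENTITY.findall(flat))
    hung = HUNG_TASK.findall(flat)
    return killed, hung


def verdict(text):
    """Map a dmesg dump to the word expected by 'git bisect good|bad'."""
    killed, hung = scan_dmesg(text)
    return "bad" if killed or hung else "good"


if __name__ == "__main__":
    # Hypothetical usage: dmesg | python3 check_hang.py
    print(verdict(sys.stdin.read()))
```

A "good" result only means that neither symptom has been logged so far; the
reproducer still has to run long enough for the hang to appear before a
commit is marked good.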

Impact / notes:
    This is a runtime GPU hang on a current RDNA3 card; the offending
    commit is CC’d to stable, so it would be good to fix or revert
    before it propagates.

Thanks for looking into this.

-- 
Best Regards,
Mike Gavrilov.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: .config.zip
Type: application/zip
Size: 70245 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20250818/0ff2ce50/attachment-0002.zip>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dmesg-6.17.0-0.rc0.250801g89748acdf226.7.fc43.x86_64+debug.zip
Type: application/zip
Size: 51673 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20250818/0ff2ce50/attachment-0003.zip>

