[PATCH] drm/amdgpu: fix task hang from failed job submission during process kill
Philipp Stanner
phasta at mailbox.org
Tue Aug 12 12:52:23 UTC 2025
On Tue, 2025-08-12 at 08:58 +0200, Christian König wrote:
> On 12.08.25 08:37, Liu01, Tong (Esther) wrote:
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> > Hi Christian,
> >
> > If a job is submitted to a stopped entity, then in addition to an error log it will also cause the task to hang and time out
>
> Oh that's really ugly and needs to get fixed.
And we agree that the proposed fix is to stop the driver from
submitting to killed entities, don't we?
P.
>
> > , and subsequently generate a call trace, since the fence of the submitted job is never signaled. Moreover, the refcount of amdgpu will not drop because the process kill never completes, which makes it impossible to unload amdgpu.
> >
> > [Tue Aug 5 11:05:20 2025] [drm:amddrm_sched_entity_push_job [amd_sched]] *ERROR* Trying to push to a killed entity
> > [Tue Aug 5 11:07:43 2025] INFO: task kworker/u17:0:117 blocked for more than 122 seconds.
> > [Tue Aug 5 11:07:43 2025] Tainted: G OE 6.8.0-45-generic #45-Ubuntu
> > [Tue Aug 5 11:07:43 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [Tue Aug 5 11:07:43 2025] task:kworker/u17:0 state:D stack:0 pid:117 tgid:117 ppid:2 flags:0x00004000
> > [Tue Aug 5 11:07:43 2025] Workqueue: ttm ttm_bo_delayed_delete [amdttm]
> > [Tue Aug 5 11:07:43 2025] Call Trace:
> > [Tue Aug 5 11:07:43 2025] <TASK>
> > [Tue Aug 5 11:07:43 2025] __schedule+0x27c/0x6b0
> > [Tue Aug 5 11:07:43 2025] schedule+0x33/0x110
> > [Tue Aug 5 11:07:43 2025] schedule_timeout+0x157/0x170
> > [Tue Aug 5 11:07:43 2025] dma_fence_default_wait+0x1e1/0x220
> > [Tue Aug 5 11:07:43 2025] ? __pfx_dma_fence_default_wait_cb+0x10/0x10
> > [Tue Aug 5 11:07:43 2025] dma_fence_wait_timeout+0x116/0x140
> > [Tue Aug 5 11:07:43 2025] amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
> > [Tue Aug 5 11:07:43 2025] ttm_bo_delayed_delete+0x2a/0xc0 [amdttm]
> > [Tue Aug 5 11:07:43 2025] process_one_work+0x16f/0x350
> > [Tue Aug 5 11:07:43 2025] worker_thread+0x306/0x440
> > [Tue Aug 5 11:07:43 2025] ? __pfx_worker_thread+0x10/0x10
> > [Tue Aug 5 11:07:43 2025] kthread+0xf2/0x120
> > [Tue Aug 5 11:07:43 2025] ? __pfx_kthread+0x10/0x10
> > [Tue Aug 5 11:07:43 2025] ret_from_fork+0x47/0x70
> > [Tue Aug 5 11:07:43 2025] ? __pfx_kthread+0x10/0x10
> > [Tue Aug 5 11:07:43 2025] ret_from_fork_asm+0x1b/0x30
> > [Tue Aug 5 11:07:43 2025] </TASK>
> >
> > Checking whether the vm entity is stopped in amdgpu_vm_ready() would avoid submitting jobs to a stopped entity. But as I understand it, there is still a risk of memory and resource leaks, because amdgpu_vm_clear_freed() is skipped while the process is being killed. amdgpu_vm_clear_freed() updates the page tables to remove the mappings and frees the mapping structures. If this cleanup is skipped, the page table entries remain in VRAM pointing to freed buffer objects, and the mapping structures stay allocated but are never freed. Please correct me if I have any misunderstanding.
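> >
> > Roughly, the check I have in mind would look like this (just a sketch from
> > my reading of the code; vm->immediate and vm->delayed are the
> > drm_sched_entity members of struct amdgpu_vm, and the exact body of
> > amdgpu_vm_ready() and the locking may need adjusting):
> >
> >     bool amdgpu_vm_ready(struct amdgpu_vm *vm)
> >     {
> >             bool ret;
> >
> >             amdgpu_vm_eviction_lock(vm);
> >             ret = !vm->evicting;
> >             amdgpu_vm_eviction_unlock(vm);
> >
> >             spin_lock(&vm->status_lock);
> >             ret &= list_empty(&vm->evicted);
> >             spin_unlock(&vm->status_lock);
> >
> >             /* new: refuse further submissions once the entities were
> >              * stopped during process kill */
> >             ret &= !READ_ONCE(vm->immediate.stopped);
> >             ret &= !READ_ONCE(vm->delayed.stopped);
> >
> >             return ret;
> >     }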
>
> No, your understanding is correct, but the fact that the page tables are not cleared is completely harmless.
>
> The application is killed and can't submit anything any more. We should just make sure that we check amdgpu_vm_ready() in the submit path as well.
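>
> Something along these lines in amdgpu_gem_object_close() should do it
> (untested sketch, error handling trimmed; the label name is illustrative):
>
>     /* Bail out instead of pushing into a stopped entity */
>     if (!amdgpu_vm_ready(vm))
>             goto out_unlock;
>
>     r = amdgpu_vm_clear_freed(adev, vm, &fence);
>     if (r || !fence)
>             goto out_unlock;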
>
> Regards,
> Christian.
>
> >
> > Kind regards,
> > Esther
> >
> > -----Original Message-----
> > From: Koenig, Christian <Christian.Koenig at amd.com>
> > Sent: Monday, August 11, 2025 8:17 PM
> > To: Liu01, Tong (Esther) <Tong.Liu01 at amd.com>; dri-devel at lists.freedesktop.org
> > Cc: phasta at kernel.org; dakr at kernel.org; matthew.brost at intel.com; Ba, Gang <Gang.Ba at amd.com>; matthew.schwartz at linux.dev; cao, lin <lin.cao at amd.com>
> > Subject: Re: [PATCH] drm/amdgpu: fix task hang from failed job submission during process kill
> >
> > Hi Esther,
> >
> > But that is harmless and at most gives a warning in the system log.
> >
> > You could adjust amdgpu_vm_ready() if necessary.
> >
> > Regards,
> > Christian.
> >
> > On 11.08.25 11:05, Liu01, Tong (Esther) wrote:
> > > [AMD Official Use Only - AMD Internal Distribution Only]
> > >
> > > Hi Christian,
> > >
> > > The real issue is a race condition during process exit, introduced by commit https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1f02f2044bda1db1fd995bc35961ab075fa7b5a2. That patch changed amdgpu_vm_wait_idle() to use drm_sched_entity_flush() instead of dma_resv_wait_timeout(). Here is what happens:
> > >
> > > do_exit
> > >   |
> > >   +- exit_files(tsk) ... amdgpu_flush ... amdgpu_vm_wait_idle
> > >   |                  ... drm_sched_entity_flush (kills entity)
> > >   |
> > >   +- exit_task_work(tsk) ... amdgpu_gem_object_close
> > >                          ... amdgpu_vm_clear_freed (tries to submit to killed entity)
> > >
> > > The entity gets killed in amdgpu_vm_wait_idle(), but amdgpu_vm_clear_freed() called by exit_task_work() still tries to submit jobs.
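> > >
> > > For reference, the relevant part of drm_sched_entity_flush() looks
> > > roughly like this (paraphrased; details vary between kernel versions):
> > >
> > >     /* For a killed process, disable any further IB enqueue right away */
> > >     last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
> > >     if ((!last_user || last_user == current->group_leader) &&
> > >         (current->flags & PF_EXITING) && (current->exit_code == SIGKILL))
> > >             drm_sched_entity_kill(entity);  /* marks entity->stopped */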
> > >
> > > Kind regards,
> > > Esther
> > >
> > > -----Original Message-----
> > > From: Koenig, Christian <Christian.Koenig at amd.com>
> > > Sent: Monday, August 11, 2025 4:25 PM
> > > To: Liu01, Tong (Esther) <Tong.Liu01 at amd.com>;
> > > dri-devel at lists.freedesktop.org
> > > Cc: phasta at kernel.org; dakr at kernel.org; matthew.brost at intel.com; Ba,
> > > Gang <Gang.Ba at amd.com>; matthew.schwartz at linux.dev; cao, lin
> > > <lin.cao at amd.com>; cao, lin <lin.cao at amd.com>
> > > Subject: Re: [PATCH] drm/amdgpu: fix task hang from failed job
> > > submission during process kill
> > >
> > > On 11.08.25 09:20, Liu01 Tong wrote:
> > > > During process kill, drm_sched_entity_flush() will kill the vm
> > > > entities. Subsequent job submissions from this process will then fail
> > >
> > > Well, when the process is killed, how can it still make job submissions?
> > >
> > > Regards,
> > > Christian.
> > >
> > > > , and
> > > > the resources of these jobs are never released, nor are their fences
> > > > signaled, causing tasks to hang.
> > > >
> > > > Fix this by failing job init when the entity is stopped. And when the job
> > > > has already been submitted, free the job's resources if the entity is stopped.
> > > >
> > > > Signed-off-by: Liu01 Tong <Tong.Liu01 at amd.com>
> > > > Signed-off-by: Lin.Cao <lincao12 at amd.com>
> > > > ---
> > > > drivers/gpu/drm/scheduler/sched_entity.c | 13 +++++++------
> > > > drivers/gpu/drm/scheduler/sched_main.c | 5 +++++
> > > > 2 files changed, 12 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > index ac678de7fe5e..1e744b2eb2db 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > @@ -570,6 +570,13 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> > > > bool first;
> > > > ktime_t submit_ts;
> > > >
> > > > + if (entity->stopped) {
> > > > + DRM_ERROR("Trying to push job to a killed entity\n");
> > > > + INIT_WORK(&sched_job->work, drm_sched_entity_kill_jobs_work);
> > > > + schedule_work(&sched_job->work);
> > > > + return;
> > > > + }
> > > > +
> > > > trace_drm_sched_job(sched_job, entity);
> > > > atomic_inc(entity->rq->sched->score);
> > > >  	WRITE_ONCE(entity->last_user, current->group_leader);
> > > > @@ -589,12 +596,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> > > >
> > > > /* Add the entity to the run queue */
> > > > spin_lock(&entity->lock);
> > > > - if (entity->stopped) {
> > > > - spin_unlock(&entity->lock);
> > > > -
> > > > - DRM_ERROR("Trying to push to a killed entity\n");
> > > > - return;
> > > > - }
> > > >
> > > > rq = entity->rq;
> > > > sched = rq->sched;
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > index bfea608a7106..c15b17d9ffe3 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > @@ -795,6 +795,11 @@ int drm_sched_job_init(struct drm_sched_job *job,
> > > > return -ENOENT;
> > > > }
> > > >
> > > > + if (unlikely(entity->stopped)) {
> > > > + pr_err("*ERROR* %s: entity is stopped!\n", __func__);
> > > > + return -EINVAL;
> > > > + }
> > > > +
> > > > if (unlikely(!credits)) {
> > > > pr_err("*ERROR* %s: credits cannot be 0!\n", __func__);
> > > > return -EINVAL;
> > >
> >
>