[PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs

Philipp Stanner phasta at mailbox.org
Tue Jul 15 10:27:10 UTC 2025


On Tue, 2025-07-15 at 09:51 +0000, cao, lin wrote:
> 
> 
> 
> 
> Hi Philipp,
> 
> 
> Thanks for your review, let me try to clarify why I added drm_sched_wakeup() to drm_sched_entity_kill_jobs_work():
> 
> 
> When application A submits jobs (a1, a2, a3) and application B submits job b1 with a dependency on a1's scheduled fence, the normal execution flow is (function drm_sched_run_job_work()):
> 1. a1 gets popped from the entity by the scheduler
> 2. run_job(a1) executes
> 3. a1's scheduled fence gets signaled 
> 4. drm_sched_run_job_work() calls drm_sched_run_job_queue() at the end
> 5. The scheduler wakes up and re-selects entities to pop jobs
> 6. Since b1's dependency is cleared, the scheduler can select b1 and continue the same flow
> 
> 
> However, if application A is killed before a1 gets popped by the scheduler, then a1, a2, a3 are killed sequentially by drm_sched_entity_kill_jobs_cb(). During the kill process, their scheduled fences are still signaled, but the kill path itself never queues work_run_job. This means b1's dependency gets cleared, but there is no work_run_job to drive the scheduler to continue running (step 4 in the normal flow is missing), so the scheduler enters sleep state, which causes application B to hang.
> So if we add drm_sched_wakeup() in drm_sched_entity_kill_jobs_work() after drm_sched_fence_scheduled(), the scheduler can be woken up, and application B can continue running after a1's scheduled fence is force-signaled.

Thanks for the detailed explanation. Makes sense.
Maybe you can detail in v3 that this "optimization" consists of the
work item not being scheduled. I think that was the piece of the puzzle
I was missing.

I (or rather the DRM tools) will also include a link to this thread, so
I think that will then be sufficient.

Thx
P.

> 
> Thanks,
> Lin
> 
> 
> 
> 
> 
> From: Philipp Stanner <phasta at mailbox.org>
> Sent: Tuesday, July 15, 2025 17:12
> To: cao, lin <lin.cao at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; phasta at kernel.org <phasta at kernel.org>; dri-devel at lists.freedesktop.org <dri-devel at lists.freedesktop.org>
> Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin at amd.com>; Deng, Emily <Emily.Deng at amd.com>; dakr at kernel.org <dakr at kernel.org>; matthew.brost at intel.com <matthew.brost at intel.com>
> Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
> 
> 
> On Tue, 2025-07-15 at 03:38 +0000, cao, lin wrote:
> > 
> > 
> > 
> > 
> > Hi Christian, Philipp,
> > 
> > 
> > I modified the commit msg, and I hope it makes more sense:
> > 
> > 
> > drm/sched: wake up scheduler when killing jobs to prevent hang
> 
> nit:
> s/wake/Wake
> 
> > 
> > 
> > When application A submits jobs (a1, a2, a3) and application B submits
> > job b1 with a dependency on a2's scheduler fence, killing application A
> > before run_job(a1) causes drm_sched_entity_kill_jobs_work() to force
> > signal all jobs sequentially. However, the optimization in
> > drm_sched_entity_add_dependency_cb() waits for the fence to be scheduled
> > by adding a callback (drm_sched_entity_clear_dep) instead of immediately
> > waking up the scheduler. When A is killed before its jobs can run, the
> > callback gets triggered but drm_sched_entity_clear_dep() only clears the
> > dependency without waking up the scheduler, causing the scheduler to enter
> > sleep state and application B to hang.
> 
> That now reads as if drm_sched_entity_clear_dep() is supposed to wake
> up the scheduler. But you add the wakeup to a different function (the
> work item).
> 
> Also the code actually looks like that:
> 
> 
>                 xa_erase(&job->dependencies, index);
>                 if (f && !dma_fence_add_callback(f, &job->finish_cb,
>                                                  drm_sched_entity_kill_jobs_cb))
>                         return;
> 
>                 dma_fence_put(f);
>         }
> 
>         INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
>         schedule_work(&job->work);
> }
> 
> So if adding that callback succeeds we.. return immediately but we.. do
> that for the first dependency, not for all of them?
> 
> Oh boy. That code base. We(tm) should look into pimping that up. Also
> lacks documentation everywhere.
> 
> 
> Anyways. If this solves a bug for you, the patch looks fine (i.e., not
> potentially harmful) to me, and if the tests succeed we can merge it.
> However, I'd feel better if you could clarify more why that function is
> the right place to solve the bug.
> 
> 
> Thanks,
> P.
> 
> 
> > 
> > 
> > Add drm_sched_wakeup() in entity_kill_job_work() to prevent scheduler
> > sleep and subsequent application hangs.
> > 
> > 
> > v2:
> > - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
> > 
> > 
> > Thanks,
> > Lin
> > 
> > 
> > From: Koenig, Christian <Christian.Koenig at amd.com>
> > Sent: Monday, July 14, 2025 21:39
> > To: phasta at kernel.org <phasta at kernel.org>; cao, lin <lin.cao at amd.com>; dri-devel at lists.freedesktop.org <dri-devel at lists.freedesktop.org>
> > Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin at amd.com>; Deng, Emily <Emily.Deng at amd.com>; dakr at kernel.org <dakr at kernel.org>; matthew.brost at intel.com <matthew.brost at intel.com>
> > Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
> > 
> >  
> > 
> > 
> > On 14.07.25 15:27, Philipp Stanner wrote:
> > > On Mon, 2025-07-14 at 15:08 +0200, Christian König wrote:
> > > > 
> > > > 
> > > > On 14.07.25 14:46, Philipp Stanner wrote:
> > > > > regarding the patch subject: the prefix we use for the scheduler
> > > > > is:
> > > > > drm/sched:
> > > > > 
> > > > > 
> > > > > On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
> > > > > 
> > > > > > When Application A submits jobs (a1, a2, a3) and application B
> > > > > > submits
> > > > > 
> > > > > s/Application/application
> > > > > 
> > > > > > job b1 with a dependency on a2's scheduler fence, killing
> > > > > > application A
> > > > > > before run_job(a1) causes drm_sched_entity_kill_jobs_work() to
> > > > > > force
> > > > > > signal all jobs sequentially. However, due to missing
> > > > > > work_run_job or
> > > > > > work_free_job
> > > > > > 
> > > > > 
> > > > > You probably mean "due to the work items work_run_job and
> > > > > work_free_job
> > > > > not being started […]".
> > > > > 
> > > > > However, the call you add, drm_sched_wakeup(), immediately only
> > > > > triggers work_run_job. It's not clear to me why you mention
> > > > > work_free_job, because drm_sched_entity_kill_jobs_work() directly
> > > > > invokes the free_job() callback.
> > > > 
> > > > Yeah the description is rather confusing, took me more than one try
> > > > to understand what's going on here as well. Let me try to explain it
> > > > in my words:
> > > > 
> > > > When an application A is killed the pending submissions from it are
> > > > not executed, but rather torn down by
> > > > drm_sched_entity_kill_jobs_work().
> > > > 
> > > > If now a submission from application B depends on something
> > > > application A wanted to do on the same scheduler instance the
> > > > function drm_sched_entity_add_dependency_cb() would have optimized
> > > > this dependency and assumed that the scheduler work would already run
> > > > and try to pick a job.
> > > > 
> > > > But that isn't the case when the submissions from application A are
> > > > all killed. So make sure to start the scheduler work from
> > > > drm_sched_entity_kill_jobs_work() to handle that case.
> > > 
> > > Alright, so the bug then depends on B setting a dependency on A _after_
> > > A was killed, doesn't it? Because the optimization in
> > > _add_dependency_cb() can only have an effect after A's jobs have been
> > > torn down.
> > 
> > No, application A and application B submit right behind each other. Application A is then killed later on.
> > 
> > The optimization in _add_dependency_cb() just waits for A's submission to be handled by the scheduler, but that never happens because it was killed.
> > 
> > > If there is such a timing order problem, the commit message should
> > > mention it.
> > > 
> > > The commit message definitely also needs to mention this optimization
> > > in drm_sched_entity_add_dependency_cb() since it's essential for
> > > understanding the bug.
> > 
> > +1
> > 
> > Christian.
> > 
> > > 
> > > 
> > > Danke
> > > P.
> > > 
> > > 
> > > > 
> > > > Regards,
> > > > Christian.
> > > > 
> > > > > 
> > > > > >  in entity_kill_job_work(), the scheduler enters sleep
> > > > > > state, causing application B hang.
> > > > > 
> > > > > So the issue is that if a1 never ran, the scheduler will not
> > > > > continue
> > > > > with the jobs of application B, correct?
> > > > > 
> > > > > > 
> > > > > > Add drm_sched_wakeup() when entity_kill_job_work() to preventing
> > > > > 
> > > > > s/to preventing/to prevent
> > > > > 
> > > > > > scheduler sleep and subsequent application hangs.
> > > > > > 
> > > > > > v2:
> > > > > > - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
> > > > > 
> > > > > Changelogs are cool and useful, but move them below the Signed-off
> > > > > area
> > > > > with another --- please, like so:
> > > > > 
> > > > > Signed-off by: …
> > > > > ---
> > > > > v2:
> > > > > …
> > > > > ---
> > > > > drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> > > > > 
> > > > > 
> > > > > > 
> > > > > > Signed-off-by: Lin.Cao <lincao12 at amd.com>
> > > > > 
> > > > > This fixes a bug. Even a racy bug, it seems. So we need a Fixes tag
> > > > > and
> > > > > put the stable kernel in CC.
> > > > > 
> > > > > Please figure out with git blame, git log and git tag --contains
> > > > > which
> > > > > commit your patch fixes and which is the oldest kernel that
> > > > > contains
> > > > > the bug. Then add a Fixes: tag and Cc: the stable kernel. You'll
> > > > > find
> > > > > lots of examples for that in git log.
> > > > > 
> > > > > 
> > > > > Thx
> > > > > P.
> > > > > 
> > > > > > ---
> > > > > >  drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> > > > > >  1 file changed, 1 insertion(+)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > index e671aa241720..66f2a43c58fd 100644
> > > > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > @@ -177,6 +177,7 @@ static void
> > > > > > drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
> > > > > >    struct drm_sched_job *job = container_of(wrk,
> > > > > > typeof(*job), work);
> > > > > >  
> > > > > >    drm_sched_fence_scheduled(job->s_fence, NULL);
> > > > > > +  drm_sched_wakeup(job->sched);
> > > > > >    drm_sched_fence_finished(job->s_fence, -ESRCH);
> > > > > >    WARN_ON(job->s_fence->parent);
> > > > > >    job->sched->ops->free_job(job);
> > > > > 
> > > > 
> > > 
> > 
> 


