[PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs

Tue Jul 15 13:20:50 UTC 2025

[AMD Official Use Only - AMD Internal Distribution Only]

Hi Philipp, Christian,

Thanks for the kind words! :)

I've reconsidered the approach and decided to go with the "removing code" solution instead. I'll drop the v3 patch and submit a new v4 that removes the problematic optimization in drm_sched_entity_add_dependency_cb().

Thanks,
Lin
________________________________
From: Philipp Stanner <phasta at mailbox.org>
Sent: Tuesday, July 15, 2025 21:07
To: Koenig, Christian <Christian.Koenig at amd.com>; phasta at kernel.org <phasta at kernel.org>; cao, lin <lin.cao at amd.com>; dri-devel at lists.freedesktop.org <dri-devel at lists.freedesktop.org>
Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin at amd.com>; Deng, Emily <Emily.Deng at amd.com>; dakr at kernel.org <dakr at kernel.org>; matthew.brost at intel.com <matthew.brost at intel.com>
Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs

On Tue, 2025-07-15 at 14:32 +0200, Christian König wrote:
> On 15.07.25 14:20, Philipp Stanner wrote:
> > On Tue, 2025-07-15 at 12:52 +0200, Christian König wrote:
> > > On 15.07.25 12:27, Philipp Stanner wrote:
> > > > On Tue, 2025-07-15 at 09:51 +0000, cao, lin wrote:
> > > > >
> > > > > [AMD Official Use Only - AMD Internal Distribution Only]
> > > > >
> > > > >
> > > > >
> > > > > Hi Philipp,
> > > > >
> > > > >
> > > > > Thanks for your review, let me try to clarify why I added
> > > > > drm_sched_wakeup() to drm_sched_entity_kill_jobs_work():
> > > > >
> > > > >
> > > > > When application A submits jobs (a1, a2, a3) and application
> > > > > B
> > > > > submits job b1 with a dependency on a1's scheduled fence, the
> > > > > normal execution flow is (function drm_sched_run_job_work()):
> > > > > 1. a1 gets popped from the entity by the scheduler
> > > > > 2. run_job(a1) executes
> > > > > 3. a1's scheduled fence gets signaled
> > > > > 4. drm_sched_run_job_work() calls drm_sched_run_job_queue()
> > > > > at
> > > > > the end
> > > > > 5. The scheduler wakes up and re-selects entities to pop jobs
> > > > > 6. Since b1's dependency is cleared, the scheduler can select
> > > > > b1
> > > > > and continue the same flow
> > > > >
> > > > >
> > > > > However, if application A is killed before a1 gets popped by
> > > > > the
> > > > > scheduler, then a1, a2, a3 are killed sequentially by
> > > > > drm_sched_entity_kill_jobs_cb(). During the kill process,
> > > > > their
> > > > > scheduled fences are still signaled, but the kill process
> > > > > itself
> > > > > lacks work_run_job. This means b1's dependency gets cleared,
> > > > > but
> > > > > there's no work_run_job to drive the scheduler to continue
> > > > > running, or we can saystep 4 in the normal flow is missing,
> > > > > and
> > > > > the scheduler enters sleep state, which causes application B
> > > > > to
> > > > > hang.
> > > > > So if we add drm_sched_wakeup() in
> > > > > drm_sched_entity_kill_jobs_work() after
> > > > > drm_sched_fence_scheduled(), the scheduler can be woken up,
> > > > > and
> > > > > application B can continue running after a1's scheduled fence
> > > > > is
> > > > > force signaled.
> > > >
> > > > Thanks for the detailed explanation. Makes sense.
> > > > Maybe you can detail in v3 that this "optimization" consists of
> > > > the
> > > > work item not being scheduled.
> > >
> > > Yeah, removing this "optimization" would also be a valid solution
> > > to
> > > the problem.
> >
> > If you at AMD are willing to work on that that'd be definitely
> > fine.
> > Removing code is usually better than adding code.
>
> I fear I won't have time for that before my retirement :)

You've got fresh, powerful folks like Lin! :)

But either solution is fine. If you just want it fixed, we can merge
the existing approach.

>
> > Do you think being afraid of a performance regression is realistic
> > here, though?
>
> No, absolutely not. It was just a micro optimization done long ago.
>
> On the other hand with the scheduler code base I'm not sure of
> anything any more.

"In my subsystem you have to run as fast as you can just to remain at
the same place", said the Red Queen to Alice.

:)

P.

>
> Regards,
> Christian.
>
> >
> >
> > P.
> >
> > >
> > > Christian.
> > >
> > > > I think that was the piece of the puzzle
> > > > I was missing.
> > > >
> > > > I / DRM tools will also include a link to this thread, so I
> > > > think
> > > > that
> > > > will then be sufficient.
> > > >
> > > > Thx
> > > > P.
> > > >
> > > > >
> > > > > Thanks,
> > > > > Lin
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > From: Philipp Stanner <phasta at mailbox.org>
> > > > > Sent: Tuesday, July 15, 2025 17:12
> > > > > To: cao, lin <lin.cao at amd.com>; Koenig, Christian
> > > > > <Christian.Koenig at amd.com>;
> > > > > phasta at kernel.org <phasta at kernel.org>;
> > > > > dri-devel at lists.freedesktop.org <
> > > > > dri-devel at lists.freedesktop.org>
> > > > > Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin at amd.com>; Deng, Emily
> > > > > <Emily.Deng at amd.com>; dakr at kernel.org <dakr at kernel.org>;
> > > > > matthew.brost at intel.com <matthew.brost at intel.com>
> > > > > Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when
> > > > > killing app with dependent jobs
> > > > >
> > > > >
> > > > > On Tue, 2025-07-15 at 03:38 +0000, cao, lin wrote:
> > > > > >
> > > > > > [AMD Official Use Only - AMD Internal Distribution Only]
> > > > > >
> > > > > >
> > > > > >
> > > > > > Hi Christian, Philipp,
> > > > > >
> > > > > >
> > > > > > I modified the commit msg, and I hope it makes more sense:
> > > > > >
> > > > > >
> > > > > > drm/sched: wake up scheduler when killing jobs to prevent
> > > > > > hang
> > > > >
> > > > > nit:
> > > > > s/wake/Wake
> > > > >
> > > > > >
> > > > > >
> > > > > > When application A submits jobs (a1, a2, a3) and
> > > > > > application B
> > > > > > submits
> > > > > > job b1 with a dependency on a2's scheduler fence, killing
> > > > > > application A
> > > > > > before run_job(a1) causes drm_sched_entity_kill_jobs_work()
> > > > > > to
> > > > > > force
> > > > > > signal all jobs sequentially. However, the optimization in
> > > > > > drm_sched_entity_add_dependency_cb() waits for the fence to
> > > > > > be
> > > > > > scheduled
> > > > > > by adding a callback (drm_sched_entity_clear_dep) instead
> > > > > > of
> > > > > > immediately
> > > > > > waking up the scheduler. When A is killed before its jobs
> > > > > > can
> > > > > > run, the
> > > > > > callback gets triggered but drm_sched_entity_clear_dep()
> > > > > > only
> > > > > > clears the
> > > > > > dependency without waking up the scheduler, causing the
> > > > > > scheduler to enter
> > > > > > sleep state and application B to hang.
> > > > >
> > > > > That now reads as if drm_sched_entity_clear_dep() is supposed
> > > > > to
> > > > > wake
> > > > > up the scheduler. But you add the wakeup to a different
> > > > > function
> > > > > (the
> > > > > work item).
> > > > >
> > > > > Also the code actually looks like that:
> > > > >
> > > > >
> > > > >                 xa_erase(&job->dependencies, index);
> > > > >                 if (f && !dma_fence_add_callback(f, &job-
> > > > > > finish_cb,
> > > > >
> > > > > drm_sched_entity_kill_jobs_cb))
> > > > >                         return;
> > > > >
> > > > >                 dma_fence_put(f);
> > > > >         }
> > > > >
> > > > >         INIT_WORK(&job->work,
> > > > > drm_sched_entity_kill_jobs_work);
> > > > >         schedule_work(&job->work);
> > > > > }
> > > > >
> > > > > So if adding that callback succeeds we.. return immediately
> > > > > but
> > > > > we.. do
> > > > > that for the first dependency, not for all of them?
> > > > >
> > > > > Oh boy. That code base. We(tm) should look into pimping that
> > > > > up.
> > > > > Also
> > > > > lacks documentation everywhere.
> > > > >
> > > > >
> > > > > Anyways. If this solves a bug for you the patch looks fine
> > > > > (i.e.,
> > > > > not
> > > > > potentially harmful) by me and if the tests succeed we can
> > > > > merge
> > > > > it.
> > > > > However, I'd feel better if you could clarify more why that
> > > > > function is
> > > > > the right place to solve the bug.
> > > > >
> > > > >
> > > > > Thanks,
> > > > > P.
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > Add drm_sched_wakeup() in entity_kill_job_work() to prevent
> > > > > > scheduler
> > > > > > sleep and subsequent application hangs.
> > > > > >
> > > > > >
> > > > > > v2:
> > > > > > - Move drm_sched_wakeup() to after
> > > > > > drm_sched_fence_scheduled()
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Lin
> > > > > >
> > > > > >
> > > > > > From: Koenig, Christian <Christian.Koenig at amd.com>
> > > > > > Sent: Monday, July 14, 2025 21:39
> > > > > > To: phasta at kernel.org <phasta at kernel.org>; cao, lin
> > > > > > <lin.cao at amd.com>;
> > > > > > dri-devel at lists.freedesktop.org <
> > > > > > dri-devel at lists.freedesktop.org
> > > > > > >
> > > > > > Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin at amd.com>; Deng, Emily
> > > > > > <Emily.Deng at amd.com>; dakr at kernel.org <dakr at kernel.org>;
> > > > > > matthew.brost at intel.com <matthew.brost at intel.com>
> > > > > > Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when
> > > > > > killing app with dependent jobs
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 14.07.25 15:27, Philipp Stanner wrote:
> > > > > > > On Mon, 2025-07-14 at 15:08 +0200, Christian König wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > On 14.07.25 14:46, Philipp Stanner wrote:
> > > > > > > > > regarding the patch subject: the prefix we use for
> > > > > > > > > the
> > > > > > > > > scheduler
> > > > > > > > > is:
> > > > > > > > > drm/sched:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
> > > > > > > > >
> > > > > > > > > > When Application A submits jobs (a1, a2, a3) and
> > > > > > > > > > application B
> > > > > > > > > > submits
> > > > > > > > >
> > > > > > > > > s/Application/application
> > > > > > > > >
> > > > > > > > > > job b1 with a dependency on a2's scheduler fence,
> > > > > > > > > > killing
> > > > > > > > > > application A
> > > > > > > > > > before run_job(a1) causes
> > > > > > > > > > drm_sched_entity_kill_jobs_work() to
> > > > > > > > > > force
> > > > > > > > > > signal all jobs sequentially. However, due to
> > > > > > > > > > missing
> > > > > > > > > > work_run_job or
> > > > > > > > > > work_free_job
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > You probably mean "due to the work items work_run_job
> > > > > > > > > and
> > > > > > > > > work_free_job
> > > > > > > > > not being started […]".
> > > > > > > > >
> > > > > > > > > However, the call you add, drm_sched_wakeup(),
> > > > > > > > > immediately only
> > > > > > > > > triggers work_run_job. It's not clear to me why you
> > > > > > > > > mention
> > > > > > > > > work_free_job, because
> > > > > > > > > drm_sched_entity_kill_jobs_work()
> > > > > > > > > directly
> > > > > > > > > invokes the free_job() callback.
> > > > > > > >
> > > > > > > > Yeah the description is rather confusing, took me more
> > > > > > > > than
> > > > > > > > one try
> > > > > > > > to understand what's going on here as well. Let me try
> > > > > > > > to
> > > > > > > > explain it
> > > > > > > > in my words:
> > > > > > > >
> > > > > > > > When an application A is killed the pending submissions
> > > > > > > > from it are
> > > > > > > > not executed, but rather torn down by
> > > > > > > > drm_sched_entity_kill_jobs_work().
> > > > > > > >
> > > > > > > > If now a submission from application B depends on
> > > > > > > > something
> > > > > > > > application A wanted to do on the same scheduler
> > > > > > > > instance
> > > > > > > > the
> > > > > > > > function drm_sched_entity_add_dependency_cb() would
> > > > > > > > have
> > > > > > > > optimized
> > > > > > > > this dependency and assumed that the scheduler work
> > > > > > > > would
> > > > > > > > already run
> > > > > > > > and try to pick a job.
> > > > > > > >
> > > > > > > > But that isn't the case when the submissions from
> > > > > > > > application A are
> > > > > > > > all killed. So make sure to start the scheduler work
> > > > > > > > from
> > > > > > > > drm_sched_entity_kill_jobs_work() to handle that case.
> > > > > > >
> > > > > > > Alright, so the bug then depends on B setting a
> > > > > > > dependency on
> > > > > > > A _after_
> > > > > > > A was killed, doesn't it? Because the optimization in
> > > > > > > _add_dependency_cb() can only have an effect after A's
> > > > > > > jobs
> > > > > > > have been
> > > > > > > torn down.
> > > > > >
> > > > > > No, application A and application B submit right behind
> > > > > > each
> > > > > > other. Application A is then killed later on.
> > > > > >
> > > > > > The optimization in _add_dependency_cb() just waits for As
> > > > > > submission to be handled by the scheduler, but that never
> > > > > > happens because it was killed.
> > > > > >
> > > > > > > If there is such a timing order problem, the commit
> > > > > > > message
> > > > > > > should
> > > > > > > mention it.
> > > > > > >
> > > > > > > The commit message definitely also needs to mention this
> > > > > > > optimization
> > > > > > > in drm_sched_entity_add_dependency_cb() since it's
> > > > > > > essential
> > > > > > > for
> > > > > > > understanding the bug.
> > > > > >
> > > > > > +1
> > > > > >
> > > > > > Christian.
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Danke
> > > > > > > P.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Christian.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >  in entity_kill_job_work(), the scheduler enters
> > > > > > > > > > sleep
> > > > > > > > > > state, causing application B hang.
> > > > > > > > >
> > > > > > > > > So the issue is that if a1 never ran, the scheduler
> > > > > > > > > will
> > > > > > > > > not
> > > > > > > > > continue
> > > > > > > > > with the jobs of application B, correct?
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Add drm_sched_wakeup() when entity_kill_job_work()
> > > > > > > > > > to
> > > > > > > > > > preventing
> > > > > > > > >
> > > > > > > > > s/to preventing/to prevent
> > > > > > > > >
> > > > > > > > > > scheduler sleep and subsequent application hangs.
> > > > > > > > > >
> > > > > > > > > > v2:
> > > > > > > > > > - Move drm_sched_wakeup() to after
> > > > > > > > > > drm_sched_fence_scheduled()
> > > > > > > > >
> > > > > > > > > Changelogs are cool and useful, but move them below
> > > > > > > > > the
> > > > > > > > > Signed-off
> > > > > > > > > area
> > > > > > > > > with another --- please, like so:
> > > > > > > > >
> > > > > > > > > Signed-off by: …
> > > > > > > > > ---
> > > > > > > > > v2:
> > > > > > > > > …
> > > > > > > > > ---
> > > > > > > > > drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Lin.Cao <lincao12 at amd.com>
> > > > > > > > >
> > > > > > > > > This fixes a bug. Even a racy bug, it seems. So we
> > > > > > > > > need a
> > > > > > > > > Fixes tag
> > > > > > > > > and
> > > > > > > > > put the stable kernel in CC.
> > > > > > > > >
> > > > > > > > > Please figure out with git blame, git log and git tag
> > > > > > > > > --
> > > > > > > > > contains
> > > > > > > > > which
> > > > > > > > > commit your patch fixes and which is the oldest
> > > > > > > > > kernel
> > > > > > > > > that
> > > > > > > > > contains
> > > > > > > > > the bug. Then add a Fixes: tag and Cc: the stable
> > > > > > > > > kernel.
> > > > > > > > > You'll
> > > > > > > > > find
> > > > > > > > > lots of examples for that in git log.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thx
> > > > > > > > > P.
> > > > > > > > >
> > > > > > > > > > ---
> > > > > > > > > >  drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> > > > > > > > > >  1 file changed, 1 insertion(+)
> > > > > > > > > >
> > > > > > > > > > diff --git
> > > > > > > > > > a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > > > b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > > > index e671aa241720..66f2a43c58fd 100644
> > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > > > @@ -177,6 +177,7 @@ static void
> > > > > > > > > > drm_sched_entity_kill_jobs_work(struct work_struct
> > > > > > > > > > *wrk)
> > > > > > > > > >    struct drm_sched_job *job = container_of(wrk,
> > > > > > > > > > typeof(*job), work);
> > > > > > > > > >
> > > > > > > > > >    drm_sched_fence_scheduled(job->s_fence, NULL);
> > > > > > > > > > +  drm_sched_wakeup(job->sched);
> > > > > > > > > >    drm_sched_fence_finished(job->s_fence, -ESRCH);
> > > > > > > > > >    WARN_ON(job->s_fence->parent);
> > > > > > > > > >    job->sched->ops->free_job(job);
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/dri-devel/attachments/20250715/4c1c9bee/attachment-0001.htm>