[PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs

Tue Jul 15 09:12:16 UTC 2025

On Tue, 2025-07-15 at 03:38 +0000, cao, lin wrote:
> 
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> 
> 
> Hi Christian, Philipp,
> 
> 
> I modified the commit msg, and I hope it makes more sense:
> 
> 
> drm/sched: wake up scheduler when killing jobs to prevent hang

nit:
s/wake/Wake

> 
> 
> When application A submits jobs (a1, a2, a3) and application B submits
> job b1 with a dependency on a2's scheduler fence, killing application A
> before run_job(a1) causes drm_sched_entity_kill_jobs_work() to force
> signal all jobs sequentially. However, the optimization in
> drm_sched_entity_add_dependency_cb() waits for the fence to be scheduled
> by adding a callback (drm_sched_entity_clear_dep) instead of immediately
> waking up the scheduler. When A is killed before its jobs can run, the
> callback gets triggered but drm_sched_entity_clear_dep() only clears the
> dependency without waking up the scheduler, causing the scheduler to enter
> sleep state and application B to hang.

That now reads as if drm_sched_entity_clear_dep() is supposed to wake
up the scheduler. But you add the wakeup to a different function (the
work item).

Also the code actually looks like that:

		xa_erase(&job->dependencies, index);
		if (f && !dma_fence_add_callback(f, &job->finish_cb,
						 drm_sched_entity_kill_jobs_cb))
			return;

		dma_fence_put(f);
	}

	INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
	schedule_work(&job->work);
}

So if adding that callback succeeds we.. return immediately but we.. do
that for the first dependency, not for all of them?

Oh boy. That code base. We(tm) should look into pimping that up. Also
lacks documentation everywhere.

Anyways. If this solves a bug for you the patch looks fine (i.e., not
potentially harmful) by me and if the tests succeed we can merge it.
However, I'd feel better if you could clarify more why that function is
the right place to solve the bug.

Thanks,
P.

> 
> 
> Add drm_sched_wakeup() in entity_kill_job_work() to prevent scheduler
> sleep and subsequent application hangs.
> 
> 
> v2:
> - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
> 
> 
> Thanks,
> Lin
> 
> 
> From: Koenig, Christian <Christian.Koenig at amd.com>
> Sent: Monday, July 14, 2025 21:39
> To: phasta at kernel.org <phasta at kernel.org>; cao, lin <lin.cao at amd.com>; dri-devel at lists.freedesktop.org <dri-devel at lists.freedesktop.org>
> Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin at amd.com>; Deng, Emily <Emily.Deng at amd.com>; dakr at kernel.org <dakr at kernel.org>; matthew.brost at intel.com <matthew.brost at intel.com>
> Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
> 
>  
> 
> 
> On 14.07.25 15:27, Philipp Stanner wrote:
> > On Mon, 2025-07-14 at 15:08 +0200, Christian König wrote:
> > > 
> > > 
> > > On 14.07.25 14:46, Philipp Stanner wrote:
> > > > regarding the patch subject: the prefix we use for the scheduler
> > > > is:
> > > > drm/sched:
> > > > 
> > > > 
> > > > On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
> > > > 
> > > > > When Application A submits jobs (a1, a2, a3) and application B
> > > > > submits
> > > > 
> > > > s/Application/application
> > > > 
> > > > > job b1 with a dependency on a2's scheduler fence, killing
> > > > > application A
> > > > > before run_job(a1) causes drm_sched_entity_kill_jobs_work() to
> > > > > force
> > > > > signal all jobs sequentially. However, due to missing
> > > > > work_run_job or
> > > > > work_free_job
> > > > > 
> > > > 
> > > > You probably mean "due to the work items work_run_job and
> > > > work_free_job
> > > > not being started […]".
> > > > 
> > > > However, the call you add, drm_sched_wakeup(), immediately only
> > > > triggers work_run_job. It's not clear to me why you mention
> > > > work_free_job, because drm_sched_entity_kill_jobs_work() directly
> > > > invokes the free_job() callback.
> > > 
> > > Yeah the description is rather confusing, took me more than one try
> > > to understand what's going on here as well. Let me try to explain it
> > > in my words:
> > > 
> > > When an application A is killed the pending submissions from it are
> > > not executed, but rather torn down by
> > > drm_sched_entity_kill_jobs_work().
> > > 
> > > If now a submission from application B depends on something
> > > application A wanted to do on the same scheduler instance the
> > > function drm_sched_entity_add_dependency_cb() would have optimized
> > > this dependency and assumed that the scheduler work would already run
> > > and try to pick a job.
> > > 
> > > But that isn't the case when the submissions from application A are
> > > all killed. So make sure to start the scheduler work from
> > > drm_sched_entity_kill_jobs_work() to handle that case.
> > 
> > Alright, so the bug then depends on B setting a dependency on A _after_
> > A was killed, doesn't it? Because the optimization in
> > _add_dependency_cb() can only have an effect after A's jobs have been
> > torn down.
> 
> No, application A and application B submit right behind each other. Application A is then killed later on.
> 
> The optimization in _add_dependency_cb() just waits for As submission to be handled by the scheduler, but that never happens because it was killed.
> 
> > If there is such a timing order problem, the commit message should
> > mention it.
> > 
> > The commit message definitely also needs to mention this optimization
> > in drm_sched_entity_add_dependency_cb() since it's essential for
> > understanding the bug.
> 
> +1
> 
> Christian.
> 
> > 
> > 
> > Danke
> > P.
> > 
> > 
> > > 
> > > Regards,
> > > Christian.
> > > 
> > > > 
> > > > >  in entity_kill_job_work(), the scheduler enters sleep
> > > > > state, causing application B hang.
> > > > 
> > > > So the issue is that if a1 never ran, the scheduler will not
> > > > continue
> > > > with the jobs of application B, correct?
> > > > 
> > > > > 
> > > > > Add drm_sched_wakeup() when entity_kill_job_work() to preventing
> > > > 
> > > > s/to preventing/to prevent
> > > > 
> > > > > scheduler sleep and subsequent application hangs.
> > > > > 
> > > > > v2:
> > > > > - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
> > > > 
> > > > Changelogs are cool and useful, but move them below the Signed-off
> > > > area
> > > > with another --- please, like so:
> > > > 
> > > > Signed-off by: …
> > > > ---
> > > > v2:
> > > > …
> > > > ---
> > > > drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> > > > 
> > > > 
> > > > > 
> > > > > Signed-off-by: Lin.Cao <lincao12 at amd.com>
> > > > 
> > > > This fixes a bug. Even a racy bug, it seems. So we need a Fixes tag
> > > > and
> > > > put the stable kernel in CC.
> > > > 
> > > > Please figure out with git blame, git log and git tag --contains
> > > > which
> > > > commit your patch fixes and which is the oldest kernel that
> > > > contains
> > > > the bug. Then add a Fixes: tag and Cc: the stable kernel. You'll
> > > > find
> > > > lots of examples for that in git log.
> > > > 
> > > > 
> > > > Thx
> > > > P.
> > > > 
> > > > > ---
> > > > >  drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> > > > >  1 file changed, 1 insertion(+)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > index e671aa241720..66f2a43c58fd 100644
> > > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > @@ -177,6 +177,7 @@ static void
> > > > > drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
> > > > >    struct drm_sched_job *job = container_of(wrk,
> > > > > typeof(*job), work);
> > > > >  
> > > > >    drm_sched_fence_scheduled(job->s_fence, NULL);
> > > > > +  drm_sched_wakeup(job->sched);
> > > > >    drm_sched_fence_finished(job->s_fence, -ESRCH);
> > > > >    WARN_ON(job->s_fence->parent);
> > > > >    job->sched->ops->free_job(job);
> > > > 
> > > 
> > 
>