[PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
Christian König
christian.koenig at amd.com
Mon Jul 14 13:08:13 UTC 2025
On 14.07.25 14:46, Philipp Stanner wrote:
> regarding the patch subject: the prefix we use for the scheduler is:
> drm/sched:
>
>
> On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
>
>> When Application A submits jobs (a1, a2, a3) and application B submits
>
> s/Application/application
>
>> job b1 with a dependency on a2's scheduler fence, killing application A
>> before run_job(a1) causes drm_sched_entity_kill_jobs_work() to force
>> signal all jobs sequentially. However, due to missing work_run_job or
>> work_free_job
>>
>
> You probably mean "due to the work items work_run_job and work_free_job
> not being started […]".
>
> However, the call you add, drm_sched_wakeup(), immediately only
> triggers work_run_job. It's not clear to me why you mention
> work_free_job, because drm_sched_entity_kill_jobs_work() directly
> invokes the free_job() callback.
Yeah, the description is rather confusing; it took me more than one try to understand what's going on here as well. Let me try to explain it in my own words:
When application A is killed, its pending submissions are not executed but rather torn down by drm_sched_entity_kill_jobs_work().
If a submission from application B now depends on something application A wanted to do on the same scheduler instance, drm_sched_entity_add_dependency_cb() has optimized this dependency, assuming that the scheduler work is already running and will try to pick up the next job.
But that isn't the case when the submissions from application A are all killed. So make sure to start the scheduler work from drm_sched_entity_kill_jobs_work() to handle that case.
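For reference, the shortcut in question looks roughly like this (abridged from drm_sched_entity_add_dependency_cb() in sched_entity.c and quoted from memory, so the exact checks may differ in your tree):

	s_fence = to_drm_sched_fence(fence);
	if (!fence->error && s_fence && s_fence->sched == sched) {
		/*
		 * The dependency comes from the same scheduler, so only
		 * wait for the job to be picked up (its scheduled fence)
		 * instead of the full hardware fence.
		 */
		fence = dma_fence_get(&s_fence->scheduled);
		dma_fence_put(entity->dependency);
		entity->dependency = fence;

		/*
		 * drm_sched_entity_clear_dep() only clears the dependency
		 * and does not call drm_sched_wakeup(), unlike
		 * drm_sched_entity_wakeup() which the cross-scheduler path
		 * uses. The assumption is that the scheduler work is running
		 * anyway to execute A's job and will pick up B's job
		 * afterwards.
		 */
		if (!dma_fence_add_callback(fence, &entity->cb,
					    drm_sched_entity_clear_dep))
			return true;

		/* Already scheduled, no callback needed. */
		dma_fence_put(fence);
		return false;
	}

When A's jobs are killed that assumption no longer holds: the scheduled fences are signaled from drm_sched_entity_kill_jobs_work() without the scheduler work ever running, so nothing picks up B's job until the drm_sched_wakeup() added by the patch.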
Regards,
Christian.
>
>> in entity_kill_job_work(), the scheduler enters sleep
>> state, causing application B hang.
>
> So the issue is that if a1 never ran, the scheduler will not continue
> with the jobs of application B, correct?
>
>>
>> Add drm_sched_wakeup() when entity_kill_job_work() to preventing
>
> s/to preventing/to prevent
>
>> scheduler sleep and subsequent application hangs.
>>
>> v2:
>> - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
>
> Changelogs are cool and useful, but move them below the Signed-off area
> with another --- please, like so:
>
> Signed-off by: …
> ---
> v2:
> …
> ---
> drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>
>
>>
>> Signed-off-by: Lin.Cao <lincao12 at amd.com>
>
> This fixes a bug. Even a racy bug, it seems. So we need a Fixes tag and
> put the stable kernel in CC.
>
> Please figure out with git blame, git log and git tag --contains which
> commit your patch fixes and which is the oldest kernel that contains
> the bug. Then add a Fixes: tag and Cc: the stable kernel. You'll find
> lots of examples for that in git log.
>
>
> Thx
> P.
>
>> ---
>> drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index e671aa241720..66f2a43c58fd 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -177,6 +177,7 @@ static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
>>  	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
>>  
>>  	drm_sched_fence_scheduled(job->s_fence, NULL);
>> +	drm_sched_wakeup(job->sched);
>>  	drm_sched_fence_finished(job->s_fence, -ESRCH);
>>  	WARN_ON(job->s_fence->parent);
>>  	job->sched->ops->free_job(job);
>