[PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
Christian König
christian.koenig at amd.com
Mon Jul 14 13:08:13 UTC 2025
On 14.07.25 14:46, Philipp Stanner wrote:
> regarding the patch subject: the prefix we use for the scheduler is:
> drm/sched:
>
>
> On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
>
>> When Application A submits jobs (a1, a2, a3) and application B submits
>
> s/Application/application
>
>> job b1 with a dependency on a2's scheduler fence, killing application A
>> before run_job(a1) causes drm_sched_entity_kill_jobs_work() to force
>> signal all jobs sequentially. However, due to missing work_run_job or
>> work_free_job
>>
>
> You probably mean "due to the work items work_run_job and work_free_job
> not being started […]".
>
> However, the call you add, drm_sched_wakeup(), immediately only
> triggers work_run_job. It's not clear to me why you mention
> work_free_job, because drm_sched_entity_kill_jobs_work() directly
> invokes the free_job() callback.
Yeah, the description is rather confusing; it took me more than one try to understand what's going on here as well. Let me try to explain it in my own words:
When application A is killed, its pending submissions are not executed but rather torn down by drm_sched_entity_kill_jobs_work().
If a submission from application B now depends on something application A wanted to do on the same scheduler instance, drm_sched_entity_add_dependency_cb() has optimized this dependency, assuming that the scheduler work is already running and will try to pick up the next job.
But that isn't the case when the submissions from application A are all killed. So make sure to start the scheduler work from drm_sched_entity_kill_jobs_work() to handle that case.
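For reference, the shortcut in question looks roughly like this (abridged from drm_sched_entity_add_dependency_cb() in sched_entity.c and quoted from memory, so the exact checks may differ in your tree):

	s_fence = to_drm_sched_fence(fence);
	if (!fence->error && s_fence && s_fence->sched == sched) {
		/*
		 * The dependency comes from the same scheduler, so only
		 * wait for the job to be picked up (its scheduled fence)
		 * instead of the full hardware fence.
		 */
		fence = dma_fence_get(&s_fence->scheduled);
		dma_fence_put(entity->dependency);
		entity->dependency = fence;

		/*
		 * drm_sched_entity_clear_dep() only clears the dependency
		 * and does not call drm_sched_wakeup(), unlike
		 * drm_sched_entity_wakeup() which the cross-scheduler path
		 * uses. The assumption is that the scheduler work is running
		 * anyway to execute A's job and will pick up B's job
		 * afterwards.
		 */
		if (!dma_fence_add_callback(fence, &entity->cb,
					    drm_sched_entity_clear_dep))
			return true;

		/* Already scheduled, no callback needed. */
		dma_fence_put(fence);
		return false;
	}

When A's jobs are killed that assumption no longer holds: the scheduled fences are signaled from drm_sched_entity_kill_jobs_work() without the scheduler work ever running, so nothing picks up B's job until the drm_sched_wakeup() added by the patch.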
Regards,
Christian.
>
>> in entity_kill_job_work(), the scheduler enters sleep
>> state, causing application B hang.
>
> So the issue is that if a1 never ran, the scheduler will not continue
> with the jobs of application B, correct?
>
>>
>> Add drm_sched_wakeup() when entity_kill_job_work() to preventing
>
> s/to preventing/to prevent
>
>> scheduler sleep and subsequent application hangs.
>>
>> v2:
>> - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
>
> Changelogs are cool and useful, but move them below the Signed-off area
> with another --- please, like so:
>
> Signed-off by: …
> ---
> v2:
> …
> ---
> drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>
>
>>
>> Signed-off-by: Lin.Cao <lincao12 at amd.com>
>
> This fixes a bug. Even a racy bug, it seems. So we need a Fixes tag and
> put the stable kernel in CC.
>
> Please figure out with git blame, git log and git tag --contains which
> commit your patch fixes and which is the oldest kernel that contains
> the bug. Then add a Fixes: tag and Cc: the stable kernel. You'll find
> lots of examples for that in git log.
>
>
> Thx
> P.
>
>> ---
>> drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index e671aa241720..66f2a43c58fd 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -177,6 +177,7 @@ static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
>>  	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
>>  
>>  	drm_sched_fence_scheduled(job->s_fence, NULL);
>> +	drm_sched_wakeup(job->sched);
>>  	drm_sched_fence_finished(job->s_fence, -ESRCH);
>>  	WARN_ON(job->s_fence->parent);
>>  	job->sched->ops->free_job(job);
>