[PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
Christian König
christian.koenig at amd.com
Tue Jul 15 10:52:46 UTC 2025
On 15.07.25 12:27, Philipp Stanner wrote:
> On Tue, 2025-07-15 at 09:51 +0000, cao, lin wrote:
>>
>> Hi Philipp,
>>
>>
>> Thanks for your review. Let me try to clarify why I added drm_sched_wakeup() to drm_sched_entity_kill_jobs_work():
>>
>>
>> When application A submits jobs (a1, a2, a3) and application B submits job b1 with a dependency on a1's scheduled fence, the normal execution flow in drm_sched_run_job_work() is as follows (sketched in code after the list):
>> 1. a1 gets popped from the entity by the scheduler
>> 2. run_job(a1) executes
>> 3. a1's scheduled fence gets signaled
>> 4. drm_sched_run_job_work() calls drm_sched_run_job_queue() at the end
>> 5. The scheduler wakes up and re-selects entities to pop jobs
>> 6. Since b1's dependency is cleared, the scheduler can select b1 and continue the same flow
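>>
>> Condensed sketch of that flow, roughly following drm_sched_run_job_work() in drivers/gpu/drm/scheduler/sched_main.c (error handling, tracing and the idle/credit bookkeeping dropped, so treat the details as approximate):
>>
>> static void drm_sched_run_job_work(struct work_struct *w)
>> {
>>         struct drm_gpu_scheduler *sched =
>>                 container_of(w, struct drm_gpu_scheduler, work_run_job);
>>         struct drm_sched_entity *entity;
>>         struct drm_sched_job *sched_job;
>>         struct dma_fence *fence;
>>
>>         /* (1) pick a ready entity and pop its next job, e.g. a1 */
>>         entity = drm_sched_select_entity(sched);
>>         if (!entity)
>>                 return;
>>         sched_job = drm_sched_entity_pop_job(entity);
>>
>>         /* (2) hand the job to the backend */
>>         fence = sched->ops->run_job(sched_job);
>>
>>         /* (3) signal a1's scheduled fence, which clears b1's dependency */
>>         drm_sched_fence_scheduled(sched_job->s_fence, fence);
>>
>>         /* (4) requeue work_run_job so that (5)/(6) can select b1 */
>>         drm_sched_run_job_queue(sched);
>> }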
>>
>>
>> However, if application A is killed before a1 gets popped by the scheduler, then a1, a2, a3 are killed sequentially by drm_sched_entity_kill_jobs_cb(). During the kill process their scheduled fences are still signaled, but the kill path itself never queues work_run_job. This means b1's dependency gets cleared, yet there is no work_run_job to drive the scheduler; in other words, step 4 of the normal flow is missing. The scheduler enters its sleep state, which causes application B to hang.
>> So if we add drm_sched_wakeup() in drm_sched_entity_kill_jobs_work() after drm_sched_fence_scheduled(), the scheduler can be woken up, and application B can continue running after a1's scheduled fence is force signaled.
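>>
>> With the patch applied, drm_sched_entity_kill_jobs_work() would look like this (just the one-liner from the diff at the bottom of this thread, shown in context):
>>
>> static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
>> {
>>         struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
>>
>>         /* force-signal the scheduled fence, clearing b1's dependency */
>>         drm_sched_fence_scheduled(job->s_fence, NULL);
>>         /* kick the scheduler, replacing the missing step 4 */
>>         drm_sched_wakeup(job->sched);
>>         drm_sched_fence_finished(job->s_fence, -ESRCH);
>>         WARN_ON(job->s_fence->parent);
>>         job->sched->ops->free_job(job);
>> }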
>
> Thanks for the detailed explanation. Makes sense.
> Maybe you can detail in v3 that this "optimization" consists of the
> work item not being scheduled.
Yeah, removing this "optimization" would also be a valid solution to the problem.
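
Something like this, for example (untested sketch, just to illustrate the idea: let the same-scheduler fast path in drm_sched_entity_add_dependency_cb() register drm_sched_entity_wakeup() instead of drm_sched_entity_clear_dep(), so the scheduler work is always kicked when the dependency clears):

        fence = dma_fence_get(&s_fence->scheduled);
        dma_fence_put(entity->dependency);
        entity->dependency = fence;
        if (!dma_fence_add_callback(fence, &entity->cb,
                                    drm_sched_entity_wakeup))
                return true;
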
Christian.
> I think that was the piece of the puzzle
> I was missing.
>
> I / DRM tools will also include a link to this thread, so I think that
> will then be sufficient.
>
> Thx
> P.
>
>>
>> Thanks,
>> Lin
>>
>> From: Philipp Stanner <phasta at mailbox.org>
>> Sent: Tuesday, July 15, 2025 17:12
>> To: cao, lin <lin.cao at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; phasta at kernel.org <phasta at kernel.org>; dri-devel at lists.freedesktop.org <dri-devel at lists.freedesktop.org>
>> Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin at amd.com>; Deng, Emily <Emily.Deng at amd.com>; dakr at kernel.org <dakr at kernel.org>; matthew.brost at intel.com <matthew.brost at intel.com>
>> Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
>>
>>
>> On Tue, 2025-07-15 at 03:38 +0000, cao, lin wrote:
>>>
>>> Hi Christian, Philipp,
>>>
>>>
>>> I modified the commit msg, and I hope it makes more sense:
>>>
>>>
>>> drm/sched: wake up scheduler when killing jobs to prevent hang
>>
>> nit:
>> s/wake/Wake
>>
>>>
>>>
>>> When application A submits jobs (a1, a2, a3) and application B submits
>>> job b1 with a dependency on a2's scheduler fence, killing application A
>>> before run_job(a1) causes drm_sched_entity_kill_jobs_work() to force
>>> signal all jobs sequentially. However, the optimization in
>>> drm_sched_entity_add_dependency_cb() waits for the fence to be scheduled
>>> by adding a callback (drm_sched_entity_clear_dep) instead of immediately
>>> waking up the scheduler. When A is killed before its jobs can run, the
>>> callback gets triggered but drm_sched_entity_clear_dep() only clears the
>>> dependency without waking up the scheduler, causing the scheduler to enter
>>> sleep state and application B to hang.
>>
>> That now reads as if drm_sched_entity_clear_dep() is supposed to wake
>> up the scheduler. But you add the wakeup to a different function (the
>> work item).
>>
>> Also the code actually looks like that:
>>
>>
>>                 xa_erase(&job->dependencies, index);
>>                 if (f && !dma_fence_add_callback(f, &job->finish_cb,
>>                                                  drm_sched_entity_kill_jobs_cb))
>>                         return;
>>
>>                 dma_fence_put(f);
>>         }
>>
>>         INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
>>         schedule_work(&job->work);
>> }
>>
>> So if adding that callback succeeds we.. return immediately but we.. do
>> that for the first dependency, not for all of them?
>>
>> Oh boy. That code base. We(tm) should look into pimping that up. Also
>> lacks documentation everywhere.
>>
>>
>> Anyways. If this solves a bug for you, the patch looks fine (i.e., not
>> potentially harmful) by me, and if the tests succeed we can merge it.
>> However, I'd feel better if you could clarify more why that function is
>> the right place to solve the bug.
>>
>>
>> Thanks,
>> P.
>>
>>
>>>
>>>
>>> Add drm_sched_wakeup() in drm_sched_entity_kill_jobs_work() to prevent scheduler
>>> sleep and subsequent application hangs.
>>>
>>>
>>> v2:
>>> - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
>>>
>>>
>>> Thanks,
>>> Lin
>>>
>>>
>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>> Sent: Monday, July 14, 2025 21:39
>>> To: phasta at kernel.org <phasta at kernel.org>; cao, lin <lin.cao at amd.com>; dri-devel at lists.freedesktop.org <dri-devel at lists.freedesktop.org>
>>> Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin at amd.com>; Deng, Emily <Emily.Deng at amd.com>; dakr at kernel.org <dakr at kernel.org>; matthew.brost at intel.com <matthew.brost at intel.com>
>>> Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
>>>
>>> On 14.07.25 15:27, Philipp Stanner wrote:
>>>> On Mon, 2025-07-14 at 15:08 +0200, Christian König wrote:
>>>>>
>>>>>
>>>>> On 14.07.25 14:46, Philipp Stanner wrote:
>>>>>> regarding the patch subject: the prefix we use for the scheduler
>>>>>> is:
>>>>>> drm/sched:
>>>>>>
>>>>>>
>>>>>> On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
>>>>>>
>>>>>>> When Application A submits jobs (a1, a2, a3) and application B
>>>>>>> submits
>>>>>>
>>>>>> s/Application/application
>>>>>>
>>>>>>> job b1 with a dependency on a2's scheduler fence, killing
>>>>>>> application A
>>>>>>> before run_job(a1) causes drm_sched_entity_kill_jobs_work() to
>>>>>>> force
>>>>>>> signal all jobs sequentially. However, due to missing
>>>>>>> work_run_job or
>>>>>>> work_free_job
>>>>>>>
>>>>>>
>>>>>> You probably mean "due to the work items work_run_job and
>>>>>> work_free_job
>>>>>> not being started […]".
>>>>>>
>>>>>> However, the call you add, drm_sched_wakeup(), immediately only
>>>>>> triggers work_run_job. It's not clear to me why you mention
>>>>>> work_free_job, because drm_sched_entity_kill_jobs_work() directly
>>>>>> invokes the free_job() callback.
>>>>>
>>>>> Yeah the description is rather confusing, took me more than one try
>>>>> to understand what's going on here as well. Let me try to explain it
>>>>> in my words:
>>>>>
>>>>> When an application A is killed the pending submissions from it are
>>>>> not executed, but rather torn down by
>>>>> drm_sched_entity_kill_jobs_work().
>>>>>
>>>>> If now a submission from application B depends on something
>>>>> application A wanted to do on the same scheduler instance, the
>>>>> function drm_sched_entity_add_dependency_cb() would have optimized
>>>>> this dependency and assumed that the scheduler work was already
>>>>> running and would try to pick a job.
>>>>>
>>>>> But that isn't the case when the submissions from application A are
>>>>> all killed. So make sure to start the scheduler work from
>>>>> drm_sched_entity_kill_jobs_work() to handle that case.
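>>>>>
>>>>> For reference, the fast path in question looks roughly like this
>>>>> (condensed from drm_sched_entity_add_dependency_cb() and
>>>>> drm_sched_entity_clear_dep() in sched_entity.c, so details may differ):
>>>>>
>>>>>         /* Fence is from the same scheduler, only wait for it
>>>>>          * to be scheduled */
>>>>>         fence = dma_fence_get(&s_fence->scheduled);
>>>>>         dma_fence_put(entity->dependency);
>>>>>         entity->dependency = fence;
>>>>>         if (!dma_fence_add_callback(fence, &entity->cb,
>>>>>                                     drm_sched_entity_clear_dep))
>>>>>                 return true;
>>>>>
>>>>> static void drm_sched_entity_clear_dep(struct dma_fence *f,
>>>>>                                        struct dma_fence_cb *cb)
>>>>> {
>>>>>         struct drm_sched_entity *entity =
>>>>>                 container_of(cb, struct drm_sched_entity, cb);
>>>>>
>>>>>         /* clears the dependency, but never kicks the scheduler work */
>>>>>         entity->dependency = NULL;
>>>>>         dma_fence_put(f);
>>>>> }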
>>>>
>>>> Alright, so the bug then depends on B setting a dependency on A _after_
>>>> A was killed, doesn't it? Because the optimization in
>>>> _add_dependency_cb() can only have an effect after A's jobs have been
>>>> torn down.
>>>
>>> No, application A and application B submit right behind each other. Application A is then killed later on.
>>>
>>> The optimization in _add_dependency_cb() just waits for A's submission to be handled by the scheduler, but that never happens because it was killed.
>>>
>>>> If there is such a timing order problem, the commit message should
>>>> mention it.
>>>>
>>>> The commit message definitely also needs to mention this optimization
>>>> in drm_sched_entity_add_dependency_cb() since it's essential for
>>>> understanding the bug.
>>>
>>> +1
>>>
>>> Christian.
>>>
>>>>
>>>>
>>>> Danke
>>>> P.
>>>>
>>>>
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>>> in entity_kill_job_work(), the scheduler enters sleep
>>>>>>> state, causing application B hang.
>>>>>>
>>>>>> So the issue is that if a1 never ran, the scheduler will not
>>>>>> continue
>>>>>> with the jobs of application B, correct?
>>>>>>
>>>>>>>
>>>>>>> Add drm_sched_wakeup() when entity_kill_job_work() to preventing
>>>>>>
>>>>>> s/to preventing/to prevent
>>>>>>
>>>>>>> scheduler sleep and subsequent application hangs.
>>>>>>>
>>>>>>> v2:
>>>>>>> - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
>>>>>>
>>>>>> Changelogs are cool and useful, but move them below the Signed-off
>>>>>> area
>>>>>> with another --- please, like so:
>>>>>>
>>>>>> Signed-off-by: …
>>>>>> ---
>>>>>> v2:
>>>>>> …
>>>>>> ---
>>>>>> drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Signed-off-by: Lin.Cao <lincao12 at amd.com>
>>>>>>
>>>>>> This fixes a bug. Even a racy bug, it seems. So we need a Fixes tag
>>>>>> and to put the stable kernel in CC.
>>>>>>
>>>>>> Please figure out with git blame, git log and git tag --contains which
>>>>>> commit your patch fixes and which is the oldest kernel that contains
>>>>>> the bug. Then add a Fixes: tag and Cc: the stable kernel. You'll find
>>>>>> lots of examples for that in git log.
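>>>>>>
>>>>>> For example (the hash here is only a placeholder, not the real
>>>>>> offending commit):
>>>>>>
>>>>>> Fixes: 123456789abc ("drm/sched: Commit subject of the culprit")
>>>>>> Cc: stable at vger.kernel.org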
>>>>>>
>>>>>>
>>>>>> Thx
>>>>>> P.
>>>>>>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>>>>>>> 1 file changed, 1 insertion(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>> index e671aa241720..66f2a43c58fd 100644
>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>> @@ -177,6 +177,7 @@ static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
>>>>>>>          struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
>>>>>>>
>>>>>>>          drm_sched_fence_scheduled(job->s_fence, NULL);
>>>>>>> +        drm_sched_wakeup(job->sched);
>>>>>>>          drm_sched_fence_finished(job->s_fence, -ESRCH);
>>>>>>>          WARN_ON(job->s_fence->parent);
>>>>>>>          job->sched->ops->free_job(job);
>>>>>>
>>>>>
>>>>
>>>
>>
>