<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<p style="font-family:Calibri;font-size:10pt;color:#0000FF;margin:5pt;font-style:normal;font-weight:normal;text-decoration:none;" align="Left">
[AMD Official Use Only - AMD Internal Distribution Only]<br>
</p>
<br>
<div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
Hi Christian, Philipp,</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
I modified the commit msg, and I hope it makes more sense:</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
drm/sched: wake up scheduler when killing jobs to prevent hang</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
When application A submits jobs (a1, a2, a3) and application B submits</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
job b1 with a dependency on a2's scheduler fence, killing application A</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
before run_job(a1) causes drm_sched_entity_kill_jobs_work() to force</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
signal all jobs sequentially. However, the optimization in</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
drm_sched_entity_add_dependency_cb() waits for the fence to be scheduled</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
by adding a callback (drm_sched_entity_clear_dep) instead of immediately</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
waking up the scheduler. When A is killed before its jobs can run, the</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
callback gets triggered but drm_sched_entity_clear_dep() only clears the</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
dependency without waking up the scheduler, causing the scheduler to enter</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
sleep state and application B to hang.</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
Add drm_sched_wakeup() in entity_kill_job_work() to prevent scheduler</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
sleep and subsequent application hangs.</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
v2:</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
- Move drm_sched_wakeup() to after drm_sched_fence_scheduled()</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
Thanks,</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
Lin</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Koenig, Christian <Christian.Koenig@amd.com><br>
<b>Sent:</b> Monday, July 14, 2025 21:39<br>
<b>To:</b> phasta@kernel.org <phasta@kernel.org>; cao, lin <lin.cao@amd.com>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org><br>
<b>Cc:</b> Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; Deng, Emily <Emily.Deng@amd.com>; dakr@kernel.org <dakr@kernel.org>; matthew.brost@intel.com <matthew.brost@intel.com><br>
<b>Subject:</b> Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">
<div class="PlainText">On 14.07.25 15:27, Philipp Stanner wrote:<br>
> On Mon, 2025-07-14 at 15:08 +0200, Christian König wrote:<br>
>><br>
>><br>
>> On 14.07.25 14:46, Philipp Stanner wrote:<br>
>>> regarding the patch subject: the prefix we use for the scheduler<br>
>>> is:<br>
>>> drm/sched:<br>
>>><br>
>>><br>
>>> On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:<br>
>>><br>
>>>> When Application A submits jobs (a1, a2, a3) and application B<br>
>>>> submits<br>
>>><br>
>>> s/Application/application<br>
>>><br>
>>>> job b1 with a dependency on a2's scheduler fence, killing<br>
>>>> application A<br>
>>>> before run_job(a1) causes drm_sched_entity_kill_jobs_work() to<br>
>>>> force<br>
>>>> signal all jobs sequentially. However, due to missing<br>
>>>> work_run_job or<br>
>>>> work_free_job<br>
>>>><br>
>>><br>
>>> You probably mean "due to the work items work_run_job and<br>
>>> work_free_job<br>
>>> not being started […]".<br>
>>><br>
>>> However, the call you add, drm_sched_wakeup(), immediately only<br>
>>> triggers work_run_job. It's not clear to me why you mention<br>
>>> work_free_job, because drm_sched_entity_kill_jobs_work() directly<br>
>>> invokes the free_job() callback.<br>
>><br>
>> Yeah the description is rather confusing, took me more than one try<br>
>> to understand what's going on here as well. Let me try to explain it<br>
>> in my words:<br>
>><br>
>> When an application A is killed the pending submissions from it are<br>
>> not executed, but rather torn down by<br>
>> drm_sched_entity_kill_jobs_work().<br>
>><br>
>> If now a submission from application B depends on something<br>
>> application A wanted to do on the same scheduler instance the<br>
>> function drm_sched_entity_add_dependency_cb() would have optimized<br>
>> this dependency and assumed that the scheduler work would already run<br>
>> and try to pick a job.<br>
>><br>
>> But that isn't the case when the submissions from application A are<br>
>> all killed. So make sure to start the scheduler work from<br>
>> drm_sched_entity_kill_jobs_work() to handle that case.<br>
> <br>
> Alright, so the bug then depends on B setting a dependency on A _after_<br>
> A was killed, doesn't it? Because the optimization in<br>
> _add_dependency_cb() can only have an effect after A's jobs have been<br>
> torn down.<br>
<br>
No, application A and application B submit right behind each other. Application A is then killed later on.<br>
<br>
The optimization in _add_dependency_cb() just waits for As submission to be handled by the scheduler, but that never happens because it was killed.<br>
<br>
> If there is such a timing order problem, the commit message should<br>
> mention it.<br>
> <br>
> The commit message definitely also needs to mention this optimization<br>
> in drm_sched_entity_add_dependency_cb() since it's essential for<br>
> understanding the bug.<br>
<br>
+1<br>
<br>
Christian.<br>
<br>
> <br>
> <br>
> Danke<br>
> P.<br>
> <br>
> <br>
>><br>
>> Regards,<br>
>> Christian.<br>
>><br>
>>><br>
>>>> in entity_kill_job_work(), the scheduler enters sleep<br>
>>>> state, causing application B hang.<br>
>>><br>
>>> So the issue is that if a1 never ran, the scheduler will not<br>
>>> continue<br>
>>> with the jobs of application B, correct?<br>
>>><br>
>>>><br>
>>>> Add drm_sched_wakeup() when entity_kill_job_work() to preventing<br>
>>><br>
>>> s/to preventing/to prevent<br>
>>><br>
>>>> scheduler sleep and subsequent application hangs.<br>
>>>><br>
>>>> v2:<br>
>>>> - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()<br>
>>><br>
>>> Changelogs are cool and useful, but move them below the Signed-off<br>
>>> area<br>
>>> with another --- please, like so:<br>
>>><br>
>>> Signed-off by: …<br>
>>> ---<br>
>>> v2:<br>
>>> …<br>
>>> ---<br>
>>> drivers/gpu/drm/scheduler/sched_entity.c | 1 +<br>
>>><br>
>>><br>
>>>><br>
>>>> Signed-off-by: Lin.Cao <lincao12@amd.com><br>
>>><br>
>>> This fixes a bug. Even a racy bug, it seems. So we need a Fixes tag<br>
>>> and<br>
>>> put the stable kernel in CC.<br>
>>><br>
>>> Please figure out with git blame, git log and git tag --contains<br>
>>> which<br>
>>> commit your patch fixes and which is the oldest kernel that<br>
>>> contains<br>
>>> the bug. Then add a Fixes: tag and Cc: the stable kernel. You'll<br>
>>> find<br>
>>> lots of examples for that in git log.<br>
>>><br>
>>><br>
>>> Thx<br>
>>> P.<br>
>>><br>
>>>> ---<br>
>>>> drivers/gpu/drm/scheduler/sched_entity.c | 1 +<br>
>>>> 1 file changed, 1 insertion(+)<br>
>>>><br>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c<br>
>>>> b/drivers/gpu/drm/scheduler/sched_entity.c<br>
>>>> index e671aa241720..66f2a43c58fd 100644<br>
>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c<br>
>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c<br>
>>>> @@ -177,6 +177,7 @@ static void<br>
>>>> drm_sched_entity_kill_jobs_work(struct work_struct *wrk)<br>
>>>> struct drm_sched_job *job = container_of(wrk,<br>
>>>> typeof(*job), work);<br>
>>>> <br>
>>>> drm_sched_fence_scheduled(job->s_fence, NULL);<br>
>>>> + drm_sched_wakeup(job->sched);<br>
>>>> drm_sched_fence_finished(job->s_fence, -ESRCH);<br>
>>>> WARN_ON(job->s_fence->parent);<br>
>>>> job->sched->ops->free_job(job);<br>
>>><br>
>><br>
> <br>
<br>
</div>
</span></font></div>
</div>
</body>
</html>