<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<p style="font-family:Calibri;font-size:10pt;color:#0000FF;margin:5pt;font-style:normal;font-weight:normal;text-decoration:none;" align="Left">
[AMD Official Use Only - AMD Internal Distribution Only]<br>
</p>
<br>
<div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
Hi Philipp,</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
Thanks for your review, let me try to clarify why I added drm_sched_wakeup() to drm_sched_entity_kill_jobs_work():</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
When application A submits jobs (a1, a2, a3) and application B submits job b1 with a dependency on a1's scheduled fence, the normal execution flow is (function drm_sched_run_job_work()):</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
1. a1 gets popped from the entity by the scheduler</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
2. run_job(a1) executes</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
3. a1's scheduled fence gets signaled </div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
4. drm_sched_run_job_work() calls drm_sched_run_job_queue() at the end</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
5. The scheduler wakes up and re-selects entities to pop jobs</div>
<div style="margin-left: 40px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
6. Since b1's dependency is cleared, the scheduler can select b1 and continue the same flow</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
However, if application A is killed before a1 gets popped by the scheduler, then a1, a2, a3 are killed sequentially by drm_sched_entity_kill_jobs_cb(). During the kill process, their scheduled fences are still signaled, but the kill process itself lacks work_run_job.
This means b1's dependency gets cleared, but there's no work_run_job to drive the scheduler to continue running, or we can say
<b>step 4 in the normal flow is missing</b>, and the scheduler enters sleep state, which causes application B to hang.</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
So if we add drm_sched_wakeup() in drm_sched_entity_kill_jobs_work() after drm_sched_fence_scheduled(), the scheduler can be woken up, and application B can continue running after a1's scheduled fence is force signaled.</div>
<div id="appendonsend"></div>
<div><br>
</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
Thanks,</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">
Lin</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<hr style="display: inline-block; width: 98%;">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<b>From:</b> Philipp Stanner <phasta@mailbox.org><br>
<b>Sent:</b> Tuesday, July 15, 2025 17:12<br>
<b>To:</b> cao, lin <lin.cao@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; phasta@kernel.org <phasta@kernel.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org><br>
<b>Cc:</b> Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; Deng, Emily <Emily.Deng@amd.com>; dakr@kernel.org <dakr@kernel.org>; matthew.brost@intel.com <matthew.brost@intel.com><br>
<b>Subject:</b> Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-size: 11pt;">On Tue, 2025-07-15 at 03:38 +0000, cao, lin wrote:<br>
><br>
> [AMD Official Use Only - AMD Internal Distribution Only]<br>
><br>
><br>
><br>
> Hi Christian, Philipp,<br>
><br>
><br>
> I modified the commit msg, and I hope it makes more sense:<br>
><br>
><br>
> drm/sched: wake up scheduler when killing jobs to prevent hang<br>
<br>
nit:<br>
s/wake/Wake<br>
<br>
><br>
><br>
> When application A submits jobs (a1, a2, a3) and application B submits<br>
> job b1 with a dependency on a2's scheduler fence, killing application A<br>
> before run_job(a1) causes drm_sched_entity_kill_jobs_work() to force<br>
> signal all jobs sequentially. However, the optimization in<br>
> drm_sched_entity_add_dependency_cb() waits for the fence to be scheduled<br>
> by adding a callback (drm_sched_entity_clear_dep) instead of immediately<br>
> waking up the scheduler. When A is killed before its jobs can run, the<br>
> callback gets triggered but drm_sched_entity_clear_dep() only clears the<br>
> dependency without waking up the scheduler, causing the scheduler to enter<br>
> sleep state and application B to hang.<br>
<br>
That now reads as if drm_sched_entity_clear_dep() is supposed to wake<br>
up the scheduler. But you add the wakeup to a different function (the<br>
work item).<br>
<br>
Also the code actually looks like that:<br>
<br>
<br>
xa_erase(&job->dependencies, index);<br>
if (f && !dma_fence_add_callback(f, &job->finish_cb,<br>
drm_sched_entity_kill_jobs_cb))<br>
return;<br>
<br>
dma_fence_put(f);<br>
}<br>
<br>
INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);<br>
schedule_work(&job->work);<br>
}<br>
<br>
So if adding that callback succeeds we.. return immediately but we.. do<br>
that for the first dependency, not for all of them?<br>
<br>
Oh boy. That code base. We(tm) should look into pimping that up. Also<br>
lacks documentation everywhere.<br>
<br>
<br>
Anyways. If this solves a bug for you the patch looks fine (i.e., not<br>
potentially harmful) by me and if the tests succeed we can merge it.<br>
However, I'd feel better if you could clarify more why that function is<br>
the right place to solve the bug.<br>
<br>
<br>
Thanks,<br>
P.<br>
<br>
<br>
><br>
><br>
> Add drm_sched_wakeup() in entity_kill_job_work() to prevent scheduler<br>
> sleep and subsequent application hangs.<br>
><br>
><br>
> v2:<br>
> - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()<br>
><br>
><br>
> Thanks,<br>
> Lin<br>
><br>
><br>
> From: Koenig, Christian <Christian.Koenig@amd.com><br>
> Sent: Monday, July 14, 2025 21:39<br>
> To: phasta@kernel.org <phasta@kernel.org>; cao, lin <lin.cao@amd.com>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org><br>
> Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; Deng, Emily <Emily.Deng@amd.com>; dakr@kernel.org <dakr@kernel.org>; matthew.brost@intel.com <matthew.brost@intel.com><br>
> Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs<br>
><br>
> <br>
><br>
><br>
> On 14.07.25 15:27, Philipp Stanner wrote:<br>
> > On Mon, 2025-07-14 at 15:08 +0200, Christian König wrote:<br>
> > ><br>
> > ><br>
> > > On 14.07.25 14:46, Philipp Stanner wrote:<br>
> > > > regarding the patch subject: the prefix we use for the scheduler<br>
> > > > is:<br>
> > > > drm/sched:<br>
> > > ><br>
> > > ><br>
> > > > On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:<br>
> > > ><br>
> > > > > When Application A submits jobs (a1, a2, a3) and application B<br>
> > > > > submits<br>
> > > ><br>
> > > > s/Application/application<br>
> > > ><br>
> > > > > job b1 with a dependency on a2's scheduler fence, killing<br>
> > > > > application A<br>
> > > > > before run_job(a1) causes drm_sched_entity_kill_jobs_work() to<br>
> > > > > force<br>
> > > > > signal all jobs sequentially. However, due to missing<br>
> > > > > work_run_job or<br>
> > > > > work_free_job<br>
> > > > ><br>
> > > ><br>
> > > > You probably mean "due to the work items work_run_job and<br>
> > > > work_free_job<br>
> > > > not being started […]".<br>
> > > ><br>
> > > > However, the call you add, drm_sched_wakeup(), immediately only<br>
> > > > triggers work_run_job. It's not clear to me why you mention<br>
> > > > work_free_job, because drm_sched_entity_kill_jobs_work() directly<br>
> > > > invokes the free_job() callback.<br>
> > ><br>
> > > Yeah the description is rather confusing, took me more than one try<br>
> > > to understand what's going on here as well. Let me try to explain it<br>
> > > in my words:<br>
> > ><br>
> > > When an application A is killed the pending submissions from it are<br>
> > > not executed, but rather torn down by<br>
> > > drm_sched_entity_kill_jobs_work().<br>
> > ><br>
> > > If now a submission from application B depends on something<br>
> > > application A wanted to do on the same scheduler instance the<br>
> > > function drm_sched_entity_add_dependency_cb() would have optimized<br>
> > > this dependency and assumed that the scheduler work would already run<br>
> > > and try to pick a job.<br>
> > ><br>
> > > But that isn't the case when the submissions from application A are<br>
> > > all killed. So make sure to start the scheduler work from<br>
> > > drm_sched_entity_kill_jobs_work() to handle that case.<br>
> ><br>
> > Alright, so the bug then depends on B setting a dependency on A _after_<br>
> > A was killed, doesn't it? Because the optimization in<br>
> > _add_dependency_cb() can only have an effect after A's jobs have been<br>
> > torn down.<br>
><br>
> No, application A and application B submit right behind each other. Application A is then killed later on.<br>
><br>
> The optimization in _add_dependency_cb() just waits for As submission to be handled by the scheduler, but that never happens because it was killed.<br>
><br>
> > If there is such a timing order problem, the commit message should<br>
> > mention it.<br>
> ><br>
> > The commit message definitely also needs to mention this optimization<br>
> > in drm_sched_entity_add_dependency_cb() since it's essential for<br>
> > understanding the bug.<br>
><br>
> +1<br>
><br>
> Christian.<br>
><br>
> ><br>
> ><br>
> > Danke<br>
> > P.<br>
> ><br>
> ><br>
> > ><br>
> > > Regards,<br>
> > > Christian.<br>
> > ><br>
> > > ><br>
> > > > > in entity_kill_job_work(), the scheduler enters sleep<br>
> > > > > state, causing application B hang.<br>
> > > ><br>
> > > > So the issue is that if a1 never ran, the scheduler will not<br>
> > > > continue<br>
> > > > with the jobs of application B, correct?<br>
> > > ><br>
> > > > ><br>
> > > > > Add drm_sched_wakeup() when entity_kill_job_work() to preventing<br>
> > > ><br>
> > > > s/to preventing/to prevent<br>
> > > ><br>
> > > > > scheduler sleep and subsequent application hangs.<br>
> > > > ><br>
> > > > > v2:<br>
> > > > > - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()<br>
> > > ><br>
> > > > Changelogs are cool and useful, but move them below the Signed-off<br>
> > > > area<br>
> > > > with another --- please, like so:<br>
> > > ><br>
> > > > Signed-off by: …<br>
> > > > ---<br>
> > > > v2:<br>
> > > > …<br>
> > > > ---<br>
> > > > drivers/gpu/drm/scheduler/sched_entity.c | 1 +<br>
> > > ><br>
> > > ><br>
> > > > ><br>
> > > > > Signed-off-by: Lin.Cao <lincao12@amd.com><br>
> > > ><br>
> > > > This fixes a bug. Even a racy bug, it seems. So we need a Fixes tag<br>
> > > > and<br>
> > > > put the stable kernel in CC.<br>
> > > ><br>
> > > > Please figure out with git blame, git log and git tag --contains<br>
> > > > which<br>
> > > > commit your patch fixes and which is the oldest kernel that<br>
> > > > contains<br>
> > > > the bug. Then add a Fixes: tag and Cc: the stable kernel. You'll<br>
> > > > find<br>
> > > > lots of examples for that in git log.<br>
> > > ><br>
> > > ><br>
> > > > Thx<br>
> > > > P.<br>
> > > ><br>
> > > > > ---<br>
> > > > > drivers/gpu/drm/scheduler/sched_entity.c | 1 +<br>
> > > > > 1 file changed, 1 insertion(+)<br>
> > > > ><br>
> > > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c<br>
> > > > > b/drivers/gpu/drm/scheduler/sched_entity.c<br>
> > > > > index e671aa241720..66f2a43c58fd 100644<br>
> > > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c<br>
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c<br>
> > > > > @@ -177,6 +177,7 @@ static void<br>
> > > > > drm_sched_entity_kill_jobs_work(struct work_struct *wrk)<br>
> > > > > struct drm_sched_job *job = container_of(wrk,<br>
> > > > > typeof(*job), work);<br>
> > > > > <br>
> > > > > drm_sched_fence_scheduled(job->s_fence, NULL);<br>
> > > > > + drm_sched_wakeup(job->sched);<br>
> > > > > drm_sched_fence_finished(job->s_fence, -ESRCH);<br>
> > > > > WARN_ON(job->s_fence->parent);<br>
> > > > > job->sched->ops->free_job(job);<br>
> > > ><br>
> > ><br>
> ><br>
><br>
<br>
</div>
</div>
</body>
</html>