[PATCH] drm/sched: Remove optimization that causes hang when killing dependent jobs

Philipp Stanner phasta at mailbox.org
Wed Jul 16 09:57:53 UTC 2025


On Wed, 2025-07-16 at 09:43 +0000, cao, lin wrote:
> 
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> 
> 
> Hi Philipp,
> 
> 
> Thank you for the review. I found that this optimization was
> introduced 9 years ago in commit
> 777dbd458c89d4ca74a659f85ffb5bc817f29a35 ("drm/amdgpu: drop a dummy
> wakeup scheduler").
> 
> 
> Given that the codebase has undergone significant changes over these
> 9 years. May I ask if I still need to include the Fixes: tag?

Yes. It's a helpful marker to see where the problem comes from, and it
adds redundancy helping the stable-kernel maintainers in figuring out
to which kernels to backport it to.

If stable can't apply a patch to a very old stable kernel because the
code base changed too much, they'll ping us and we might provide a
dedicated fix.

So like that:

Cc: stable at vger.kernel.org # v4.6+
Fixes: 777dbd458c89 ("drm/amdgpu: drop a dummy wakeup scheduler")


P.

> 
> 
> Thanks,
> Lin
> 
> 
> From: Philipp Stanner <phasta at mailbox.org>
> Sent: Wednesday, July 16, 2025 16:33
> To: cao, lin <lin.cao at amd.com>; dri-devel at lists.freedesktop.org
> <dri-devel at lists.freedesktop.org>
> Cc: Yin, ZhenGuo (Chris) <ZhenGuo.Yin at amd.com>; Deng, Emily
> <Emily.Deng at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>;
> phasta at kernel.org <phasta at kernel.org>; dakr at kernel.org
> <dakr at kernel.org>; matthew.brost at intel.com <matthew.brost at intel.com>
> Subject: Re: [PATCH] drm/sched: Remove optimization that causes hang
> when killing dependent jobs
> 
>  
> 
> 
> On Tue, 2025-07-15 at 21:50 +0800, Lin.Cao wrote:
> > When application A submits jobs and application B submits a job
> > with
> > a
> > dependency on A's fence, the normal flow wakes up the scheduler
> > after
> > processing each job. However, the optimization in
> > drm_sched_entity_add_dependency_cb() uses a callback that only
> > clears
> > dependencies without waking up the scheduler.
> > 
> > When application A is killed before its jobs can run, the callback
> > gets
> > triggered but only clears the dependency without waking up the
> > scheduler,
> > causing the scheduler to enter sleep state and application B to
> > hang.
> > 
> > Remove the optimization by deleting drm_sched_entity_clear_dep()
> > and
> > its
> > usage, ensuring the scheduler is always woken up when dependencies
> > are
> > cleared.
> > 
> > Signed-off-by: Lin.Cao <lincao12 at amd.com>
> 
> This is, still, a bug fix, so it needs Fixes: and Cc: stable :)
> 
> Could also include a Suggested-by: Christian
> 
> P.
> 
> > ---
> >  drivers/gpu/drm/scheduler/sched_entity.c | 21 ++------------------
> > -
> >  1 file changed, 2 insertions(+), 19 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> > b/drivers/gpu/drm/scheduler/sched_entity.c
> > index e671aa241720..ac678de7fe5e 100644
> > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > @@ -355,17 +355,6 @@ void drm_sched_entity_destroy(struct
> > drm_sched_entity *entity)
> >  }
> >  EXPORT_SYMBOL(drm_sched_entity_destroy);
> >  
> > -/* drm_sched_entity_clear_dep - callback to clear the entities
> > dependency */
> > -static void drm_sched_entity_clear_dep(struct dma_fence *f,
> > -                                    struct dma_fence_cb *cb)
> > -{
> > -     struct drm_sched_entity *entity =
> > -             container_of(cb, struct drm_sched_entity, cb);
> > -
> > -     entity->dependency = NULL;
> > -     dma_fence_put(f);
> > -}
> > -
> >  /*
> >   * drm_sched_entity_wakeup - callback to clear the entity's
> > dependency and
> >   * wake up the scheduler
> > @@ -376,7 +365,8 @@ static void drm_sched_entity_wakeup(struct
> > dma_fence *f,
> >       struct drm_sched_entity *entity =
> >               container_of(cb, struct drm_sched_entity, cb);
> >  
> > -     drm_sched_entity_clear_dep(f, cb);
> > +     entity->dependency = NULL;
> > +     dma_fence_put(f);
> >       drm_sched_wakeup(entity->rq->sched);
> >  }
> >  
> > @@ -429,13 +419,6 @@ static bool
> > drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
> >               fence = dma_fence_get(&s_fence->scheduled);
> >               dma_fence_put(entity->dependency);
> >               entity->dependency = fence;
> > -             if (!dma_fence_add_callback(fence, &entity->cb,
> > -                                        
> > drm_sched_entity_clear_dep))
> > -                     return true;
> > -
> > -             /* Ignore it when it is already scheduled */
> > -             dma_fence_put(fence);
> > -             return false;
> >       }
> >  
> >       if (!dma_fence_add_callback(entity->dependency, &entity->cb,
> 



More information about the dri-devel mailing list