[RFC PATCH] drm/sched: Wait for the currently popped dependency in kill_jobs_cb()

Boris Brezillon boris.brezillon at collabora.com
Fri Jun 9 14:10:06 UTC 2023


Hello Christian,

On Fri, 9 Jun 2023 13:53:59 +0200
Christian König <christian.koenig at amd.com> wrote:

> Am 08.06.23 um 08:55 schrieb Boris Brezillon:
> > If I understand correctly, drm_sched_entity_kill_jobs_cb() is supposed
> > to wait on all the external dependencies (those added to
> > drm_sched_job::dependencies) before signaling the job finished fence.
> > This is done this way to prevent jobs depending on these canceled jobs
> > from considering the resources they want to access as ready, when
> > they're actually still used by other components, thus leading to
> > potential data corruptions.
> >
> > The problem is, the kill_jobs logic is omitting the last fence popped
> > from the dependencies array that was waited upon before
> > drm_sched_entity_kill() was called (drm_sched_entity::dependency field),
> > so we're basically waiting for all dependencies except one.
> >
> > This is an attempt at fixing that, but also an opportunity to make sure
> > I understood the drm_sched_entity_kill(), because I'm not 100% sure if
> > skipping this currently popped dependency was intentional or not. I can't
> > see a good reason why we'd want to skip that one, but I might be missing
> > something.
> >
> > Signed-off-by: Boris Brezillon <boris.brezillon at collabora.com>
> > Cc: Frank Binns <frank.binns at imgtec.com>
> > Cc: Sarah Walker <sarah.walker at imgtec.com>
> > Cc: Donald Robson <donald.robson at imgtec.com>
> > Cc: Luben Tuikov <luben.tuikov at amd.com>
> > Cc: David Airlie <airlied at gmail.com>
> > Cc: Daniel Vetter <daniel at ffwll.ch>
> > Cc: Sumit Semwal <sumit.semwal at linaro.org>
> > Cc: "Christian König" <christian.koenig at amd.com>
> > ---
> > Stumbled across this issue while working on native dependency support
> > with Donald (which will be posted soon). Flagged as RFC, because I'm
> > not sure this is legit, and also not sure we want to fix it this way.
> > I tried re-using drm_sched_entity::dependency, but it's a bit of a mess
> > because of the asynchronousity of the wait, and the fact we use
> > drm_sched_entity::dependency to know if we have a clear_dep()
> > callback registered, so we can easily reset it without removing the
> > callback.  
> 
> Well yes, that's a known problem. But this is really not the right 
> approach to fixing this.
> 
> Trying to wait for all the dependencies before killing jobs was added 
> because of the way we kept track of dma_fences in dma_resv objects. 
> Basically adding exclusive fences removed all other fences leading to a 
> bit fragile memory management.

Okay.

> 
> This handling was removed by now and so the workaround for waiting for 
> dependencies is not really necessary any more, but I think it is still 
> better to do so.
> 
> The right approach of getting this waiting for dependencies completely 
> straight is also not to touch entity->dependency in any way, but to stop 
> removing them from the XA in drm_sched_job_dependency(). Otherwise you 
> don't catch the pipeline optimized ones either.

Do you want me to post a v2 doing that, or should I forget about it?
If we decide to keep things like that, it might be good to at least
add a comment explaining why we don't care.

Regards,

Boris


More information about the dri-devel mailing list