[RFC PATCH] drm/sched: Wait for the currently popped dependency in kill_jobs_cb()

Christian König christian.koenig at amd.com
Fri Jun 9 14:29:35 UTC 2023


Am 09.06.23 um 16:10 schrieb Boris Brezillon:
> Hello Christian,
>
> On Fri, 9 Jun 2023 13:53:59 +0200
> Christian König <christian.koenig at amd.com> wrote:
>
>> Am 08.06.23 um 08:55 schrieb Boris Brezillon:
>>> If I understand correctly, drm_sched_entity_kill_jobs_cb() is supposed
>>> to wait on all the external dependencies (those added to
>>> drm_sched_job::dependencies) before signaling the job finished fence.
>>> This is done this way to prevent jobs depending on these canceled jobs
>>> from considering the resources they want to access as ready, when
>>> they're actually still used by other components, thus leading to
>>> potential data corruptions.
>>>
>>> The problem is, the kill_jobs logic is omitting the last fence popped
>>> from the dependencies array that was waited upon before
>>> drm_sched_entity_kill() was called (drm_sched_entity::dependency field),
>>> so we're basically waiting for all dependencies except one.
>>>
>>> This is an attempt at fixing that, but also an opportunity to make sure
>>> I understood the drm_sched_entity_kill(), because I'm not 100% sure if
>>> skipping this currently popped dependency was intentional or not. I can't
>>> see a good reason why we'd want to skip that one, but I might be missing
>>> something.
>>>
>>> Signed-off-by: Boris Brezillon <boris.brezillon at collabora.com>
>>> Cc: Frank Binns <frank.binns at imgtec.com>
>>> Cc: Sarah Walker <sarah.walker at imgtec.com>
>>> Cc: Donald Robson <donald.robson at imgtec.com>
>>> Cc: Luben Tuikov <luben.tuikov at amd.com>
>>> Cc: David Airlie <airlied at gmail.com>
>>> Cc: Daniel Vetter <daniel at ffwll.ch>
>>> Cc: Sumit Semwal <sumit.semwal at linaro.org>
>>> Cc: "Christian König" <christian.koenig at amd.com>
>>> ---
>>> Stumbled across this issue while working on native dependency support
>>> with Donald (which will be posted soon). Flagged as RFC, because I'm
>>> not sure this is legit, and also not sure we want to fix it this way.
>>> I tried re-using drm_sched_entity::dependency, but it's a bit of a mess
>>> because of the asynchronousity of the wait, and the fact we use
>>> drm_sched_entity::dependency to know if we have a clear_dep()
>>> callback registered, so we can easily reset it without removing the
>>> callback.
>> Well yes, that's a known problem. But this is really not the right
>> approach to fixing this.
>>
>> Trying to wait for all the dependencies before killing jobs was added
>> because of the way we kept track of dma_fences in dma_resv objects.
>> Basically adding exclusive fences removed all other fences leading to a
>> bit fragile memory management.
> Okay.
>
>> This handling was removed by now and so the workaround for waiting for
>> dependencies is not really necessary any more, but I think it is still
>> better to do so.
>>
>> The right approach of getting this waiting for dependencies completely
>> straight is also not to touch entity->dependency in any way, but to stop
>> removing them from the XA in drm_sched_job_dependency(). Otherwise you
>> don't catch the pipeline optimized ones either.
> Do you want me to post a v2 doing that, or should I forget about it?
> If we decide to keep things like that, it might be good to at least
> add a comment explaining why we don't care.

Well, if you have time fixing this fully and keeping the dependency 
handling is certainly be preferred.

Regards,
Christian.

>
> Regards,
>
> Boris



More information about the dri-devel mailing list