[RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

Thomas Hellström thomas.hellstrom at linux.intel.com
Tue Apr 4 12:54:50 UTC 2023


Hi, Christian,

On 4/4/23 11:09, Christian König wrote:
> Am 04.04.23 um 02:22 schrieb Matthew Brost:
>> From: Thomas Hellström <thomas.hellstrom at linux.intel.com>
>>
>> For long-running workloads, drivers either need to open-code completion
>> waits, invent their own synchronization primitives, or internally use
>> dma-fences that do not obey the cross-driver dma-fence protocol. Without
>> any lockdep annotation, all these approaches are error-prone.
>>
>> Since, for example, the drm scheduler uses dma-fences, it is desirable
>> for a driver to be able to use it for throttling and error handling also
>> with internal dma-fences that do not obey the cross-driver dma-fence
>> protocol.
>>
>> Introduce long-running completion fences in the form of dma-fences, and add
>> lockdep annotation for them. In particular:
>>
>> * Do not allow waiting under any memory management locks.
>> * Do not allow attaching them to a dma-resv object.
>> * Introduce a new interface for adding callbacks, making the helper that
>>    adds a callback sign off on being aware that the dma-fence may not
>>    complete anytime soon. Typically this will be the scheduler chaining
>>    a new long-running fence on another one.
>
> Well that's pretty much what I tried before: 
> https://lwn.net/Articles/893704/
>
> And the reasons why it was rejected haven't changed.
>
> Regards,
> Christian.
>
Yes, TBH this was mostly to get a discussion going on how we'd best tackle 
this problem while still being able to reuse the scheduler for long-running 
workloads.

I couldn't see any clear decision on your series, though. One main 
difference I see is that this is intended for driver-internal use only 
(I'm counting using the drm_scheduler as a helper as driver-private use). 
This is by no means an attempt to tackle the indefinite fence problem.
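
For context, the callback "sign off" mentioned in the patch description 
above boils down to something like the sketch below. The wrapper name and 
body are illustrative only; the exact interface is whatever the patch 
itself defines:

/*
 * Sketch only -- dma_fence_lr_add_callback() is an illustrative name,
 * not necessarily what the series ends up calling it.
 */
#include <linux/dma-fence.h>

static inline int
dma_fence_lr_add_callback(struct dma_fence *fence, struct dma_fence_cb *cb,
			  dma_fence_func_t func)
{
	/*
	 * By using this helper instead of dma_fence_add_callback(), the
	 * caller explicitly signs off on the fact that @fence is a
	 * long-running fence that may not complete anytime soon and does
	 * not obey the cross-driver dma-fence rules.
	 */
	return dma_fence_add_callback(fence, cb, func);
}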

We could of course invent a completely different data type that abstracts 
the synchronization the scheduler needs in the long-running case, or each 
driver could hack something up, like sleeping in the prepare_job() or 
run_job() callback for throttling. But those waits should still be 
annotated one way or another (and probably in a similar way across 
drivers) to make sure we don't do anything bad; see the sketch below.
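
As a strawman of what such an annotation could look like on the driver 
side (the names are made up, nothing here is part of the series):

/*
 * Minimal sketch, hypothetical driver code: annotate a driver-internal
 * throttling wait with a lockdep map so that performing it under
 * memory-management locks (fs_reclaim, dma-resv, ...) triggers a splat
 * instead of a silent deadlock risk.
 */
#include <linux/lockdep.h>
#include <linux/dma-fence.h>

static struct lockdep_map lr_wait_map =
	STATIC_LOCKDEP_MAP_INIT("my_driver_lr_fence_wait", &lr_wait_map);

/* Called from the driver's prepare_job() / run_job() for throttling. */
static void my_driver_throttle(struct dma_fence *lr_fence)
{
	lock_map_acquire(&lr_wait_map);
	/* May block indefinitely -- this is the wait we want annotated. */
	dma_fence_wait(lr_fence, false);
	lock_map_release(&lr_wait_map);
}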

So any suggestions as to what the better solution here would be are 
appreciated.

Thanks,

Thomas






