[RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

Christian König christian.koenig at amd.com
Wed Apr 5 12:39:35 UTC 2023


Am 05.04.23 um 14:35 schrieb Thomas Hellström:
> Hi,
>
> On 4/4/23 21:25, Daniel Vetter wrote:
>> On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
>>> On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) 
>>> wrote:
>>>> On 4/4/23 15:10, Christian König wrote:
>>>>> Am 04.04.23 um 14:54 schrieb Thomas Hellström:
>>>>>> Hi, Christian,
>>>>>>
>>>>>> On 4/4/23 11:09, Christian König wrote:
>>>>>>> Am 04.04.23 um 02:22 schrieb Matthew Brost:
>>>>>>>> From: Thomas Hellström <thomas.hellstrom at linux.intel.com>
>>>>>>>>
>>>>>>>> For long-running workloads, drivers either need to open-code
>>>>>>>> completion
>>>>>>>> waits, invent their own synchronization primitives or 
>>>>>>>> internally use
>>>>>>>> dma-fences that do not obey the cross-driver dma-fence 
>>>>>>>> protocol, but
>>>>>>>> without any lockdep annotation all these approaches are error 
>>>>>>>> prone.
>>>>>>>>
>>>>>>>> So since, for example, the drm scheduler uses dma-fences, it is
>>>>>>>> desirable for a driver to be able to use it for throttling and
>>>>>>>> error handling also with internal dma-fences that do not obey the
>>>>>>>> cross-driver dma-fence protocol.
>>>>>>>>
>>>>>>>> Introduce long-running completion fences in the form of
>>>>>>>> dma-fences, and add
>>>>>>>> lockdep annotation for them. In particular:
>>>>>>>>
>>>>>>>> * Do not allow waiting under any memory management locks.
>>>>>>>> * Do not allow to attach them to a dma-resv object.
>>>>>>>> * Introduce a new interface for adding callbacks, making the code
>>>>>>>>    adding a callback sign off on being aware that the dma-fence may
>>>>>>>>    not complete anytime soon. Typically this will be the scheduler
>>>>>>>>    chaining a new long-running fence on another one.
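
As a rough illustration of that sign-off idea (all names below are made up
for this sketch, this is not the interface the patch actually adds), such a
helper could funnel long-running fences through a dedicated entry point with
its own lockdep map:

    /*
     * Illustrative sketch only; the helper and lockdep map names are
     * hypothetical, not the interface added by this patch.
     */
    #include <linux/dma-fence.h>
    #include <linux/lockdep.h>

    static struct lockdep_map lr_fence_map =
            STATIC_LOCKDEP_MAP_INIT("lr_fence", &lr_fence_map);

    /*
     * Adding a callback through this wrapper signs off on the fact that
     * the fence may not complete anytime soon, and lets lockdep complain
     * if this happens under a lock that memory management depends on.
     */
    static int lr_fence_add_callback(struct dma_fence *fence,
                                     struct dma_fence_cb *cb,
                                     dma_fence_func_t func)
    {
            int ret;

            lock_map_acquire(&lr_fence_map);
            ret = dma_fence_add_callback(fence, cb, func);
            lock_map_release(&lr_fence_map);

            return ret;
    }
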
>>>>>>> Well that's pretty much what I tried before:
>>>>>>> https://lwn.net/Articles/893704/
>>>>>>>
>>> I don't think this is quite the same: this explicitly enforces that we
>>> don't break the dma-fence rules (not in the path of memory allocations,
>>> not exported in any way); essentially this is just a SW sync point
>>> reusing the dma-fence infrastructure for signaling / callbacks. I
>>> believe your series
>>> tried to
>>> export these fences to user space (admittedly I haven't fully read your
>>> series).
>>>
>>> In this use case we essentially just want to flow control the ring via
>>> the DRM scheduler + maintain a list of pending jobs so the TDR can be
>>> used for cleanup if an LR entity encounters an error. To me this seems
>>> perfectly reasonable but I know dma-fence rules are akin to a holy war.
>>>
>>> If we return NULL in run_job, now we have to be able to sink all jobs
>>> in the backend regardless of ring space, maintain a list of jobs 
>>> pending
>>> for cleanup after errors, and write a different cleanup path as now the
>>> TDR doesn't work. Seems very, very silly to duplicate all of this code
>>> when the DRM scheduler provides all of this for us. Also if we go this
>>> route, now all drivers are going to invent ways to handle LR jobs with 
>>> the
>>> DRM scheduler.
>>>
>>> This solution is pretty clear: mark the scheduler as LR and don't
>>> export any fences from the scheduler. If you try to export these
>>> fences, a blow-up happens.
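
Roughly speaking, that "blow-up" can be a guard of this kind (the ops table
and wrapper below are invented for illustration, not code from the series):

    #include <linux/dma-resv.h>
    #include <linux/dma-fence.h>

    /* Hypothetical ops table used by the driver's long-running fences. */
    extern const struct dma_fence_ops my_lr_fence_ops;

    static void my_resv_add_fence(struct dma_resv *resv,
                                  struct dma_fence *fence,
                                  enum dma_resv_usage usage)
    {
            /* Long-running fences must never end up in a dma_resv. */
            if (WARN_ON(fence->ops == &my_lr_fence_ops))
                    return;

            dma_resv_add_fence(resv, fence, usage);
    }
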
>> The problem is if you mix things up. Like for resets you need all the
>> schedulers on an engine/set-of-engines to quiesce or things get
>> potentially hilarious. If you now have a scheduler in forever limbo, the
>> dma_fence guarantees are right out the window.
>>
>> But the issue you're having is fairly specific if it's just about
>> ringspace. I think the dumbest fix is to just block in submit if you run
>> out of per-ctx ringspace, and call it a day. This notion that somehow 
>> the
>> kernel is supposed to provide a bottomless queue of anything userspace
>> submits simply doesn't hold up in reality (as much as userspace 
>> standards
>> committees would like it to), and as long as it doesn't have a 
>> real-world
>> perf impact it doesn't really matter why we end up blocking in the 
>> submit
>> ioctl. It might also be a simple memory allocation that hits a snag in
>> page reclaim.
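
A minimal sketch of that "block in submit" idea, with the context struct and
ring accounting invented purely for illustration:

    #include <linux/wait.h>
    #include <linux/compiler.h>
    #include <linux/types.h>

    /* Hypothetical per-context state; ring_wq is woken from the ring irq. */
    struct my_lr_ctx {
            wait_queue_head_t ring_wq;
            u32 ring_head, ring_tail;
            u32 ring_size;          /* power of two */
    };

    static u32 my_ring_space(struct my_lr_ctx *ctx)
    {
            /* Free bytes between hardware head and software tail. */
            return (READ_ONCE(ctx->ring_head) - ctx->ring_tail - 1) &
                   (ctx->ring_size - 1);
    }

    /*
     * Called early in the submit ioctl, before any memory-management
     * locks are taken, so blocking here violates no dma-fence rules.
     */
    static int my_submit_wait_ring(struct my_lr_ctx *ctx, u32 needed)
    {
            return wait_event_interruptible(ctx->ring_wq,
                                            my_ring_space(ctx) >= needed);
    }
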
>
> So it seems the discussion around the long-running synchronization 
> diverged a bit between threads and this thread was hijacked for 
> preempt-fences and userptr.
>
> Do I understand it correctly that the recommendation from both Daniel 
> and Christian is to *not* use the drm scheduler for long-running 
> compute jobs, but track any internal dma-fence dependencies (pipelined 
> clearing or whatever) in a separate mechanism and handle unresolved 
> dependencies on other long-running jobs using -EWOULDBLOCK?

Yeah, I think that's a good summary.

If needed we could extract some scheduler functionality into separate 
components, but the fundamental problem is that the GPU scheduler 
provides a dma_fence interface to the outside to signal job completion, 
and Daniel and I seem to agree that you really don't want that.
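
Purely as a sketch of that -EWOULDBLOCK path (the ops-table test below is a
made-up driver-internal convention, not an existing API):

    #include <linux/dma-fence.h>
    #include <linux/errno.h>

    /* Hypothetical: how the driver recognizes its own long-running fences. */
    extern const struct dma_fence_ops my_lr_fence_ops;

    static int my_check_submit_dependency(struct dma_fence *dep)
    {
            if (dma_fence_is_signaled(dep))
                    return 0;

            /*
             * Unresolved dependency on another long-running job: bounce
             * the submission back to userspace instead of queueing it
             * behind a fence that may never signal.
             */
            if (dep->ops == &my_lr_fence_ops)
                    return -EWOULDBLOCK;

            /* Ordinary dma-fence dependency: fine to pipeline as usual. */
            return 0;
    }
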

Regards,
Christian.

>
> Thanks,
> Thomas
>
>
>
>
>
>>>>>>> And the reasons why it was rejected haven't changed.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>> Yes, TBH this was mostly to get discussion going on how we'd best
>>>>>> tackle this problem while being able to reuse the scheduler for
>>>>>> long-running workloads.
>>>>>>
>>>>>> I couldn't see any clear decision on your series, though, but one
>>>>>> main difference I see is that this is intended for driver-internal
>>>>>> use only. (I'm counting using the drm_scheduler as a helper for
>>>>>> driver-private use). This is by no means a way to try tackle the
>>>>>> indefinite fence problem.
>>>>> Well this was just my latest try to tackle this, but essentially the
>>>>> problems are the same as with your approach: When we express such
>>>>> operations as dma_fence there is always the chance that we leak that
>>>>> somewhere.
>>>>>
>>>>> My approach of adding a flag noting that this operation is 
>>>>> dangerous and
>>>>> can't be synced with something memory management depends on tried to
>>>>> contain this as much as possible, but Daniel still pretty clearly
>>>>> rejected it (for good reasons I think).
>>>>>
>>>>>> We could ofc invent a completely different data-type that abstracts
>>>>>> the synchronization the scheduler needs in the long-running case, or
>>>>>> each driver could hack something up, like sleeping in the
>>>>>> prepare_job() or run_job() callback for throttling, but those waits
>>>>>> should still be annotated one way or another
>>>>>> (and probably in a similar way across drivers) to make sure we don't
>>>>>> do anything bad.
>>>>>>
>>>>>>   So any suggestions as to what would be the better solution here
>>>>>> would be appreciated.
>>>>> Mhm, do we really need the GPU scheduler for that?
>>>>>
>>> I think we need to solve this within the DRM scheduler one way or
>>> another.
>> Yeah so if we conclude that the queue really must be bottomless then I
>> agree drm-sched should help sort out the mess. Because I'm guessing
>> that every driver will have this issue. But that's a big if.
>>
>> I guess if we teach the drm scheduler that some jobs are fairly endless
>> then maybe it wouldn't be too far-fetched to also teach it to wait for a
>> previous one to finish (but not with the dma_fence that preempts, 
>> which we
>> put into the dma_resv for memory management, but some other struct
>> completion). The scheduler already has a concept of not stuffing too 
>> much
>> stuff into the same queue after all, so this should fit?
>> -Daniel
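
For reference, a rough sketch of that struct-completion variant, with the
job structure and naming invented here:

    #include <linux/completion.h>

    /* Hypothetical per-job state for chaining long-running jobs. */
    struct my_lr_job {
            struct completion done;         /* completed when the job finishes */
            struct my_lr_job *prev;         /* previous job on the same queue */
    };

    /*
     * Throttle on the previous long-running job with a plain completion
     * instead of the preempt dma_fence that goes into the dma_resv.
     */
    static int my_lr_job_wait_prev(struct my_lr_job *job)
    {
            if (!job->prev)
                    return 0;

            return wait_for_completion_interruptible(&job->prev->done);
    }
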
>>
>>
>>>>> I mean in the 1 to 1 case you basically just need a component which
>>>>> collects the dependencies as dma_fence and if all of them are 
>>>>> fulfilled
>>>>> schedules a work item.
>>>>>
>>>>> As long as the work item itself doesn't produce a dma_fence it can 
>>>>> then
>>>>> still just wait for other non-dma_fence dependencies.
>>>>>
>>>>> Then the work function could submit the work and wait for the result.
>>>>>
>>>>> The work item would then pretty much represent what you want: you can
>>>>> wait for it to finish and pass it along as a long-running dependency.
>>>>>
>>>>> Maybe give it a funky name and wrap it up in a structure, but that's
>>>>> basically it.
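
Something along those lines could look roughly like this (all names invented,
not an existing component):

    #include <linux/dma-fence.h>
    #include <linux/workqueue.h>
    #include <linux/atomic.h>
    #include <linux/kernel.h>

    /* Hypothetical dependency collector: runs the work once all fences signal. */
    struct my_lr_work {
            struct work_struct work;
            atomic_t pending;       /* outstanding dma_fence deps + 1 bias */
    };

    struct my_lr_dep {
            struct dma_fence_cb cb;
            struct my_lr_work *lw;
    };

    static void my_lr_work_dep_done(struct my_lr_work *lw)
    {
            if (atomic_dec_and_test(&lw->pending))
                    schedule_work(&lw->work);
    }

    static void my_lr_dep_signaled(struct dma_fence *fence,
                                   struct dma_fence_cb *cb)
    {
            struct my_lr_dep *dep = container_of(cb, struct my_lr_dep, cb);

            my_lr_work_dep_done(dep->lw);
    }

    static void my_lr_work_init(struct my_lr_work *lw, work_func_t fn)
    {
            INIT_WORK(&lw->work, fn);
            /* Bias of one, dropped in my_lr_work_arm(), avoids an early run. */
            atomic_set(&lw->pending, 1);
    }

    static void my_lr_work_add_dep(struct my_lr_work *lw,
                                   struct dma_fence *fence,
                                   struct my_lr_dep *dep)
    {
            dep->lw = lw;
            atomic_inc(&lw->pending);
            /* A non-zero return means the fence already signaled. */
            if (dma_fence_add_callback(fence, &dep->cb, my_lr_dep_signaled))
                    my_lr_work_dep_done(lw);
    }

    /* Drop the bias; the work runs once every dependency has signaled. */
    static void my_lr_work_arm(struct my_lr_work *lw)
    {
            my_lr_work_dep_done(lw);
    }

The work function itself can then submit and wait for any non-dma_fence
dependencies without violating the dma-fence rules.
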
>>>>>
>>>> This very much sounds like an i915_sw_fence for the dependency
>>>> tracking and dma_fence_work for the actual work, although its
>>>> completion fence is a dma_fence.
>>>>
>>> Agree, this does sound i915-ish; as stated below, one of the mandates in
>>> Xe was to use the DRM scheduler. Beyond that, as someone who wrote a
>>> submission backend in both the i915 and Xe, I love how the DRM scheduler
>>> works (single entry point); it makes everything so much easier.
>>>
>>> Matt
>>>
>>>> Although that goes against the whole idea that a condition for merging
>>>> the xe driver would be that we implement some sort of minimal
>>>> scaffolding for long-running workloads in the drm scheduler, and the
>>>> thinking behind that is to avoid implementing intel-specific solutions
>>>> like those...
>>>>
>>>> Thanks,
>>>>
>>>> Thomas
>>>>
>>>>
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>


