[RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

Wed Apr 5 14:08:06 UTC 2023

Am 05.04.23 um 14:45 schrieb Daniel Vetter:
> On Wed, Apr 05, 2023 at 02:39:35PM +0200, Christian König wrote:
>> Am 05.04.23 um 14:35 schrieb Thomas Hellström:
>>> Hi,
>>>
>>> On 4/4/23 21:25, Daniel Vetter wrote:
>>>> On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
>>>>> On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström
>>>>> (Intel) wrote:
>>>>>> On 4/4/23 15:10, Christian König wrote:
>>>>>>> Am 04.04.23 um 14:54 schrieb Thomas Hellström:
>>>>>>>> Hi, Christian,
>>>>>>>>
>>>>>>>> On 4/4/23 11:09, Christian König wrote:
>>>>>>>>> Am 04.04.23 um 02:22 schrieb Matthew Brost:
>>>>>>>>>> From: Thomas Hellström <thomas.hellstrom at linux.intel.com>
>>>>>>>>>>
>>>>>>>>>> For long-running workloads, drivers either need to open-code
>>>>>>>>>> completion
>>>>>>>>>> waits, invent their own synchronization
>>>>>>>>>> primitives or internally use
>>>>>>>>>> dma-fences that do not obey the cross-driver
>>>>>>>>>> dma-fence protocol, but
>>>>>>>>>> without any lockdep annotation all these
>>>>>>>>>> approaches are error prone.
>>>>>>>>>>
>>>>>>>>>> So since for example the drm scheduler uses dma-fences it is
>>>>>>>>>> desirable for
>>>>>>>>>> a driver to be able to use it for throttling and error
>>>>>>>>>> handling also with
>>>>>>>>>> internal dma-fences tha do not obey the cros-driver
>>>>>>>>>> dma-fence protocol.
>>>>>>>>>>
>>>>>>>>>> Introduce long-running completion fences in form of
>>>>>>>>>> dma-fences, and add
>>>>>>>>>> lockdep annotation for them. In particular:
>>>>>>>>>>
>>>>>>>>>> * Do not allow waiting under any memory management locks.
>>>>>>>>>> * Do not allow to attach them to a dma-resv object.
>>>>>>>>>> * Introduce a new interface for adding callbacks making the
>>>>>>>>>> helper adding
>>>>>>>>>>      a callback sign off on that it is aware
>>>>>>>>>> that the dma-fence may not
>>>>>>>>>>      complete anytime soon. Typically this will be the
>>>>>>>>>> scheduler chaining
>>>>>>>>>>      a new long-running fence on another one.
>>>>>>>>> Well that's pretty much what I tried before:
>>>>>>>>> https://lwn.net/Articles/893704/
>>>>>>>>>
>>>>> I don't think this quite the same, this explictly enforces that
>>>>> we don't
>>>>> break the dma-fence rules (in path of memory allocations, exported in
>>>>> any way), essentially this just SW sync point reusing dma-fence the
>>>>> infrastructure for signaling / callbacks. I believe your series
>>>>> tried to
>>>>> export these fences to user space (admittedly I haven't fully read your
>>>>> series).
>>>>>
>>>>> In this use case we essentially just want to flow control the ring via
>>>>> the dma-scheduler + maintain a list of pending jobs so the TDR can be
>>>>> used for cleanup if LR entity encounters an error. To me this seems
>>>>> perfectly reasonable but I know dma-femce rules are akin to a holy war.
>>>>>
>>>>> If we return NULL in run_job, now we have to be able to sink all jobs
>>>>> in the backend regardless on ring space, maintain a list of jobs
>>>>> pending
>>>>> for cleanup after errors, and write a different cleanup path as now the
>>>>> TDR doesn't work. Seems very, very silly to duplicate all of this code
>>>>> when the DRM scheduler provides all of this for us. Also if we go this
>>>>> route, now all drivers are going to invent ways to handle LR
>>>>> jobs /w the
>>>>> DRM scheduler.
>>>>>
>>>>> This solution is pretty clear, mark the scheduler as LR, and don't
>>>>> export any fences from the scheduler. If you try to export these fences
>>>>> a blow up happens.
>>>> The problem is if you mix things up. Like for resets you need all the
>>>> schedulers on an engine/set-of-engines to quiescent or things get
>>>> potentially hilarious. If you now have a scheduler in forever limbo, the
>>>> dma_fence guarantees are right out the window.
>>>>
>>>> But the issue you're having is fairly specific if it's just about
>>>> ringspace. I think the dumbest fix is to just block in submit if you run
>>>> out of per-ctx ringspace, and call it a day. This notion that
>>>> somehow the
>>>> kernel is supposed to provide a bottomless queue of anything userspace
>>>> submits simply doesn't hold up in reality (as much as userspace
>>>> standards
>>>> committees would like it to), and as long as it doesn't have a
>>>> real-world
>>>> perf impact it doesn't really matter why we end up blocking in the
>>>> submit
>>>> ioctl. It might also be a simple memory allocation that hits a snag in
>>>> page reclaim.
>>> So it seems the discussion around the long-running synchronization
>>> diverged a bit between threads and this thread was hijacked for
>>> preempt-fences and userptr.
>>>
>>> Do I understand it correctly that the recommendation from both Daniel
>>> and Christian is to *not* use the drm scheduler for long-running compute
>>> jobs, but track any internal dma-fence dependencies (pipelined clearing
>>> or whatever) in a separate mechanism and handle unresolved dependencies
>>> on other long-running jobs using -EWOULDBLOCK?
>> Yeah, I think that's a good summary.
>>
>> If needed we could extract some scheduler functionality into separate
>> components, but the fundamental problem is that to the GPU scheduler
>> provides a dma_fence interface to the outside to signal job completion and
>> Daniel and I seem to agree that you really don't want that.
> I think I'm on something slightly different:
>
> - For anything which semantically is not a dma_fence I agree it probably
>    should be handled with EWOULDBLOCK and passed to userspace. Either with
>    a submit thread or userspace memory fences. Note that in practice you
>    will have a bunch of blocking left in the ioctl, stuff like mutexes or
>    memory allocations when things get really tight and you end up in
>    synchronous reclaim. Not any different from userspace ending up in
>    synchronous reclaim due to a page fault really. Trying to shoehorn
>    userspace memory fences or anything else long-running into drm/sched
>    dependency handling is just way too much a can of worms.
>
> - For the memory management dependencies, which are all dma_fence when
>    pipeline, I do think pushing them through the drm/sched makes sense. It
>    has all the stuff to handle that already, plus it's imo also the ideal
>    place to handle the preempt-ctx dma_fence scaffolding/semantics. Which
>    would give you a really neatly unified command submission interface
>    since in both cases (end-of-batch and long-running) you fish the
>    dma_fence you need to stuff in all the right dma_resv object (for memory
>    management purpose) out of the same place: The drm_sched_job struct.
>
> So I'm _not_ on the "do not use drm/sched for long-running jobs at all".
> That doesn't make much sense to me because you'll just reinventing the
> exact same dma_fence dependency handling and memory management shuffling
> we already have. That seems silly.

How about we stuff the functionality we still want to have into a 
drm_job object?

I mean that really isn't that much, basically just looking at 
drm_syncobj, dma_resv etc.. and extracting all the dependencies.

Christian.

> -Daniel
>
>> Regards,
>> Christian.
>>
>>> Thanks,
>>> Thomas
>>>
>>>
>>>
>>>
>>>
>>>>>>>>> And the reasons why it was rejected haven't changed.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>> Yes, TBH this was mostly to get discussion going how we'd best
>>>>>>>> tackle this problem while being able to reuse the scheduler for
>>>>>>>> long-running workloads.
>>>>>>>>
>>>>>>>> I couldn't see any clear decision on your series, though, but one
>>>>>>>> main difference I see is that this is intended for driver-internal
>>>>>>>> use only. (I'm counting using the drm_scheduler as a helper for
>>>>>>>> driver-private use). This is by no means a way to try tackle the
>>>>>>>> indefinite fence problem.
>>>>>>> Well this was just my latest try to tackle this, but essentially the
>>>>>>> problems are the same as with your approach: When we express such
>>>>>>> operations as dma_fence there is always the change that we leak that
>>>>>>> somewhere.
>>>>>>>
>>>>>>> My approach of adding a flag noting that this operation
>>>>>>> is dangerous and
>>>>>>> can't be synced with something memory management depends on tried to
>>>>>>> contain this as much as possible, but Daniel still pretty clearly
>>>>>>> rejected it (for good reasons I think).
>>>>>>>
>>>>>>>> We could ofc invent a completely different data-type that abstracts
>>>>>>>> the synchronization the scheduler needs in the long-running case, or
>>>>>>>> each driver could hack something up, like sleeping in the
>>>>>>>> prepare_job() or run_job() callback for throttling, but those waits
>>>>>>>> should still be annotated in one way or annotated one way or another
>>>>>>>> (and probably in a similar way across drivers) to make sure we don't
>>>>>>>> do anything bad.
>>>>>>>>
>>>>>>>>    So any suggestions as to what would be the better solution here
>>>>>>>> would be appreciated.
>>>>>>> Mhm, do we really the the GPU scheduler for that?
>>>>>>>
>>>>> I think we need to solve this within the DRM scheduler one way or
>>>>> another.
>>>> Yeah so if we conclude that the queue really must be bottomless then I
>>>> agree drm-sched should help out sort out the mess. Because I'm guessing
>>>> that every driver will have this issue. But that's a big if.
>>>>
>>>> I guess if we teach the drm scheduler that some jobs are fairly endless
>>>> then maybe it wouldn't be too far-fetched to also teach it to wait for a
>>>> previous one to finish (but not with the dma_fence that preempts,
>>>> which we
>>>> put into the dma_resv for memory management, but some other struct
>>>> completion). The scheduler already has a concept of not stuffing too
>>>> much
>>>> stuff into the same queue after all, so this should fit?
>>>> -Daniel
>>>>
>>>>
>>>>>>> I mean in the 1 to 1 case  you basically just need a component which
>>>>>>> collects the dependencies as dma_fence and if all of
>>>>>>> them are fulfilled
>>>>>>> schedules a work item.
>>>>>>>
>>>>>>> As long as the work item itself doesn't produce a
>>>>>>> dma_fence it can then
>>>>>>> still just wait for other none dma_fence dependencies.
>>>>>>>
>>>>>>> Then the work function could submit the work and wait for the result.
>>>>>>>
>>>>>>> The work item would then pretty much represent what you want, you can
>>>>>>> wait for it to finish and pass it along as long running dependency.
>>>>>>>
>>>>>>> Maybe give it a funky name and wrap it up in a structure, but that's
>>>>>>> basically it.
>>>>>>>
>>>>>> This very much sounds like a i915_sw_fence for the
>>>>>> dependency tracking and
>>>>>> dma_fence_work for the actual work although it's completion fence is a
>>>>>> dma_fence.
>>>>>>
>>>>> Agree this does sound to i915ish as stated below one of mandates in Xe
>>>>> was to use the DRM scheduler. Beyond that as someone who a submission
>>>>> backend in the i915 and Xe, I love how the DRM scheduler works (single
>>>>> entry point), it makes everything so much easier.
>>>>>
>>>>> Matt
>>>>>
>>>>>> Although that goes against the whole idea of a condition for
>>>>>> merging the xe
>>>>>> driver would be that we implement some sort of minimal scaffolding for
>>>>>> long-running workloads in the drm scheduler, and the
>>>>>> thinking behind that is
>>>>>> to avoid implementing intel-specific solutions like those...
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Thomas
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>