[RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

Wed Apr 5 12:35:46 UTC 2023

Hi,

On 4/4/23 21:25, Daniel Vetter wrote:
> On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
>> On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:
>>> On 4/4/23 15:10, Christian König wrote:
>>>> Am 04.04.23 um 14:54 schrieb Thomas Hellström:
>>>>> Hi, Christian,
>>>>>
>>>>> On 4/4/23 11:09, Christian König wrote:
>>>>>> Am 04.04.23 um 02:22 schrieb Matthew Brost:
>>>>>>> From: Thomas Hellström <thomas.hellstrom at linux.intel.com>
>>>>>>>
>>>>>>> For long-running workloads, drivers either need to open-code
>>>>>>> completion
>>>>>>> waits, invent their own synchronization primitives or internally use
>>>>>>> dma-fences that do not obey the cross-driver dma-fence protocol, but
>>>>>>> without any lockdep annotation all these approaches are error prone.
>>>>>>>
>>>>>>> So since for example the drm scheduler uses dma-fences it is
>>>>>>> desirable for
>>>>>>> a driver to be able to use it for throttling and error
>>>>>>> handling also with
>>>>>>> internal dma-fences tha do not obey the cros-driver
>>>>>>> dma-fence protocol.
>>>>>>>
>>>>>>> Introduce long-running completion fences in form of
>>>>>>> dma-fences, and add
>>>>>>> lockdep annotation for them. In particular:
>>>>>>>
>>>>>>> * Do not allow waiting under any memory management locks.
>>>>>>> * Do not allow to attach them to a dma-resv object.
>>>>>>> * Introduce a new interface for adding callbacks making the
>>>>>>> helper adding
>>>>>>>     a callback sign off on that it is aware that the dma-fence may not
>>>>>>>     complete anytime soon. Typically this will be the
>>>>>>> scheduler chaining
>>>>>>>     a new long-running fence on another one.
>>>>>> Well that's pretty much what I tried before:
>>>>>> https://lwn.net/Articles/893704/
>>>>>>
>> I don't think this quite the same, this explictly enforces that we don't
>> break the dma-fence rules (in path of memory allocations, exported in
>> any way), essentially this just SW sync point reusing dma-fence the
>> infrastructure for signaling / callbacks. I believe your series tried to
>> export these fences to user space (admittedly I haven't fully read your
>> series).
>>
>> In this use case we essentially just want to flow control the ring via
>> the dma-scheduler + maintain a list of pending jobs so the TDR can be
>> used for cleanup if LR entity encounters an error. To me this seems
>> perfectly reasonable but I know dma-femce rules are akin to a holy war.
>>
>> If we return NULL in run_job, now we have to be able to sink all jobs
>> in the backend regardless on ring space, maintain a list of jobs pending
>> for cleanup after errors, and write a different cleanup path as now the
>> TDR doesn't work. Seems very, very silly to duplicate all of this code
>> when the DRM scheduler provides all of this for us. Also if we go this
>> route, now all drivers are going to invent ways to handle LR jobs /w the
>> DRM scheduler.
>>
>> This solution is pretty clear, mark the scheduler as LR, and don't
>> export any fences from the scheduler. If you try to export these fences
>> a blow up happens.
> The problem is if you mix things up. Like for resets you need all the
> schedulers on an engine/set-of-engines to quiescent or things get
> potentially hilarious. If you now have a scheduler in forever limbo, the
> dma_fence guarantees are right out the window.
>
> But the issue you're having is fairly specific if it's just about
> ringspace. I think the dumbest fix is to just block in submit if you run
> out of per-ctx ringspace, and call it a day. This notion that somehow the
> kernel is supposed to provide a bottomless queue of anything userspace
> submits simply doesn't hold up in reality (as much as userspace standards
> committees would like it to), and as long as it doesn't have a real-world
> perf impact it doesn't really matter why we end up blocking in the submit
> ioctl. It might also be a simple memory allocation that hits a snag in
> page reclaim.

So it seems the discussion around the long-running synchronization 
diverged a bit between threads and this thread was hijacked for 
preempt-fences and userptr.

Do I understand it correctly that the recommendation from both Daniel 
and Christian is to *not* use the drm scheduler for long-running compute 
jobs, but track any internal dma-fence dependencies (pipelined clearing 
or whatever) in a separate mechanism and handle unresolved dependencies 
on other long-running jobs using -EWOULDBLOCK?

Thanks,
Thomas

>>>>>> And the reasons why it was rejected haven't changed.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>> Yes, TBH this was mostly to get discussion going how we'd best
>>>>> tackle this problem while being able to reuse the scheduler for
>>>>> long-running workloads.
>>>>>
>>>>> I couldn't see any clear decision on your series, though, but one
>>>>> main difference I see is that this is intended for driver-internal
>>>>> use only. (I'm counting using the drm_scheduler as a helper for
>>>>> driver-private use). This is by no means a way to try tackle the
>>>>> indefinite fence problem.
>>>> Well this was just my latest try to tackle this, but essentially the
>>>> problems are the same as with your approach: When we express such
>>>> operations as dma_fence there is always the change that we leak that
>>>> somewhere.
>>>>
>>>> My approach of adding a flag noting that this operation is dangerous and
>>>> can't be synced with something memory management depends on tried to
>>>> contain this as much as possible, but Daniel still pretty clearly
>>>> rejected it (for good reasons I think).
>>>>
>>>>> We could ofc invent a completely different data-type that abstracts
>>>>> the synchronization the scheduler needs in the long-running case, or
>>>>> each driver could hack something up, like sleeping in the
>>>>> prepare_job() or run_job() callback for throttling, but those waits
>>>>> should still be annotated in one way or annotated one way or another
>>>>> (and probably in a similar way across drivers) to make sure we don't
>>>>> do anything bad.
>>>>>
>>>>>   So any suggestions as to what would be the better solution here
>>>>> would be appreciated.
>>>> Mhm, do we really the the GPU scheduler for that?
>>>>
>> I think we need to solve this within the DRM scheduler one way or
>> another.
> Yeah so if we conclude that the queue really must be bottomless then I
> agree drm-sched should help out sort out the mess. Because I'm guessing
> that every driver will have this issue. But that's a big if.
>
> I guess if we teach the drm scheduler that some jobs are fairly endless
> then maybe it wouldn't be too far-fetched to also teach it to wait for a
> previous one to finish (but not with the dma_fence that preempts, which we
> put into the dma_resv for memory management, but some other struct
> completion). The scheduler already has a concept of not stuffing too much
> stuff into the same queue after all, so this should fit?
> -Daniel
>
>
>>>> I mean in the 1 to 1 case  you basically just need a component which
>>>> collects the dependencies as dma_fence and if all of them are fulfilled
>>>> schedules a work item.
>>>>
>>>> As long as the work item itself doesn't produce a dma_fence it can then
>>>> still just wait for other none dma_fence dependencies.
>>>>
>>>> Then the work function could submit the work and wait for the result.
>>>>
>>>> The work item would then pretty much represent what you want, you can
>>>> wait for it to finish and pass it along as long running dependency.
>>>>
>>>> Maybe give it a funky name and wrap it up in a structure, but that's
>>>> basically it.
>>>>
>>> This very much sounds like a i915_sw_fence for the dependency tracking and
>>> dma_fence_work for the actual work although it's completion fence is a
>>> dma_fence.
>>>
>> Agree this does sound to i915ish as stated below one of mandates in Xe
>> was to use the DRM scheduler. Beyond that as someone who a submission
>> backend in the i915 and Xe, I love how the DRM scheduler works (single
>> entry point), it makes everything so much easier.
>>
>> Matt
>>
>>> Although that goes against the whole idea of a condition for merging the xe
>>> driver would be that we implement some sort of minimal scaffolding for
>>> long-running workloads in the drm scheduler, and the thinking behind that is
>>> to avoid implementing intel-specific solutions like those...
>>>
>>> Thanks,
>>>
>>> Thomas
>>>
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Thanks,
>>>>>
>>>>> Thomas
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>