[RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

Thu Apr 6 10:14:36 UTC 2023

Am 06.04.23 um 08:37 schrieb Daniel Vetter:
> On Thu, Apr 06, 2023 at 02:08:10AM +0000, Matthew Brost wrote:
>> On Wed, Apr 05, 2023 at 12:12:27PM +0200, Daniel Vetter wrote:
>>> On Wed, 5 Apr 2023 at 11:57, Christian König <christian.koenig at amd.com> wrote:
>>>> Am 05.04.23 um 11:07 schrieb Daniel Vetter:
>>>>> [SNIP]
>>>>>> I would approach it from the complete other side. This component here is a
>>>>>> tool to decide what job should run next.
>>>>>>
>>>>>> How that is then signaled and run should not be part of the scheduler, but
>>>>>> another more higher level component.
>>>>>>
>>>>>> This way you also don't have a problem with not using DMA-fences as
>>>>>> dependencies as well as constrains for running more jobs.
>>>>> I think we're talking about two things here and mixing them up.
>>>>>
>>>>> For the dependencies I agree with you, and imo that higher level tool
>>>>> should probably just be an on-demand submit thread in userspace for the
>>>>> rare case where the kernel would need to sort out a dependency otherwise
>>>>> (due to running out of ringspace in the per-ctx ringbuffer).
>>>>>
>>>>> The other thing is the message passing stuff, and this is what I was
>>>>> talking about above. This has nothing to do with handling dependencies,
>>>>> but with talking to the gpu fw. Here the intel design issue is that the fw
>>>>> only provides a single queue, and it's in-order. Which means it
>>>>> fundamentally has the stalling issue you describe as a point against a
>>>>> message passing design. And fundamentally we need to be able to talk to
>>>>> the fw in the scheduler ->run_job callback.
>>>>>
>>>>> The proposal here for the message passing part is that since it has the
>>>>> stalling issue already anyway, and the scheduler needs to be involved
>>>>> anyway, it makes sense to integrated this (as an optional thing, only for
>>>>> drivers which have this kind of fw interface) into the scheduler.
>>>>> Otherwise you just end up with two layers for no reason and more ping-pong
>>>>> delay because the ->run_job needs to kick off the subordinate driver layer
>>>>> first. Note that for this case the optional message passing support in the
>>>>> drm/scheduler actually makes things better, because it allows you to cut
>>>>> out one layer.
>>>>>
>>>>> Of course if a driver with better fw interface uses this message passing
>>>>> support, then that's bad. Hence the big warning in the kerneldoc.
>>>> Well what I wanted to say is that if you design the dependency handling
>>>> / scheduler properly you don't need the message passing through it.
>>>>
>>>> For example if the GPU scheduler component uses a work item to do it's
>>>> handling instead of a kthread you could also let the driver specify the
>>>> work queue where this work item is executed on.
>>>>
>>>> When you design it like this the driver specifies the thread context of
>>>> execution for it's job. In other words it can specify a single threaded
>>>> firmware work queue as well.
>>>>
>>>> When you then have other messages which needs to be passed to the
>>>> firmware you can also use the same single threaded workqueue for this.
>>>>
>>>> Drivers which have a different firmware interface would just use one of
>>>> the system work queues instead.
>>>>
>>>> This approach basically decouples the GPU scheduler component from the
>>>> message passing functionality.
>>> Hm I guess we've been talking past each another big time, because
>>> that's really what I thought was under discussions? Essentially the
>>> current rfc, but implementing with some polish.
>>>
>> I think Daniel pretty much nailed it here (thanks), to recap:
>>
>> 1. I want the messages in the same worker so run_job / free_job /
>> process_msg execution is mutual exclusive and also so during reset paths
>> if the worker is stopped all the entry points can't be entered.
>>
>> If this is a NAK, then another worker is fine I guess. A lock between
>> run_job / free_job + process_msg should solve the exclusion issue and the
>> reset paths can also stop this new worker too. That being said I'd
>> rather leave this as is but will not fight this point.
>>
>> 2. process_msg is just used to communicate with the firmware using the
>> same queue as submission. Waiting for space in this queue is the only
>> place this function can block (same as submission), well actually we
>> have the concept a preempt time slice but that sleeps for 10 ms by
>> default. Also preempt is only used in LR entities so I don't think it is
>> relavent in this case either.
>>
>> 3. Agree this is in the dma-fence signaling path (if process_msg is in
>> the submission worker) so we can't block indefinitely or an unreasonable
>> period of time (i.e. we must obey dma-fence rules).
> Just to hammer this in: Not just process_msg is in the dma_fence signaling
> path, but the entire fw queue where everything is being funneled through,
> including whatever the fw is doing to process these.
>
> Yes this is terrible and blew up a few times already :-/
>
> But also, probably something that the docs really need to hammer in, to
> make sure people don't look at this and thinkg "hey this seems to be the
> recommended way to do this on linux". We don't want hw people to build
> more of these designs, they're an absolute pain to deal with with Linux'
> dma_fence signalling and gpu job scheduling rules.
>
> It's just that if you're stuck with such fw, then integrating the flow
> into drm/sched instead of having an extra layer of workers seems the
> better of two pretty bad solutions.

Yeah and if you have such fw limitations, make sure that you use 
something which is understood by lockdep to feed into it.

In other words, either locks or work item/queue and not some message 
passing functionality through the scheduler.

As far as I can see the approach with the work item/queue should fit 
your needs here.

Christian.

> -Daniel
>
>   
>> 4. Agree the documentation for thw usage of the messaging interface
>> needs to be clear.
>>
>> 5. Agree that my code could alway use polishing.
>>
>> Lets close on #1 then can I get on general Ack on this part of the RFC
>> and apply the polish in the full review process?
>>
>> Matt
>>
>>> iow I agree with you (I think at least).
>>> -Daniel
>>> -- 
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> http://blog.ffwll.ch