[Intel-xe] [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread

Thu Aug 17 12:48:31 UTC 2023

On 8/17/23 15:35, Christian König wrote:
> Am 17.08.23 um 13:13 schrieb Danilo Krummrich:
>> On 8/17/23 07:33, Christian König wrote:
>>> [SNIP]
>>> The hardware seems to work mostly the same for all vendors, but you 
>>> somehow seem to think that filling the ring is somehow beneficial 
>>> which is really not the case as far as I can see.
>>
>> I think that's a misunderstanding. I'm not trying to say that it is 
>> *always* beneficial to fill up the ring as much as possible. But I 
>> think it is under certain circumstances, exactly those circumstances I 
>> described for Nouveau.
> 
> As far as I can see this is not correct for Nouveau either.
> 
>>
>> As mentioned, in Nouveau the size of a job is only really limited by 
>> the ring size, which means that one job can (but does not necessarily) 
>> fill up the whole ring. We both agree that this is inefficient, 
>> because it potentially results into the HW run dry due to 
>> hw_submission_limit == 1.
>>
>> I recognize you said that one should define hw_submission_limit and 
>> adjust the other parts of the equation accordingly, the options I see 
>> are:
>>
>> (1) Increase the ring size while keeping the maximum job size.
>> (2) Decrease the maximum job size while keeping the ring size.
>> (3) Let the scheduler track the actual job size rather than the 
>> maximum job size.
>>
>> (1) results into potentially wasted ring memory, because we're not 
>> always reaching the maximum job size, but the scheduler assumes so.
>>
>> (2) results into more IOCTLs from userspace for the same amount of IBs 
>> and more jobs result into more memory allocations and more work being 
>> submitted to the workqueue (with Matt's patches).
>>
>> (3) doesn't seem to have any of those draw backs.
>>
>> What would be your take on that?
>>
>> Actually, if none of the other drivers is interested into a more 
>> precise way of keeping track of the ring utilization, I'd be totally 
>> fine to do it in a driver specific way. However, unfortunately I don't 
>> see how this would be possible.
>>
>> My proposal would be to just keep the hw_submission_limit (maybe 
>> rename it to submission_unit_limit) and add a submission_units field 
>> to struct drm_sched_job. By default a jobs submission_units field 
>> would be 0 and the scheduler would behave the exact same way as it 
>> does now.
>>
>> Accordingly, jobs with submission_units > 1 would contribute more than 
>> one unit to the submission_unit_limit.
>>
>> What do you think about that?
> 
> I think you are approaching this from the completely wrong side.

First of all, thanks for keeping up the discussion - I appreciate it. 
Some more comments / questions below.

> 
> See the UAPI needs to be stable, so you need a maximum job size 
> otherwise it can happen that a combination of large and small 
> submissions work while a different combination doesn't.

How is this related to the uAPI being stable? What do you mean by 
'stable' in this context?

The Nouveau uAPI allows userspace to pass EXEC jobs by supplying the 
ring ID (channel), in-/out-syncs and a certain amount of indirect push 
buffers. The amount of IBs per job is limited by the amount of IBs 
fitting into the ring. Just to be clear, when I say 'job size' I mean 
the amount of IBs per job.

Maybe I should also mention that the rings we are talking about are 
software rings managed by a firmware scheduler. We can have an arbitrary 
amount of software rings and even multiple ones per FD.

Given a constant ring size I really don't see why I should limit the 
maximum amount of IBs userspace can push per job just to end up with a 
hw_submission_limit > 1.

For example, let's just assume the ring can take 128 IBs, why would I 
limit userspace to submit just e.g. 16 IBs at a time, such that the 
hw_submission_limit becomes 8?

What is the advantage of doing that, rather than letting userspace 
submit *up to* 128 IBs per job and just letting the scheduler push IBs 
to the ring as long as there's actually space left on the ring?

> 
> So what you usually do, and this is driver independent because simply a 
> requirement of the UAPI, is that you say here that's my maximum job size 
> as well as the number of submission which should be pushed to the hw at 
> the same time. And then get the resulting ring size by the product of 
> the two.

Given the above, how is that a requirement of the uAPI?

> 
> That the ring in this use case can't be fully utilized is not a draw 
> back, this is completely intentional design which should apply to all 
> drivers independent of the vendor.

Why wouldn't we want to fully utilize the ring size?

- Danilo

> 
>>
>> Besides all that, you said that filling up the ring just enough to not 
>> let the HW run dry rather than filling it up entirely is desirable. 
>> Why do you think so? I tend to think that in most cases it shouldn't 
>> make difference.
> 
> That results in better scheduling behavior. It's mostly beneficial if 
> you don't have a hw scheduler, but as far as I can see there is no need 
> to pump everything to the hw as fast as possible.
> 
> Regards,
> Christian.
> 
>>
>> - Danilo
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Because one really is the minimum if you want to do work at all, but 
>>>> as you mentioned above a job limit of one can let the ring run dry.
>>>>
>>>> In the end my proposal comes down to tracking the actual size of a 
>>>> job rather than just assuming a pre-defined maximum job size, and 
>>>> hence a dynamic job limit.
>>>>
>>>> I don't think this would hurt the scheduler granularity. In fact, it 
>>>> should even contribute to the desire of not letting the ring run dry 
>>>> even better. Especially for sequences of small jobs, where the 
>>>> current implementation might wrongly assume the ring is already full 
>>>> although actually there would still be enough space left.
>>>>
>>>>>
>>>>> Christian.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Otherwise your scheduler might just overwrite the ring buffer by 
>>>>>>> pushing things to fast.
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>> Given that, it seems like it would be better to let the 
>>>>>>>> scheduler keep track of empty ring "slots" instead, such that 
>>>>>>>> the scheduler can deceide whether a subsequent job will still 
>>>>>>>> fit on the ring and if not re-evaluate once a previous job 
>>>>>>>> finished. Of course each submitted job would be required to 
>>>>>>>> carry the number of slots it requires on the ring.
>>>>>>>>
>>>>>>>> What to you think of implementing this as alternative flow 
>>>>>>>> control mechanism? Implementation wise this could be a union 
>>>>>>>> with the existing hw_submission_limit.
>>>>>>>>
>>>>>>>> - Danilo
>>>>>>>>
>>>>>>>>>
>>>>>>>>> A problem with this design is currently a drm_gpu_scheduler uses a
>>>>>>>>> kthread for submission / job cleanup. This doesn't scale if a 
>>>>>>>>> large
>>>>>>>>> number of drm_gpu_scheduler are used. To work around the 
>>>>>>>>> scaling issue,
>>>>>>>>> use a worker rather than kthread for submission / job cleanup.
>>>>>>>>>
>>>>>>>>> v2:
>>>>>>>>>    - (Rob Clark) Fix msm build
>>>>>>>>>    - Pass in run work queue
>>>>>>>>> v3:
>>>>>>>>>    - (Boris) don't have loop in worker
>>>>>>>>> v4:
>>>>>>>>>    - (Tvrtko) break out submit ready, stop, start helpers into 
>>>>>>>>> own patch
>>>>>>>>>
>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>