[Intel-xe] [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread

Danilo Krummrich dakr at redhat.com
Fri Aug 18 11:58:15 UTC 2023


On 8/17/23 18:17, Christian König wrote:
> Am 17.08.23 um 14:48 schrieb Danilo Krummrich:
>> On 8/17/23 15:35, Christian König wrote:
>>> Am 17.08.23 um 13:13 schrieb Danilo Krummrich:
>>>> On 8/17/23 07:33, Christian König wrote:
>>>>> [SNIP]
>>>>> My proposal would be to just keep the hw_submission_limit (maybe 
>>>>> rename it to submission_unit_limit) and add a submission_units 
>>>>> field to struct drm_sched_job. By default a job's submission_units 
>>>>> field would be 0 and the scheduler would behave the exact same way 
>>>>> as it does now.
>>>>>
>>>>> Accordingly, jobs with submission_units > 1 would contribute more 
>>>>> than one unit to the submission_unit_limit.
>>>>>
>>>>> What do you think about that?
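
For illustration, here is a rough, self-contained sketch of the
accounting behind the submission_units idea. Everything except the
submission_units / submission_unit_limit naming is a simplified
stand-in, not an actual drm_sched symbol:

/* Illustrative stand-ins for the real drm_sched structures. */
struct sched_stub {
	unsigned int submission_unit_limit;	/* was: hw_submission_limit */
	unsigned int units_in_flight;		/* units currently on the ring */
};

struct job_stub {
	unsigned int submission_units;		/* 0 = legacy behavior, counts as 1 */
};

/* Can this job be pushed without exceeding the unit limit? */
static bool sched_stub_can_queue(const struct sched_stub *sched,
				 const struct job_stub *job)
{
	unsigned int units = job->submission_units ? job->submission_units : 1;

	return sched->units_in_flight + units <= sched->submission_unit_limit;
}
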
>>>
>>> I think you are approaching this from the completely wrong side.
>>
>> First of all, thanks for keeping up the discussion - I appreciate it. 
>> Some more comments / questions below.
>>
>>>
>>> See, the UAPI needs to be stable, so you need a maximum job size; 
>>> otherwise it can happen that one combination of large and small 
>>> submissions works while a different combination doesn't.
>>
>> How is this related to the uAPI being stable? What do you mean by 
>> 'stable' in this context?
> 
> Stable as in you don't get inconsistent behavior, not stable in the 
> sense of backward compatibility. Sorry for the confusing wording :)
> 
>>
>> The Nouveau uAPI allows userspace to pass EXEC jobs by supplying the 
>> ring ID (channel), in-/out-syncs and a certain number of indirect push 
>> buffers (IBs). The number of IBs per job is limited by the number of 
>> IBs fitting into the ring. Just to be clear, when I say 'job size' I 
>> mean the number of IBs per job.
> 
> Well, that more or less sounds identical to all other hardware I know 
> of; AMD, Intel and the various ARM chips all seem to work like this. 
> But on those drivers the job size limit is not the ring size, but rather 
> a fixed value (at least as far as I know).
> 
>>
>> Maybe I should also mention that the rings we are talking about are 
>> software rings managed by a firmware scheduler. We can have an 
>> arbitrary number of software rings and even multiple ones per FD.
>>
>> Given a constant ring size I really don't see why I should limit the 
>> maximum number of IBs userspace can push per job just to end up with a 
>> hw_submission_limit > 1.
>>
>> For example, let's just assume the ring can take 128 IBs, why would I 
>> limit userspace to submit just e.g. 16 IBs at a time, such that the 
>> hw_submission_limit becomes 8?
> 
> Well, the question is what happens when you have two submissions back to 
> back which use more than half of the ring buffer?
> 
> I only see two possible outcomes:
> 1. You return an -EBUSY (or similar) error code indicating that the hw 
> can't receive more commands.
> 2. Wait on previously pushed commands to be executed.
> (3. Your driver crashes because you accidentally overwrite stuff in the 
> ring buffer which is still being executed. I just assume that's 
> prevented.)
> 
> Resolution #1 with -EBUSY is actually something the UAPI should not do, 
> because your UAPI then depends on the specific timing of submissions 
> which is a really bad idea.
> 
> Resolution #2 is usually bad because it forces the hw to run dry between 
> submissions and so degrades performance.

I agree, that is a good reason for at least limiting the maximum job 
size to half of the ring size.

However, there could still be cases where two subsequent jobs are 
submitted with just a single IB each, which as-is would still block 
subsequent jobs from being pushed to the ring although there is still 
plenty of space. Depending on the (CPU) scheduler latency, such a case 
can let the HW run dry as well.

Surely, we could continue to decrease the maximum job size even further, 
but this would result in additional overhead in userspace and the kernel 
for larger IB counts. Tracking the actual job size seems to be the 
better solution for drivers where the job size can vary over a rather 
large range.
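
As a toy illustration of the difference (the numbers are made up and not
taken from any particular driver): with a 128-entry ring and a fixed
per-job limit of 64 IBs, the hw_submission_limit is 2, so two 1-IB jobs
already stall a third one, whereas tracking the actual job size leaves
126 entries available:

#include <stdbool.h>
#include <stdio.h>

#define RING_ENTRIES	128
#define MAX_JOB_IBS	64	/* fixed limit: half the ring */

/* Fixed-size policy: count jobs, assume each occupies MAX_JOB_IBS entries. */
static bool fits_fixed(unsigned int jobs_in_flight)
{
	return jobs_in_flight < RING_ENTRIES / MAX_JOB_IBS;	/* limit == 2 */
}

/* Size-tracking policy: count the entries the jobs actually occupy. */
static bool fits_tracked(unsigned int entries_in_flight, unsigned int job_ibs)
{
	return entries_in_flight + job_ibs <= RING_ENTRIES;
}

int main(void)
{
	/* Two 1-IB jobs are already on the ring; does a third 1-IB job fit? */
	printf("fixed policy:    %s\n", fits_fixed(2) ? "yes" : "no");      /* no  */
	printf("tracking policy: %s\n", fits_tracked(2, 1) ? "yes" : "no"); /* yes */
	return 0;
}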

- Danilo

> 
>>
>> What is the advantage of doing that, rather than letting userspace 
>> submit *up to* 128 IBs per job and just letting the scheduler push IBs 
>> to the ring as long as there's actually space left on the ring?
> 
> Predictable behavior, I think. Basically you want to organize things so 
> that the hw is at least kept busy all the time without depending on 
> actual timing.
> 
>>
>>>
>>> So what you usually do, and this is driver independent because it is 
>>> simply a requirement of the UAPI, is that you say: here is my maximum 
>>> job size as well as the number of submissions which should be pushed 
>>> to the hw at the same time. The resulting ring size is then the 
>>> product of the two.
>>
>> Given the above, how is that a requirement of the uAPI?
> 
> The requirement of the UAPI is actually pretty simple: You should get 
> consistent results, independent of the timing (at least as long as you 
> don't do stuff in parallel).
> 
> Otherwise you can run into issues when on a certain configuration stuff 
> suddenly runs faster or slower than expected. In other words, you should 
> not depend on stuff finishing in a certain amount of time.
> 
>>
>>>
>>> That the ring in this use case can't be fully utilized is not a 
>>> drawback; this is a completely intentional design which should apply 
>>> to all drivers independent of the vendor.
>>
>> Why wouldn't we want to fully utilize the ring size?
> 
> As far as I know everybody restricts the submission size to something 
> fixed which is at least smaller than half the ring size to avoid the 
> problems mentioned above.
> 
> Regards,
> Christian.
> 
>>
>> - Danilo
>>
>>>
>>>>
>>>> Besides all that, you said that filling up the ring just enough to 
>>>> not let the HW run dry rather than filling it up entirely is 
>>>> desirable. Why do you think so? I tend to think that in most cases 
>>>> it shouldn't make a difference.
>>>
>>> That results in better scheduling behavior. It's mostly beneficial if 
>>> you don't have a hw scheduler, but as far as I can see there is no 
>>> need to pump everything to the hw as fast as possible.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> - Danilo
>>>>
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Because one really is the minimum if you want to do work at all, 
>>>>>> but as you mentioned above a job limit of one can let the ring run 
>>>>>> dry.
>>>>>>
>>>>>> In the end my proposal comes down to tracking the actual size of a 
>>>>>> job rather than just assuming a pre-defined maximum job size, and 
>>>>>> hence a dynamic job limit.
>>>>>>
>>>>>> I don't think this would hurt the scheduler granularity. In fact, 
>>>>>> it should serve the goal of not letting the ring run dry even 
>>>>>> better, especially for sequences of small jobs, where the current 
>>>>>> implementation might wrongly assume the ring is already full 
>>>>>> although there would actually still be enough space left.
>>>>>>
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Otherwise your scheduler might just overwrite the ring buffer 
>>>>>>>>> by pushing things too fast.
>>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Given that, it seems like it would be better to let the 
>>>>>>>>>> scheduler keep track of empty ring "slots" instead, such that 
>>>>>>>>>> the scheduler can decide whether a subsequent job will still 
>>>>>>>>>> fit on the ring and, if not, re-evaluate once a previous job 
>>>>>>>>>> finishes. Of course, each submitted job would be required to 
>>>>>>>>>> carry the number of slots it requires on the ring.
>>>>>>>>>>
>>>>>>>>>> What do you think of implementing this as an alternative flow 
>>>>>>>>>> control mechanism? Implementation-wise, this could be a union 
>>>>>>>>>> with the existing hw_submission_limit.
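
For illustration, the completion side of such slot tracking could look
roughly like this; ring_stub, its fields and wake_submit_worker() are
placeholders, not actual drm_sched symbols:

/* Illustrative only: placeholder types, not actual drm_sched code. */
struct ring_stub {
	unsigned int slots_total;	/* size of the software ring     */
	unsigned int slots_used;	/* slots occupied by queued jobs */
};

/* Check used by the submit path before pushing the next pending job. */
static bool ring_stub_has_room(const struct ring_stub *ring,
			       unsigned int job_slots)
{
	return ring->slots_used + job_slots <= ring->slots_total;
}

/* Called when a job is pushed to the ring. */
static void ring_stub_reserve(struct ring_stub *ring, unsigned int job_slots)
{
	ring->slots_used += job_slots;
}

/* Called from the job's completion path: free the slots and let the
 * scheduler re-evaluate whether the next pending job now fits. */
static void ring_stub_release(struct ring_stub *ring, unsigned int job_slots,
			      void (*wake_submit_worker)(void))
{
	ring->slots_used -= job_slots;
	wake_submit_worker();
}
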
>>>>>>>>>>
>>>>>>>>>> - Danilo
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> A problem with this design is that currently a 
>>>>>>>>>>> drm_gpu_scheduler uses a kthread for submission / job cleanup. 
>>>>>>>>>>> This doesn't scale if a large number of drm_gpu_schedulers are 
>>>>>>>>>>> used. To work around the scaling issue,
>>>>>>>>>>> use a worker rather than kthread for submission / job cleanup.
>>>>>>>>>>>
>>>>>>>>>>> v2:
>>>>>>>>>>>    - (Rob Clark) Fix msm build
>>>>>>>>>>>    - Pass in run work queue
>>>>>>>>>>> v3:
>>>>>>>>>>>    - (Boris) don't have loop in worker
>>>>>>>>>>> v4:
>>>>>>>>>>>    - (Tvrtko) break out submit ready, stop, start helpers 
>>>>>>>>>>> into own patch
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
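
For context, a minimal sketch of the general kthread-to-workqueue pattern
the patch description above refers to; sched_stub, run_job_work() and the
other names are placeholders, not the actual patch:

#include <linux/workqueue.h>

/* Each scheduler instance carries a work item instead of owning a
 * kthread; many instances can share a single workqueue. */
struct sched_stub {
	struct workqueue_struct	*submit_wq;	/* shared, passed in at init */
	struct work_struct	 work_run_job;
};

static void run_job_work(struct work_struct *w)
{
	struct sched_stub *sched =
		container_of(w, struct sched_stub, work_run_job);

	/* Pop one ready job and push it to the hw ring, then re-queue
	 * the work item if more jobs are pending (no loop in the worker). */
	(void)sched;
}

static void sched_stub_init(struct sched_stub *sched,
			    struct workqueue_struct *submit_wq)
{
	sched->submit_wq = submit_wq;
	INIT_WORK(&sched->work_run_job, run_job_work);
}

/* Wake-up path: instead of waking a kthread, queue the work item. */
static void sched_stub_wakeup(struct sched_stub *sched)
{
	queue_work(sched->submit_wq, &sched->work_run_job);
}
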
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 


