[Intel-xe] [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread

Christian König christian.koenig at amd.com
Wed Aug 16 14:05:39 UTC 2023


On 16.08.23 at 13:30, Danilo Krummrich wrote:
> Hi Matt,
>
> On 8/11/23 04:31, Matthew Brost wrote:
>> In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
>> mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
>> seems a bit odd, but let us explain the reasoning below.
>>
>> 1. In XE the submission order from multiple drm_sched_entity is not
>> guaranteed to match the completion order, even when targeting the same
>> hardware engine. This is because in XE we have a firmware scheduler, the
>> GuC, which is allowed to reorder, timeslice, and preempt submissions. If a
>> shared drm_gpu_scheduler is used across multiple drm_sched_entity, the TDR
>> falls apart, as the TDR expects submission order == completion order. Using
>> a dedicated drm_gpu_scheduler per drm_sched_entity solves this problem.
>>
>> 2. In XE, submissions are done by programming a ring buffer (circular
>> buffer), and a drm_gpu_scheduler provides a limit on the number of
>> in-flight jobs. If that limit is set to RING_SIZE / MAX_SIZE_PER_JOB,
>> we get flow control on the ring for free.
>
> In XE, where does the limitation of MAX_SIZE_PER_JOB come from?
>
> In Nouveau we currently do have such a limitation as well, but it is 
> derived from the RING_SIZE, hence RING_SIZE / MAX_SIZE_PER_JOB would 
> always be 1. However, I think most jobs won't actually utilize the 
> whole ring.

Well, that should probably rather be RING_SIZE / MAX_SIZE_PER_JOB = 
hw_submission_limit (or even hw_submission_limit - 1 when the hw can't 
distinguish a full from an empty ring buffer).

Otherwise your scheduler might just overwrite the ring buffer by pushing 
things too fast.

Christian.

>
> Given that, it seems like it would be better to let the scheduler keep 
> track of empty ring "slots" instead, such that the scheduler can 
> decide whether a subsequent job will still fit on the ring and, if not, 
> re-evaluate once a previous job has finished. Of course, each submitted 
> job would be required to carry the number of slots it requires on the ring.
>
> What do you think of implementing this as an alternative flow control 
> mechanism? Implementation-wise this could be a union with the existing 
> hw_submission_limit.
>
> - Danilo
>
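
A rough sketch of how such credit tracking could look; the struct, field, 
and helper names below are made up for illustration and do not exist in 
the upstream scheduler today:

#include <linux/atomic.h>
#include <linux/types.h>

/* Hypothetical flow control state, union'ed with the existing limit. */
struct example_sched_flow {
	union {
		u32 hw_submission_limit;	/* today: max jobs in flight */
		u32 credit_limit;		/* proposed: max ring slots  */
	};
	atomic_t credit_count;			/* ring slots currently used */
};

/*
 * Each job would carry the number of ring slots it needs; the scheduler
 * would only pop the next job once those slots still fit on the ring.
 */
static bool example_job_fits(struct example_sched_flow *fc, u32 job_credits)
{
	return atomic_read(&fc->credit_count) + job_credits <=
	       fc->credit_limit;
}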
>>
>> A problem with this design is that a drm_gpu_scheduler currently uses a
>> kthread for submission / job cleanup. This doesn't scale if a large
>> number of drm_gpu_scheduler are used. To work around the scaling issue,
>> use a worker rather than a kthread for submission / job cleanup.
>>
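
As a side note, a minimal sketch of the direction this takes: submission 
runs from a work item queued on a workqueue instead of a per-scheduler 
kthread, so many schedulers can share a pool of worker threads. The names 
below are made up for illustration; this is not the actual patch:

#include <linux/container_of.h>
#include <linux/workqueue.h>

struct example_sched {
	struct workqueue_struct	*submit_wq;	/* shared, possibly ordered */
	struct work_struct	work_submit;
};

/*
 * This is where the real worker would pop one ready job, push it to the
 * hardware, and requeue itself if more jobs are ready (per the v3 note,
 * there is no loop inside the worker itself).
 */
static void example_submit_work(struct work_struct *w)
{
	struct example_sched *sched =
		container_of(w, struct example_sched, work_submit);

	(void)sched;	/* placeholder for the hardware push */
}

static void example_sched_init(struct example_sched *sched,
			       struct workqueue_struct *wq)
{
	sched->submit_wq = wq;
	INIT_WORK(&sched->work_submit, example_submit_work);
}

static void example_sched_wakeup(struct example_sched *sched)
{
	queue_work(sched->submit_wq, &sched->work_submit);
}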
>> v2:
>>    - (Rob Clark) Fix msm build
>>    - Pass in run work queue
>> v3:
>>    - (Boris) don't have loop in worker
>> v4:
>>    - (Tvrtko) break out submit ready, stop, start helpers into own patch
>>
>> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
>


