[Intel-xe] [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread

Matthew Brost matthew.brost at intel.com
Fri Aug 18 03:08:52 UTC 2023


On Thu, Aug 17, 2023 at 01:13:31PM +0200, Danilo Krummrich wrote:
> On 8/17/23 07:33, Christian König wrote:
> > Am 16.08.23 um 18:33 schrieb Danilo Krummrich:
> > > On 8/16/23 16:59, Christian König wrote:
> > > > Am 16.08.23 um 14:30 schrieb Danilo Krummrich:
> > > > > On 8/16/23 16:05, Christian König wrote:
> > > > > > Am 16.08.23 um 13:30 schrieb Danilo Krummrich:
> > > > > > > Hi Matt,
> > > > > > > 
> > > > > > > On 8/11/23 04:31, Matthew Brost wrote:
> > > > > > > > In XE, the new Intel GPU driver, a choice has been made
> > > > > > > > to have a 1 to 1 mapping between a drm_gpu_scheduler and
> > > > > > > > drm_sched_entity. At first this seems a bit odd but let
> > > > > > > > us explain the reasoning below.
> > > > > > > > 
> > > > > > > > 1. In XE the submission order from multiple
> > > > > > > > drm_sched_entity is not guaranteed to match the
> > > > > > > > completion order, even if targeting the same hardware
> > > > > > > > engine. This is because in XE we have a firmware
> > > > > > > > scheduler, the GuC, which is allowed to reorder,
> > > > > > > > timeslice, and preempt submissions. If a shared
> > > > > > > > drm_gpu_scheduler is used across multiple
> > > > > > > > drm_sched_entity, the TDR falls apart as the TDR expects
> > > > > > > > submission order == completion order. Using a dedicated
> > > > > > > > drm_gpu_scheduler per drm_sched_entity solves this
> > > > > > > > problem.
> > > > > > > > 
> > > > > > > > 2. In XE submissions are done by programming a ring
> > > > > > > > buffer (circular buffer), and a drm_gpu_scheduler
> > > > > > > > provides a limit on the number of jobs. If the job limit
> > > > > > > > is set to RING_SIZE / MAX_SIZE_PER_JOB, we get flow
> > > > > > > > control on the ring for free.
> > > > > > > 
> > > > > > > In XE, where does the limitation of MAX_SIZE_PER_JOB come from?
> > > > > > > 
> > > > > > > In Nouveau we currently do have such a limitation as
> > > > > > > well, but it is derived from the RING_SIZE, hence
> > > > > > > RING_SIZE / MAX_SIZE_PER_JOB would always be 1.
> > > > > > > However, I think most jobs won't actually utilize
> > > > > > > the whole ring.
> > > > > > 
> > > > > > Well that should probably rather be RING_SIZE /
> > > > > > MAX_SIZE_PER_JOB = hw_submission_limit (or even
> > > > > > hw_submission_limit - 1 when the hw can't distinguish a
> > > > > > full from an empty ring buffer).
> > > > > 
> > > > > Not sure if I get you right, let me try to clarify what I
> > > > > was trying to say: I wanted to say that in Nouveau
> > > > > MAX_SIZE_PER_JOB isn't really limited by anything other than
> > > > > the RING_SIZE and hence we'd never allow more than 1 active
> > > > > job.
> > > > 
> > > > But that lets the hw run dry between submissions. That is
> > > > usually a pretty horrible idea for performance.
> > > 
> > > Correct, that's the reason why I said it seems to be more efficient
> > > to base ring flow control on the actual size of each incoming job
> > > rather than the maximum size of a job.
> > > 
> > > > 
> > > > > 
> > > > > However, it seems to be more efficient to base ring flow
> > > > > control on the actual size of each incoming job rather than
> > > > > the worst case, namely the maximum size of a job.
> > > > 
> > > > That doesn't sound like a good idea to me. See, we don't limit
> > > > the number of submitted jobs based on the ring size, but rather
> > > > we calculate the ring size based on the number of submitted
> > > > jobs.
> > > > 
> > > 
> > > My point isn't really about whether we derive the ring size from the
> > > job limit or the other way around. It's more about the job size (or
> > > its maximum size) being arbitrary.
> > > 
> > > As mentioned in my reply to Matt:
> > > 
> > > "In Nouveau, userspace can submit an arbitrary amount of addresses
> > > of indirect bufferes containing the ring instructions. The ring on
> > > the kernel side takes the addresses of the indirect buffers rather
> > > than the instructions themself. Hence, technically there isn't
> > > really a limit on the amount of IBs submitted by a job except for
> > > the ring size."
> > > 
> > > So, my point is that I don't really want to limit the job size
> > > artificially just to be able to fit multiple jobs into the ring even
> > > if they're submitted at their "artificial" maximum size, but rather
> > > track how much of the ring the submitted job actually occupies.
> > > 
> > > > In other words the hw_submission_limit defines the ring size,
> > > > not the other way around. And you usually want the
> > > > hw_submission_limit as low as possible for good scheduler
> > > > granularity and to avoid extra overhead.
> > > 
> > > I don't think you really mean "as low as possible", do you?
> > 
> > No, I do mean as low as possible or in other words as few as possible.
> > 
> > Ideally the scheduler would submit only the minimum amount of work to
> > the hardware to keep the hardware busy.
> > 
> > The hardware seems to work mostly the same for all vendors, but you
> > seem to think that filling the ring is somehow beneficial, which is
> > really not the case as far as I can see.
> 
> I think that's a misunderstanding. I'm not trying to say that it is *always*
> beneficial to fill up the ring as much as possible. But I think it is under
> certain circumstances, exactly those circumstances I described for Nouveau.
> 
> As mentioned, in Nouveau the size of a job is only really limited by the
> ring size, which means that one job can (but does not necessarily) fill up
> the whole ring. We both agree that this is inefficient, because it
> potentially results in the HW running dry due to hw_submission_limit == 1.
> 
> I recognize you said that one should define hw_submission_limit and adjust
> the other parts of the equation accordingly; the options I see are:
> 
> (1) Increase the ring size while keeping the maximum job size.
> (2) Decrease the maximum job size while keeping the ring size.
> (3) Let the scheduler track the actual job size rather than the maximum job
> size.
> 
> (1) results in potentially wasted ring memory, because jobs don't always
> reach the maximum job size, but the scheduler assumes they do.
> 
> (2) results in more IOCTLs from userspace for the same number of IBs, and
> more jobs result in more memory allocations and more work being submitted
> to the workqueue (with Matt's patches).
> 
> (3) doesn't seem to have any of those drawbacks.
> 
> What would be your take on that?
> 
> Actually, if none of the other drivers is interested in a more precise way
> of keeping track of the ring utilization, I'd be totally fine doing it in a
> driver-specific way. However, unfortunately I don't see how this would be
> possible.
> 
> My proposal would be to just keep the hw_submission_limit (maybe rename it
> to submission_unit_limit) and add a submission_units field to struct
> drm_sched_job. By default a job's submission_units field would be 0 and the
> scheduler would behave exactly the way it does now.
> 
> Accordingly, jobs with submission_units > 1 would contribute more than one
> unit to the submission_unit_limit.
> 
> What do you think about that?
> 

This seems reasonable to me and a very minimal change to the scheduler.
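
Just to make sure we're on the same page, here is a rough sketch of how I
read the proposal. All of the names below (submission_unit_limit,
queued_units, job_units(), sched_job_fits_ring()) are placeholders for
illustration only and don't correspond to existing scheduler code:

        /*
         * Rough sketch only; submission_unit_limit, queued_units and
         * the helpers are hypothetical, not existing drm_sched code.
         */
        struct drm_sched_job {
                /* ... existing members ... */

                /* Ring "units" this job occupies; 0 keeps the current
                 * behavior of counting every job as a single unit. */
                u32 submission_units;
        };

        static u32 job_units(struct drm_sched_job *job)
        {
                return job->submission_units ?: 1;
        }

        static bool sched_job_fits_ring(struct drm_gpu_scheduler *sched,
                                        struct drm_sched_job *job)
        {
                /* queued_units would be incremented when a job is
                 * pushed to the hardware and decremented when the job
                 * completes. */
                return atomic_read(&sched->queued_units) +
                       job_units(job) <= sched->submission_unit_limit;
        }

With submission_units left at 0 everywhere this degenerates to today's
one-job-per-slot accounting, so existing drivers wouldn't need any changes.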

Matt 

> Besides all that, you said that filling up the ring just enough to not let
> the HW run dry, rather than filling it up entirely, is desirable. Why do you
> think so? I tend to think that in most cases it shouldn't make a difference.
> 
> - Danilo
> 
> > 
> > Regards,
> > Christian.
> > 
> > > Because one really is the minimum if you want to do work at all, but
> > > as you mentioned above a job limit of one can let the ring run dry.
> > > 
> > > In the end my proposal comes down to tracking the actual size of a
> > > job rather than just assuming a pre-defined maximum job size, and
> > > hence a dynamic job limit.
> > > 
> > > I don't think this would hurt the scheduler granularity. In fact, it
> > > should serve the goal of not letting the ring run dry even better,
> > > especially for sequences of small jobs, where the current
> > > implementation might wrongly assume the ring is already full
> > > although there would actually still be enough space left.
> > > 
> > > > 
> > > > Christian.
> > > > 
> > > > > 
> > > > > > 
> > > > > > Otherwise your scheduler might just overwrite the ring
> > > > > > buffer by pushing things too fast.
> > > > > > 
> > > > > > Christian.
> > > > > > 
> > > > > > > 
> > > > > > > Given that, it seems like it would be better to let
> > > > > > > the scheduler keep track of empty ring "slots"
> > > > > > > instead, such that the scheduler can decide whether a
> > > > > > > subsequent job will still fit on the ring and, if
> > > > > > > not, re-evaluate once a previous job has finished. Of
> > > > > > > course each submitted job would be required to carry
> > > > > > > the number of slots it requires on the ring.
> > > > > > > 
> > > > > > > What do you think of implementing this as an
> > > > > > > alternative flow control mechanism?
> > > > > > > Implementation-wise this could be a union with the
> > > > > > > existing hw_submission_limit.
> > > > > > > 
> > > > > > > - Danilo
> > > > > > > 
> > > > > > > > 
> > > > > > > > A problem with this design is that currently a
> > > > > > > > drm_gpu_scheduler uses a kthread for submission / job
> > > > > > > > cleanup. This doesn't scale if a large number of
> > > > > > > > drm_gpu_scheduler instances are used. To work around the
> > > > > > > > scaling issue, use a worker rather than a kthread for
> > > > > > > > submission / job cleanup.
> > > > > > > > 
> > > > > > > > v2:
> > > > > > > >    - (Rob Clark) Fix msm build
> > > > > > > >    - Pass in run work queue
> > > > > > > > v3:
> > > > > > > >    - (Boris) don't have loop in worker
> > > > > > > > v4:
> > > > > > > >    - (Tvrtko) break out submit ready, stop,
> > > > > > > > start helpers into own patch
> > > > > > > > 
> > > > > > > > Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

