[Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

Mon Jan 9 17:27:19 UTC 2023

On Mon, Jan 9, 2023 at 7:46 AM Tvrtko Ursulin <
tvrtko.ursulin at linux.intel.com> wrote:

>
> On 06/01/2023 23:52, Matthew Brost wrote:
> > On Thu, Jan 05, 2023 at 09:43:41PM +0000, Matthew Brost wrote:
> >> On Tue, Jan 03, 2023 at 01:02:15PM +0000, Tvrtko Ursulin wrote:
> >>>
> >>> On 02/01/2023 07:30, Boris Brezillon wrote:
> >>>> On Fri, 30 Dec 2022 12:55:08 +0100
> >>>> Boris Brezillon <boris.brezillon at collabora.com> wrote:
> >>>>
> >>>>> On Fri, 30 Dec 2022 11:20:42 +0100
> >>>>> Boris Brezillon <boris.brezillon at collabora.com> wrote:
> >>>>>
> >>>>>> Hello Matthew,
> >>>>>>
> >>>>>> On Thu, 22 Dec 2022 14:21:11 -0800
> >>>>>> Matthew Brost <matthew.brost at intel.com> wrote:
> >>>>>>> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
> >>>>>>> mapping between a drm_gpu_scheduler and drm_sched_entity. At first
> this
> >>>>>>> seems a bit odd but let us explain the reasoning below.
> >>>>>>>
> >>>>>>> 1. In XE the submission order from multiple drm_sched_entity is not
> >>>>>>> guaranteed to be the same completion even if targeting the same
> hardware
> >>>>>>> engine. This is because in XE we have a firmware scheduler, the
> GuC,
> >>>>>>> which allowed to reorder, timeslice, and preempt submissions. If a
> using
> >>>>>>> shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR
> falls
> >>>>>>> apart as the TDR expects submission order == completion order.
> Using a
> >>>>>>> dedicated drm_gpu_scheduler per drm_sched_entity solve this
> problem.
> >>>>>>
> >>>>>> Oh, that's interesting. I've been trying to solve the same sort of
> >>>>>> issues to support Arm's new Mali GPU which is relying on a
> FW-assisted
> >>>>>> scheduling scheme (you give the FW N streams to execute, and it does
> >>>>>> the scheduling between those N command streams, the kernel driver
> >>>>>> does timeslice scheduling to update the command streams passed to
> the
> >>>>>> FW). I must admit I gave up on using drm_sched at some point, mostly
> >>>>>> because the integration with drm_sched was painful, but also
> because I
> >>>>>> felt trying to bend drm_sched to make it interact with a
> >>>>>> timeslice-oriented scheduling model wasn't really future proof.
> Giving
> >>>>>> drm_sched_entity exlusive access to a drm_gpu_scheduler probably
> might
> >>>>>> help for a few things (didn't think it through yet), but I feel it's
> >>>>>> coming short on other aspects we have to deal with on Arm GPUs.
> >>>>>
> >>>>> Ok, so I just had a quick look at the Xe driver and how it
> >>>>> instantiates the drm_sched_entity and drm_gpu_scheduler, and I think
> I
> >>>>> have a better understanding of how you get away with using drm_sched
> >>>>> while still controlling how scheduling is really done. Here
> >>>>> drm_gpu_scheduler is just a dummy abstract that let's you use the
> >>>>> drm_sched job queuing/dep/tracking mechanism. The whole run-queue
> >>>>> selection is dumb because there's only one entity ever bound to the
> >>>>> scheduler (the one that's part of the xe_guc_engine object which also
> >>>>> contains the drm_gpu_scheduler instance). I guess the main issue we'd
> >>>>> have on Arm is the fact that the stream doesn't necessarily get
> >>>>> scheduled when ->run_job() is called, it can be placed in the
> runnable
> >>>>> queue and be picked later by the kernel-side scheduler when a FW slot
> >>>>> gets released. That can probably be sorted out by manually disabling
> the
> >>>>> job timer and re-enabling it when the stream gets picked by the
> >>>>> scheduler. But my main concern remains, we're basically abusing
> >>>>> drm_sched here.
> >>>>>
> >>>>> For the Arm driver, that means turning the following sequence
> >>>>>
> >>>>> 1. wait for job deps
> >>>>> 2. queue job to ringbuf and push the stream to the runnable
> >>>>>      queue (if it wasn't queued already). Wakeup the timeslice
> scheduler
> >>>>>      to re-evaluate (if the stream is not on a FW slot already)
> >>>>> 3. stream gets picked by the timeslice scheduler and sent to the FW
> for
> >>>>>      execution
> >>>>>
> >>>>> into
> >>>>>
> >>>>> 1. queue job to entity which takes care of waiting for job deps for
> >>>>>      us
> >>>>> 2. schedule a drm_sched_main iteration
> >>>>> 3. the only available entity is picked, and the first job from this
> >>>>>      entity is dequeued. ->run_job() is called: the job is queued to
> the
> >>>>>      ringbuf and the stream is pushed to the runnable queue (if it
> wasn't
> >>>>>      queued already). Wakeup the timeslice scheduler to re-evaluate
> (if
> >>>>>      the stream is not on a FW slot already)
> >>>>> 4. stream gets picked by the timeslice scheduler and sent to the FW
> for
> >>>>>      execution
> >>>>>
> >>>>> That's one extra step we don't really need. To sum-up, yes, all the
> >>>>> job/entity tracking might be interesting to share/re-use, but I
> wonder
> >>>>> if we couldn't have that without pulling out the scheduling part of
> >>>>> drm_sched, or maybe I'm missing something, and there's something in
> >>>>> drm_gpu_scheduler you really need.
> >>>>
> >>>> On second thought, that's probably an acceptable overhead (not even
> >>>> sure the extra step I was mentioning exists in practice, because dep
> >>>> fence signaled state is checked as part of the drm_sched_main
> >>>> iteration, so that's basically replacing the worker I schedule to
> >>>> check job deps), and I like the idea of being able to re-use drm_sched
> >>>> dep-tracking without resorting to invasive changes to the existing
> >>>> logic, so I'll probably give it a try.
> >>>
> >>> I agree with the concerns and think that how Xe proposes to integrate
> with
> >>> drm_sched is a problem, or at least significantly inelegant.
> >>>
> >>
> >> Inelegant is a matter of opinion, I actually rather like this solution.
> >>
> >> BTW this isn't my design rather this was Jason's idea.
>

Sure, throw me under the bus, why don't you? :-P  Nah, it's all fine.  It's
my design and I'm happy to defend it or be blamed for it in the history
books as the case may be.

> >>> AFAICT it proposes to have 1:1 between *userspace* created contexts
> (per
> >>> context _and_ engine) and drm_sched. I am not sure avoiding invasive
> changes
> >>> to the shared code is in the spirit of the overall idea and instead
> >>> opportunity should be used to look at way to refactor/improve
> drm_sched.
>

Maybe?  I'm not convinced that what Xe is doing is an abuse at all or
really needs to drive a re-factor.  (More on that later.)  There's only one
real issue which is that it fires off potentially a lot of kthreads. Even
that's not that bad given that kthreads are pretty light and you're not
likely to have more kthreads than userspace threads which are much
heavier.  Not ideal, but not the end of the world either.  Definitely
something we can/should optimize but if we went through with Xe without
this patch, it would probably be mostly ok.

> >> Yes, it is 1:1 *userspace* engines and drm_sched.
> >>
> >> I'm not really prepared to make large changes to DRM scheduler at the
> >> moment for Xe as they are not really required nor does Boris seem they
> >> will be required for his work either. I am interested to see what Boris
> >> comes up with.
> >>
> >>> Even on the low level, the idea to replace drm_sched threads with
> workers
> >>> has a few problems.
> >>>
> >>> To start with, the pattern of:
> >>>
> >>>    while (not_stopped) {
> >>>     keep picking jobs
> >>>    }
> >>>
> >>> Feels fundamentally in disagreement with workers (while obviously fits
> >>> perfectly with the current kthread design).
> >>
> >> The while loop breaks and worker exists if no jobs are ready.
>

I'm not very familiar with workqueues. What are you saying would fit
better? One scheduling job per work item rather than one big work item
which handles all available jobs?

> >>> Secondly, it probably demands separate workers (not optional),
> otherwise
> >>> behaviour of shared workqueues has either the potential to explode
> number
> >>> kernel threads anyway, or add latency.
> >>>
> >>
> >> Right now the system_unbound_wq is used which does have a limit on the
> >> number of threads, right? I do have a FIXME to allow a worker to be
> >> passed in similar to TDR.
> >>
> >> WRT to latency, the 1:1 ratio could actually have lower latency as 2 GPU
> >> schedulers can be pushing jobs into the backend / cleaning up jobs in
> >> parallel.
> >>
> >
> > Thought of one more point here where why in Xe we absolutely want a 1 to
> > 1 ratio between entity and scheduler - the way we implement timeslicing
> > for preempt fences.
> >
> > Let me try to explain.
> >
> > Preempt fences are implemented via the generic messaging interface [1]
> > with suspend / resume messages. If a suspend messages is received to
> > soon after calling resume (this is per entity) we simply sleep in the
> > suspend call thus giving the entity a timeslice. This completely falls
> > apart with a many to 1 relationship as now a entity waiting for a
> > timeslice blocks the other entities. Could we work aroudn this, sure but
> > just another bunch of code we'd have to add in Xe. Being to freely sleep
> > in backend without affecting other entities is really, really nice IMO
> > and I bet Xe isn't the only driver that is going to feel this way.
> >
> > Last thing I'll say regardless of how anyone feels about Xe using a 1 to
> > 1 relationship this patch IMO makes sense as I hope we can all agree a
> > workqueue scales better than kthreads.
>
> I don't know for sure what will scale better and for what use case,
> combination of CPU cores vs number of GPU engines to keep busy vs other
> system activity. But I wager someone is bound to ask for some numbers to
> make sure proposal is not negatively affecting any other drivers.
>

Then let them ask.  Waving your hands vaguely in the direction of the rest
of DRM and saying "Uh, someone (not me) might object" is profoundly
unhelpful.  Sure, someone might.  That's why it's on dri-devel.  If you
think there's someone in particular who might have a useful opinion on
this, throw them in the CC so they don't miss the e-mail thread.

Or are you asking for numbers?  If so, what numbers are you asking for?

Also, If we're talking about a design that might paint us into an
Intel-HW-specific hole, that would be one thing.  But we're not.  We're
talking about switching which kernel threading/task mechanism to use for
what's really a very generic problem.  The core Xe design works without
this patch (just with more kthreads).  If we land this patch or something
like it and get it wrong and it causes a performance problem for someone
down the line, we can revisit it.

> In any case that's a low level question caused by the high level design
> decision. So I'd think first focus on the high level - which is the 1:1
> mapping of entity to scheduler instance proposal.
>
> Fundamentally it will be up to the DRM maintainers and the community to
> bless your approach. And it is important to stress 1:1 is about
> userspace contexts, so I believe unlike any other current scheduler
> user. And also important to stress this effectively does not make Xe
> _really_ use the scheduler that much.
>

I don't think this makes Xe nearly as much of a one-off as you think it
does.  I've already told the Asahi team working on Apple M1/2 hardware to
do it this way and it seems to be a pretty good mapping for them.  I
believe this is roughly the plan for nouveau as well.  It's not the way it
currently works for anyone because most other groups aren't doing FW
scheduling yet.  In the world of FW scheduling and hardware designed to
support userspace direct-to-FW submit, I think the design makes perfect
sense (see below) and I expect we'll see more drivers move in this
direction as those drivers evolve.  (AMD is doing some customish thing for
how with gpu_scheduler on the front-end somehow.  I've not dug into those
details.)

> I can only offer my opinion, which is that the two options mentioned in
> this thread (either improve drm scheduler to cope with what is required,
> or split up the code so you can use just the parts of drm_sched which
> you want - which is frontend dependency tracking) shouldn't be so
> readily dismissed, given how I think the idea was for the new driver to
> work less in a silo and more in the community (not do kludges to
> workaround stuff because it is thought to be too hard to improve common
> code), but fundamentally, "goto previous paragraph" for what I am
> concerned.
>

Meta comment:  It appears as if you're falling into the standard i915 team
trap of having an internal discussion about what the community discussion
might look like instead of actually having the community discussion.  If
you are seriously concerned about interactions with other drivers or
whether or setting common direction, the right way to do that is to break a
patch or two out into a separate RFC series and tag a handful of driver
maintainers.  Trying to predict the questions other people might ask is
pointless. Cc them and asking for their input instead.

> Regards,
>
> Tvrtko
>
> P.S. And as a related side note, there are more areas where drm_sched
> could be improved, like for instance priority handling.
> Take a look at msm_submitqueue_create / msm_gpu_convert_priority /
> get_sched_entity to see how msm works around the drm_sched hardcoded
> limit of available priority levels, in order to avoid having to leave a
> hw capability unused. I suspect msm would be happier if they could have
> all priority levels equal in terms of whether they apply only at the
> frontend level or completely throughout the pipeline.
>
> > [1] https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> >
> >>> What would be interesting to learn is whether the option of refactoring
> >>> drm_sched to deal with out of order completion was considered and what
> were
> >>> the conclusions.
> >>>
> >>
> >> I coded this up a while back when trying to convert the i915 to the DRM
> >> scheduler it isn't all that hard either. The free flow control on the
> >> ring (e.g. set job limit == SIZE OF RING / MAX JOB SIZE) is really what
> >> sold me on the this design.
>

You're not the only one to suggest supporting out-of-order completion.
However, it's tricky and breaks a lot of internal assumptions of the
scheduler. It also reduces functionality a bit because it can no longer
automatically rate-limit HW/FW queues which are often fixed-size.  (Ok,
yes, it probably could but it becomes a substantially harder problem.)

It also seems like a worse mapping to me.  The goal here is to turn
submissions on a userspace-facing engine/queue into submissions to a FW
queue submissions, sorting out any dma_fence dependencies.  Matt's
description of saying this is a 1:1 mapping between sched/entity doesn't
tell the whole story. It's a 1:1:1 mapping between xe_engine,
gpu_scheduler, and GuC FW engine.  Why make it a 1:something:1 mapping?
Why is that better?

There are two places where this 1:1:1 mapping is causing problems:

 1. It creates lots of kthreads. This is what this patch is trying to
solve. IDK if it's solving it the best way but that's the goal.

 2. There are a far more limited number of communication queues between the
kernel and GuC for more meta things like pausing and resuming queues,
getting events back from GuC, etc. Unless we're in a weird pressure
scenario, the amount of traffic on this queue should be low so we can
probably just have one per physical device.  The vast majority of kernel ->
GuC communication should be on the individual FW queue rings and maybe
smashing in-memory doorbells.

Doing out-of-order completion sort-of solves the 1 but does nothing for 2
and actually makes managing FW queues harder because we no longer have
built-in rate limiting.  Seems like a net loss to me.

>>> Second option perhaps to split out the drm_sched code into parts which
> would
> >>> lend themselves more to "pick and choose" of its functionalities.
> >>> Specifically, Xe wants frontend dependency tracking, but not any
> scheduling
> >>> really (neither least busy drm_sched, neither FIFO/RQ entity picking),
> so
> >>> even having all these data structures in memory is a waste.
> >>>
> >>
> >> I don't think that we are wasting memory is a very good argument for
> >> making intrusive changes to the DRM scheduler.
>

Worse than that, I think the "we could split it up" kind-of misses the
point of the way Xe is using drm/scheduler.  It's not just about re-using a
tiny bit of dependency tracking code.  Using the scheduler in this way
provides a clean separation between front-end and back-end.  The job of the
userspace-facing ioctl code is to shove things on the scheduler.  The job
of the run_job callback is to encode the job into the FW queue format,
stick it in the FW queue ring, and maybe smash a doorbell.  Everything else
happens in terms of managing those queues side-band.  The gpu_scheduler
code manages the front-end queues and Xe manages the FW queues via the
Kernel <-> GuC communication rings.  From a high level, this is a really
clean design.  There are potentially some sticky bits around the dual-use
of dma_fence for scheduling and memory management but none of those are
solved by breaking the DRM scheduler into chunks or getting rid of the
1:1:1 mapping.

If we split it out, we're basically asking the driver to implement a bunch
of kthread or workqueue stuff, all the ring rate-limiting, etc.  It may not
be all that much code but also, why?  To save a few bytes of memory per
engine?  Each engine already has 32K(ish) worth of context state and a
similar size ring to communicate with the FW.  No one is going to notice an
extra CPU data structure.

I'm not seeing a solid argument against the 1:1:1 design here other than
that it doesn't seem like the way DRM scheduler was intended to be used.  I
won't argue that.  It's not.  But it is a fairly natural way to take
advantage of the benefits the DRM scheduler does provide while also mapping
it to hardware that was designed for userspace direct-to-FW submit.

--Jason

> >>> With the first option then the end result could be drm_sched per engine
> >>> class (hardware view), which I think fits with the GuC model. Give all
> >>> schedulable contexts (entities) to the GuC and then mostly forget about
> >>> them. Timeslicing and re-ordering and all happens transparently to the
> >>> kernel from that point until completion.
> >>>
> >>
> >> Out-of-order problem still exists here.
> >>
> >>> Or with the second option you would build on some smaller refactored
> >>> sub-components of drm_sched, by maybe splitting the dependency
> tracking from
> >>> scheduling (RR/FIFO entity picking code).
> >>>
> >>> Second option is especially a bit vague and I haven't thought about the
> >>> required mechanics, but it just appeared too obvious the proposed
> design has
> >>> a bit too much impedance mismatch.
> >>>
> >>
> >> IMO ROI on this is low and again lets see what Boris comes up with.
> >>
> >> Matt
> >>
> >>> Oh and as a side note, when I went into the drm_sched code base to
> remind
> >>> myself how things worked, it is quite easy to find some FIXME comments
> which
> >>> suggest people working on it are unsure of locking desing there and
> such. So
> >>> perhaps that all needs cleanup too, I mean would benefit from
> >>> refactoring/improving work as brainstormed above anyway.
> >>>
> >>> Regards,
> >>>
> >>> Tvrtko
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/dri-devel/attachments/20230109/306cebb3/attachment-0001.htm>