[Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

Wed Jan 11 01:13:45 UTC 2023

On Tue, Jan 10, 2023 at 04:39:00PM +0000, Matthew Brost wrote:
> On Tue, Jan 10, 2023 at 11:28:08AM +0000, Tvrtko Ursulin wrote:
> > 
> > 
> > On 09/01/2023 17:27, Jason Ekstrand wrote:
> > 
> > [snip]
> > 
> > >      >>> AFAICT it proposes to have 1:1 between *userspace* created
> > >     contexts (per
> > >      >>> context _and_ engine) and drm_sched. I am not sure avoiding
> > >     invasive changes
> > >      >>> to the shared code is in the spirit of the overall idea and instead
> > >      >>> opportunity should be used to look at way to refactor/improve
> > >     drm_sched.
> > > 
> > > 
> > > Maybe?  I'm not convinced that what Xe is doing is an abuse at all or
> > > really needs to drive a re-factor.  (More on that later.)  There's only
> > > one real issue which is that it fires off potentially a lot of kthreads.
> > > Even that's not that bad given that kthreads are pretty light and you're
> > > not likely to have more kthreads than userspace threads which are much
> > > heavier.  Not ideal, but not the end of the world either.  Definitely
> > > something we can/should optimize but if we went through with Xe without
> > > this patch, it would probably be mostly ok.
> > > 
> > >      >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > >      >>
> > >      >> I'm not really prepared to make large changes to DRM scheduler
> > >     at the
> > >      >> moment for Xe as they are not really required nor does Boris
> > >     seem they
> > >      >> will be required for his work either. I am interested to see
> > >     what Boris
> > >      >> comes up with.
> > >      >>
> > >      >>> Even on the low level, the idea to replace drm_sched threads
> > >     with workers
> > >      >>> has a few problems.
> > >      >>>
> > >      >>> To start with, the pattern of:
> > >      >>>
> > >      >>>    while (not_stopped) {
> > >      >>>     keep picking jobs
> > >      >>>    }
> > >      >>>
> > >      >>> Feels fundamentally in disagreement with workers (while
> > >     obviously fits
> > >      >>> perfectly with the current kthread design).
> > >      >>
> > >      >> The while loop breaks and worker exists if no jobs are ready.
> > > 
> > > 
> > > I'm not very familiar with workqueues. What are you saying would fit
> > > better? One scheduling job per work item rather than one big work item
> > > which handles all available jobs?
> > 
> > Yes and no, it indeed IMO does not fit to have a work item which is
> > potentially unbound in runtime. But it is a bit moot conceptual mismatch
> > because it is a worst case / theoretical, and I think due more fundamental
> > concerns.
> > 
> > If we have to go back to the low level side of things, I've picked this
> > random spot to consolidate what I have already mentioned and perhaps expand.
> > 
> > To start with, let me pull out some thoughts from workqueue.rst:
> > 
> > """
> > Generally, work items are not expected to hog a CPU and consume many cycles.
> > That means maintaining just enough concurrency to prevent work processing
> > from stalling should be optimal.
> > """
> > 
> > For unbound queues:
> > """
> > The responsibility of regulating concurrency level is on the users.
> > """
> > 
> > Given the unbound queues will be spawned on demand to service all queued
> > work items (more interesting when mixing up with the system_unbound_wq), in
> > the proposed design the number of instantiated worker threads does not
> > correspond to the number of user threads (as you have elsewhere stated), but
> > pessimistically to the number of active user contexts. That is the number
> > which drives the maximum number of not-runnable jobs that can become
> > runnable at once, and hence spawn that many work items, and in turn unbound
> > worker threads.
> > 
> > Several problems there.
> > 
> > It is fundamentally pointless to have potentially that many more threads
> > than the number of CPU cores - it simply creates a scheduling storm.
> > 
> 
> We can use a different work queue if this is an issue, have a FIXME
> which indicates we should allow the user to pass in the work queue.
> 
> > Unbound workers have no CPU / cache locality either and no connection with
> > the CPU scheduler to optimize scheduling patterns. This may matter either on
> > large systems or on small ones. Whereas the current design allows for
> > scheduler to notice userspace CPU thread keeps waking up the same drm
> > scheduler kernel thread, and so it can keep them on the same CPU, the
> > unbound workers lose that ability and so 2nd CPU might be getting woken up
> > from low sleep for every submission.
> >
> 
> I guess I don't understand kthread vs. workqueue scheduling internals.
>  

Looked into this and we are not using unbound workers rather we are just
using the system_wq which is indeed bound. Again we can change this so a
user can just pass in worker too. After doing a of research bound
workers allows the scheduler to use locality too avoid that exact
problem your reading.

TL;DR I'm not buying any of these arguments although it is possible I am
missing something.

Matt 

> > Hence, apart from being a bit of a impedance mismatch, the proposal has the
> > potential to change performance and power patterns and both large and small
> > machines.
> >
> 
> We are going to have to test this out I suppose and play around to see
> if this design has any real world impacts. As Jason said, yea probably
> will need a bit of help here from others. Will CC relavent parties on
> next rev. 
>  
> > >      >>> Secondly, it probably demands separate workers (not optional),
> > >     otherwise
> > >      >>> behaviour of shared workqueues has either the potential to
> > >     explode number
> > >      >>> kernel threads anyway, or add latency.
> > >      >>>
> > >      >>
> > >      >> Right now the system_unbound_wq is used which does have a limit
> > >     on the
> > >      >> number of threads, right? I do have a FIXME to allow a worker to be
> > >      >> passed in similar to TDR.
> > >      >>
> > >      >> WRT to latency, the 1:1 ratio could actually have lower latency
> > >     as 2 GPU
> > >      >> schedulers can be pushing jobs into the backend / cleaning up
> > >     jobs in
> > >      >> parallel.
> > >      >>
> > >      >
> > >      > Thought of one more point here where why in Xe we absolutely want
> > >     a 1 to
> > >      > 1 ratio between entity and scheduler - the way we implement
> > >     timeslicing
> > >      > for preempt fences.
> > >      >
> > >      > Let me try to explain.
> > >      >
> > >      > Preempt fences are implemented via the generic messaging
> > >     interface [1]
> > >      > with suspend / resume messages. If a suspend messages is received to
> > >      > soon after calling resume (this is per entity) we simply sleep in the
> > >      > suspend call thus giving the entity a timeslice. This completely
> > >     falls
> > >      > apart with a many to 1 relationship as now a entity waiting for a
> > >      > timeslice blocks the other entities. Could we work aroudn this,
> > >     sure but
> > >      > just another bunch of code we'd have to add in Xe. Being to
> > >     freely sleep
> > >      > in backend without affecting other entities is really, really
> > >     nice IMO
> > >      > and I bet Xe isn't the only driver that is going to feel this way.
> > >      >
> > >      > Last thing I'll say regardless of how anyone feels about Xe using
> > >     a 1 to
> > >      > 1 relationship this patch IMO makes sense as I hope we can all
> > >     agree a
> > >      > workqueue scales better than kthreads.
> > > 
> > >     I don't know for sure what will scale better and for what use case,
> > >     combination of CPU cores vs number of GPU engines to keep busy vs other
> > >     system activity. But I wager someone is bound to ask for some
> > >     numbers to
> > >     make sure proposal is not negatively affecting any other drivers.
> > > 
> > > 
> > > Then let them ask.  Waving your hands vaguely in the direction of the
> > > rest of DRM and saying "Uh, someone (not me) might object" is profoundly
> > > unhelpful.  Sure, someone might.  That's why it's on dri-devel.  If you
> > > think there's someone in particular who might have a useful opinion on
> > > this, throw them in the CC so they don't miss the e-mail thread.
> > > 
> > > Or are you asking for numbers?  If so, what numbers are you asking for?
> > 
> > It was a heads up to the Xe team in case people weren't appreciating how the
> > proposed change has the potential influence power and performance across the
> > board. And nothing in the follow up discussion made me think it was
> > considered so I don't think it was redundant to raise it.
> > 
> > In my experience it is typical that such core changes come with some
> > numbers. Which is in case of drm scheduler is tricky and probably requires
> > explicitly asking everyone to test (rather than count on "don't miss the
> > email thread"). Real products can fail to ship due ten mW here or there.
> > Like suddenly an extra core prevented from getting into deep sleep.
> > 
> > If that was "profoundly unhelpful" so be it.
> > 
> > > Also, If we're talking about a design that might paint us into an
> > > Intel-HW-specific hole, that would be one thing.  But we're not.  We're
> > > talking about switching which kernel threading/task mechanism to use for
> > > what's really a very generic problem.  The core Xe design works without
> > > this patch (just with more kthreads).  If we land this patch or
> > > something like it and get it wrong and it causes a performance problem
> > > for someone down the line, we can revisit it.
> > 
> > For some definition of "it works" - I really wouldn't suggest shipping a
> > kthread per user context at any point.
> >
> 
> Yea, this is why using a workqueue rathre than a kthread was suggested
> to me by AMD. I should've put a suggested by on the commit message, need
> to dig through my emails and figure out who exactly suggested this.
>  
> > >     In any case that's a low level question caused by the high level design
> > >     decision. So I'd think first focus on the high level - which is the 1:1
> > >     mapping of entity to scheduler instance proposal.
> > > 
> > >     Fundamentally it will be up to the DRM maintainers and the community to
> > >     bless your approach. And it is important to stress 1:1 is about
> > >     userspace contexts, so I believe unlike any other current scheduler
> > >     user. And also important to stress this effectively does not make Xe
> > >     _really_ use the scheduler that much.
> > > 
> > > 
> > > I don't think this makes Xe nearly as much of a one-off as you think it
> > > does.  I've already told the Asahi team working on Apple M1/2 hardware
> > > to do it this way and it seems to be a pretty good mapping for them. I
> > > believe this is roughly the plan for nouveau as well.  It's not the way
> > > it currently works for anyone because most other groups aren't doing FW
> > > scheduling yet.  In the world of FW scheduling and hardware designed to
> > > support userspace direct-to-FW submit, I think the design makes perfect
> > > sense (see below) and I expect we'll see more drivers move in this
> > > direction as those drivers evolve.  (AMD is doing some customish thing
> > > for how with gpu_scheduler on the front-end somehow. I've not dug into
> > > those details.)
> > > 
> > >     I can only offer my opinion, which is that the two options mentioned in
> > >     this thread (either improve drm scheduler to cope with what is
> > >     required,
> > >     or split up the code so you can use just the parts of drm_sched which
> > >     you want - which is frontend dependency tracking) shouldn't be so
> > >     readily dismissed, given how I think the idea was for the new driver to
> > >     work less in a silo and more in the community (not do kludges to
> > >     workaround stuff because it is thought to be too hard to improve common
> > >     code), but fundamentally, "goto previous paragraph" for what I am
> > >     concerned.
> > > 
> > > 
> > > Meta comment:  It appears as if you're falling into the standard i915
> > > team trap of having an internal discussion about what the community
> > > discussion might look like instead of actually having the community
> > > discussion.  If you are seriously concerned about interactions with
> > > other drivers or whether or setting common direction, the right way to
> > > do that is to break a patch or two out into a separate RFC series and
> > > tag a handful of driver maintainers.  Trying to predict the questions
> > > other people might ask is pointless. Cc them and asking for their input
> > > instead.
> > 
> > I don't follow you here. It's not an internal discussion - I am raising my
> > concerns on the design publicly. I am supposed to write a patch to show
> > something, but am allowed to comment on a RFC series?
> > 
> > It is "drm/sched: Convert drm scheduler to use a work queue rather than
> > kthread" which should have Cc-ed _everyone_ who use drm scheduler.
> >
> 
> Yea, will do on next rev.
>  
> > > 
> > >     Regards,
> > > 
> > >     Tvrtko
> > > 
> > >     P.S. And as a related side note, there are more areas where drm_sched
> > >     could be improved, like for instance priority handling.
> > >     Take a look at msm_submitqueue_create / msm_gpu_convert_priority /
> > >     get_sched_entity to see how msm works around the drm_sched hardcoded
> > >     limit of available priority levels, in order to avoid having to leave a
> > >     hw capability unused. I suspect msm would be happier if they could have
> > >     all priority levels equal in terms of whether they apply only at the
> > >     frontend level or completely throughout the pipeline.
> > > 
> > >      > [1]
> > >     https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> > >     <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1>
> > >      >
> > >      >>> What would be interesting to learn is whether the option of
> > >     refactoring
> > >      >>> drm_sched to deal with out of order completion was considered
> > >     and what were
> > >      >>> the conclusions.
> > >      >>>
> > >      >>
> > >      >> I coded this up a while back when trying to convert the i915 to
> > >     the DRM
> > >      >> scheduler it isn't all that hard either. The free flow control
> > >     on the
> > >      >> ring (e.g. set job limit == SIZE OF RING / MAX JOB SIZE) is
> > >     really what
> > >      >> sold me on the this design.
> > > 
> > > 
> > > You're not the only one to suggest supporting out-of-order completion.
> > > However, it's tricky and breaks a lot of internal assumptions of the
> > > scheduler. It also reduces functionality a bit because it can no longer
> > > automatically rate-limit HW/FW queues which are often fixed-size.  (Ok,
> > > yes, it probably could but it becomes a substantially harder problem.)
> > > 
> > > It also seems like a worse mapping to me.  The goal here is to turn
> > > submissions on a userspace-facing engine/queue into submissions to a FW
> > > queue submissions, sorting out any dma_fence dependencies.  Matt's
> > > description of saying this is a 1:1 mapping between sched/entity doesn't
> > > tell the whole story. It's a 1:1:1 mapping between xe_engine,
> > > gpu_scheduler, and GuC FW engine.  Why make it a 1:something:1 mapping?
> > > Why is that better?
> > 
> > As I have stated before, what I think what would fit well for Xe is one
> > drm_scheduler per engine class. In specific terms on our current hardware,
> > one drm scheduler instance for render, compute, blitter, video and video
> > enhance. Userspace contexts remain scheduler entities.
> >
> 
> I disagree.
>  
> > That way you avoid the whole kthread/kworker story and you have it actually
> > use the entity picking code in the scheduler, which may be useful when the
> > backend is congested.
> >
> 
> In practice the backend shouldn't be congested but if it is a mutex
> provides fairness probably better than using a shared scheduler. Also
> what you are suggesting doesn't make sense at all as the congestion is
> per-GT, so if anything we should use 1 scheduler per-GT not per engine
> class.
>  
> > Yes you have to solve the out of order problem so in my mind that is
> > something to discuss. What the problem actually is (just TDR?), how tricky
> > and why etc.
> >
> 
> Cleanup of jobs, TDR, replaying jobs, etc... It has decent amount of
> impact.
>  
> > And yes you lose the handy LRCA ring buffer size management so you'd have to
> > make those entities not runnable in some other way.
> >
> 
> Also we lose our preempt fence implemenation too. Again I don't see how
> the design you are suggesting is a win.
>  
> > Regarding the argument you raise below - would any of that make the frontend
> > / backend separation worse and why? Do you think it is less natural? If
> > neither is true then all remains is that it appears extra work to support
> > out of order completion of entities has been discounted in favour of an easy
> > but IMO inelegant option.
> > 
> > > There are two places where this 1:1:1 mapping is causing problems:
> > > 
> > >   1. It creates lots of kthreads. This is what this patch is trying to
> > > solve. IDK if it's solving it the best way but that's the goal.
> > > 
> > >   2. There are a far more limited number of communication queues between
> > > the kernel and GuC for more meta things like pausing and resuming
> > > queues, getting events back from GuC, etc. Unless we're in a weird
> > > pressure scenario, the amount of traffic on this queue should be low so
> > > we can probably just have one per physical device.  The vast majority of
> > > kernel -> GuC communication should be on the individual FW queue rings
> > > and maybe smashing in-memory doorbells.
> > 
> > I don't follow your terminology here. I suppose you are talking about global
> > GuC CT and context ringbuffers. If so then isn't "far more limited" actually
> > one?
> > 
> 
> We have 1 GuC GT per-GT.
> 
> Matt
> 
> > Regards,
> > 
> > Tvrtko
> > 
> > > Doing out-of-order completion sort-of solves the 1 but does nothing for
> > > 2 and actually makes managing FW queues harder because we no longer have
> > > built-in rate limiting.  Seems like a net loss to me.
> > > 
> > >      >>> Second option perhaps to split out the drm_sched code into
> > >     parts which would
> > >      >>> lend themselves more to "pick and choose" of its functionalities.
> > >      >>> Specifically, Xe wants frontend dependency tracking, but not
> > >     any scheduling
> > >      >>> really (neither least busy drm_sched, neither FIFO/RQ entity
> > >     picking), so
> > >      >>> even having all these data structures in memory is a waste.
> > >      >>>
> > >      >>
> > >      >> I don't think that we are wasting memory is a very good argument for
> > >      >> making intrusive changes to the DRM scheduler.
> > > 
> > > 
> > > Worse than that, I think the "we could split it up" kind-of misses the
> > > point of the way Xe is using drm/scheduler.  It's not just about
> > > re-using a tiny bit of dependency tracking code.  Using the scheduler in
> > > this way provides a clean separation between front-end and back-end.
> > > The job of the userspace-facing ioctl code is to shove things on the
> > > scheduler.  The job of the run_job callback is to encode the job into
> > > the FW queue format, stick it in the FW queue ring, and maybe smash a
> > > doorbell.  Everything else happens in terms of managing those queues
> > > side-band.  The gpu_scheduler code manages the front-end queues and Xe
> > > manages the FW queues via the Kernel <-> GuC communication rings.  From
> > > a high level, this is a really clean design.  There are potentially some
> > > sticky bits around the dual-use of dma_fence for scheduling and memory
> > > management but none of those are solved by breaking the DRM scheduler
> > > into chunks or getting rid of the 1:1:1 mapping.
> > > 
> > > If we split it out, we're basically asking the driver to implement a
> > > bunch of kthread or workqueue stuff, all the ring rate-limiting, etc.
> > > It may not be all that much code but also, why?  To save a few bytes of
> > > memory per engine?  Each engine already has 32K(ish) worth of context
> > > state and a similar size ring to communicate with the FW.  No one is
> > > going to notice an extra CPU data structure.
> > > 
> > > I'm not seeing a solid argument against the 1:1:1 design here other than
> > > that it doesn't seem like the way DRM scheduler was intended to be
> > > used.  I won't argue that.  It's not.  But it is a fairly natural way to
> > > take advantage of the benefits the DRM scheduler does provide while also
> > > mapping it to hardware that was designed for userspace direct-to-FW
> > > submit.
> > > 
> > > --Jason
> > > 
> > >      >>> With the first option then the end result could be drm_sched
> > >     per engine
> > >      >>> class (hardware view), which I think fits with the GuC model.
> > >     Give all
> > >      >>> schedulable contexts (entities) to the GuC and then mostly
> > >     forget about
> > >      >>> them. Timeslicing and re-ordering and all happens transparently
> > >     to the
> > >      >>> kernel from that point until completion.
> > >      >>>
> > >      >>
> > >      >> Out-of-order problem still exists here.
> > >      >>
> > >      >>> Or with the second option you would build on some smaller
> > >     refactored
> > >      >>> sub-components of drm_sched, by maybe splitting the dependency
> > >     tracking from
> > >      >>> scheduling (RR/FIFO entity picking code).
> > >      >>>
> > >      >>> Second option is especially a bit vague and I haven't thought
> > >     about the
> > >      >>> required mechanics, but it just appeared too obvious the
> > >     proposed design has
> > >      >>> a bit too much impedance mismatch.
> > >      >>>
> > >      >>
> > >      >> IMO ROI on this is low and again lets see what Boris comes up with.
> > >      >>
> > >      >> Matt
> > >      >>
> > >      >>> Oh and as a side note, when I went into the drm_sched code base
> > >     to remind
> > >      >>> myself how things worked, it is quite easy to find some FIXME
> > >     comments which
> > >      >>> suggest people working on it are unsure of locking desing there
> > >     and such. So
> > >      >>> perhaps that all needs cleanup too, I mean would benefit from
> > >      >>> refactoring/improving work as brainstormed above anyway.
> > >      >>>
> > >      >>> Regards,
> > >      >>>
> > >      >>> Tvrtko
> > >