[Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

Wed Jan 11 22:56:44 UTC 2023

On Wed, Jan 11, 2023 at 4:32 PM Matthew Brost <matthew.brost at intel.com>
wrote:

> On Wed, Jan 11, 2023 at 04:18:01PM -0600, Jason Ekstrand wrote:
> > On Wed, Jan 11, 2023 at 2:50 AM Tvrtko Ursulin <
> > tvrtko.ursulin at linux.intel.com> wrote:
> >
> > >
> > > On 10/01/2023 14:08, Jason Ekstrand wrote:
> > > > On Tue, Jan 10, 2023 at 5:28 AM Tvrtko Ursulin
> > > > <tvrtko.ursulin at linux.intel.com <mailto:
> tvrtko.ursulin at linux.intel.com>>
> > >
> > > > wrote:
> > > >
> > > >
> > > >
> > > >     On 09/01/2023 17:27, Jason Ekstrand wrote:
> > > >
> > > >     [snip]
> > > >
> > > >      >      >>> AFAICT it proposes to have 1:1 between *userspace*
> > > created
> > > >      >     contexts (per
> > > >      >      >>> context _and_ engine) and drm_sched. I am not sure
> > > avoiding
> > > >      >     invasive changes
> > > >      >      >>> to the shared code is in the spirit of the overall
> idea
> > > >     and instead
> > > >      >      >>> opportunity should be used to look at way to
> > > >     refactor/improve
> > > >      >     drm_sched.
> > > >      >
> > > >      >
> > > >      > Maybe?  I'm not convinced that what Xe is doing is an abuse at
> > > >     all or
> > > >      > really needs to drive a re-factor.  (More on that later.)
> > > >     There's only
> > > >      > one real issue which is that it fires off potentially a lot of
> > > >     kthreads.
> > > >      > Even that's not that bad given that kthreads are pretty light
> and
> > > >     you're
> > > >      > not likely to have more kthreads than userspace threads which
> are
> > > >     much
> > > >      > heavier.  Not ideal, but not the end of the world either.
> > > >     Definitely
> > > >      > something we can/should optimize but if we went through with
> Xe
> > > >     without
> > > >      > this patch, it would probably be mostly ok.
> > > >      >
> > > >      >      >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > > >      >      >>
> > > >      >      >> I'm not really prepared to make large changes to DRM
> > > >     scheduler
> > > >      >     at the
> > > >      >      >> moment for Xe as they are not really required nor does
> > > Boris
> > > >      >     seem they
> > > >      >      >> will be required for his work either. I am interested
> to
> > > see
> > > >      >     what Boris
> > > >      >      >> comes up with.
> > > >      >      >>
> > > >      >      >>> Even on the low level, the idea to replace drm_sched
> > > threads
> > > >      >     with workers
> > > >      >      >>> has a few problems.
> > > >      >      >>>
> > > >      >      >>> To start with, the pattern of:
> > > >      >      >>>
> > > >      >      >>>    while (not_stopped) {
> > > >      >      >>>     keep picking jobs
> > > >      >      >>>    }
> > > >      >      >>>
> > > >      >      >>> Feels fundamentally in disagreement with workers
> (while
> > > >      >     obviously fits
> > > >      >      >>> perfectly with the current kthread design).
> > > >      >      >>
> > > >      >      >> The while loop breaks and worker exists if no jobs are
> > > ready.
> > > >      >
> > > >      >
> > > >      > I'm not very familiar with workqueues. What are you saying
> would
> > > fit
> > > >      > better? One scheduling job per work item rather than one big
> work
> > > >     item
> > > >      > which handles all available jobs?
> > > >
> > > >     Yes and no, it indeed IMO does not fit to have a work item which
> is
> > > >     potentially unbound in runtime. But it is a bit moot conceptual
> > > >     mismatch
> > > >     because it is a worst case / theoretical, and I think due more
> > > >     fundamental concerns.
> > > >
> > > >     If we have to go back to the low level side of things, I've
> picked
> > > this
> > > >     random spot to consolidate what I have already mentioned and
> perhaps
> > > >     expand.
> > > >
> > > >     To start with, let me pull out some thoughts from workqueue.rst:
> > > >
> > > >     """
> > > >     Generally, work items are not expected to hog a CPU and consume
> many
> > > >     cycles. That means maintaining just enough concurrency to prevent
> > > work
> > > >     processing from stalling should be optimal.
> > > >     """
> > > >
> > > >     For unbound queues:
> > > >     """
> > > >     The responsibility of regulating concurrency level is on the
> users.
> > > >     """
> > > >
> > > >     Given the unbound queues will be spawned on demand to service all
> > > >     queued
> > > >     work items (more interesting when mixing up with the
> > > >     system_unbound_wq),
> > > >     in the proposed design the number of instantiated worker threads
> does
> > > >     not correspond to the number of user threads (as you have
> elsewhere
> > > >     stated), but pessimistically to the number of active user
> contexts.
> > > >
> > > >
> > > > Those are pretty much the same in practice.  Rather, user threads is
> > > > typically an upper bound on the number of contexts.  Yes, a single
> user
> > > > thread could have a bunch of contexts but basically nothing does that
> > > > except IGT.  In real-world usage, it's at most one context per user
> > > thread.
> > >
> > > Typically is the key here. But I am not sure it is good enough.
> Consider
> > > this example - Intel Flex 170:
> > >
> > >   * Delivers up to 36 streams 1080p60 transcode throughput per card.
> > >   * When scaled to 10 cards in a 4U server configuration, it can
> support
> > > up to 360 streams of HEVC/HEVC 1080p60 transcode throughput.
> > >
> >
> > I had a feeling it was going to be media.... 😅
> >
>
> Yea wondering the media UMD can be rewritten to use less xe_engines, it
> is massive rewrite for VM bind + no implicit dependencies so let's just
> pile on some more work?
>

It could probably use fewer than it does today.  It currently creates and
throws away contexts like crazy, or did last I looked at it.  However, the
nature of media encode is that it often spreads across two or three
different types of engines.  There's not much you can do to change that.

> >
> > > One transcode stream from my experience typically is 3-4 GPU contexts
> > > (buffer travels from vcs -> rcs -> vcs, maybe vecs) used from a single
> > > CPU thread. 4 contexts * 36 streams = 144 active contexts. Multiply by
> > > 60fps = 8640 jobs submitted and completed per second.
> > >
> > > 144 active contexts in the proposed scheme means possibly means 144
> > > kernel worker threads spawned (driven by 36 transcode CPU threads). (I
> > > don't think the pools would scale down given all are constantly pinged
> > > at 60fps.)
> > >
> > > And then each of 144 threads goes to grab the single GuC CT mutex.
> First
> > > threads are being made schedulable, then put to sleep as mutex
> > > contention is hit, then woken again as mutexes are getting released,
> > > rinse, repeat.
> > >
> >
> > Why is every submission grabbing the GuC CT mutex?  I've not read the GuC
> > back-end yet but I was under the impression that most run_job() would be
> > just shoving another packet into a ring buffer.  If we have to send the
> GuC
> > a message on the control ring every single time we submit a job, that's
> > pretty horrible.
> >
>
> Run job writes the ring buffer and moves the tail as the first step (no
> lock required). Next it needs to tell the GuC the xe_engine LRC tail has
> moved, this is done from a single Host to GuC channel which is circular
> buffer, the writing of the channel protected by the mutex. There are
> little more nuances too but in practice there is always space in the
> channel so the time mutex needs to held is really, really small
> (check cached credits, write 3 dwords in payload, write 1 dword to move
> tail). I also believe mutexes in Linux are hybrid where they spin for a
> little bit before sleeping and certainly if there is space in the
> channel we shouldn't sleep mutex contention.
>

Ok, that makes sense.  It's maybe a bit clunky and it'd be nice if we had
some way to batch things up a bit so we only have to poke the GuC channel
once for every batch of things rather than once per job.  That's maybe
something we can look into as a future improvement; not fundamental.

Generally, though, it sounds like contention could be a real problem if we
end up ping-ponging that lock between cores.  It's going to depend on how
much work it takes to get the next ready thing vs. the cost of that
atomic.  But, also, anything we do is going to potentially run into
contention problems.  *shrug*  If we were going to go for
one-per-HW-engine, we may as well go one-per-device and then we wouldn't
need the lock.  Off the top of my head, that doesn't sound great either but
IDK.

> As far as this being horrible, well didn't design the GuC and this how
> it is implemented for KMD based submission. We also have 256 doorbells
> so we wouldn't need a lock but I think are other issues with that design
> too which need to be worked out in the Xe2 / Xe3 timeframe.
>

Yeah, not blaming you.  Just surprised, that's all.  How does it work for
userspace submission?  What would it look like if the kernel emulated
userspace submission?  Is that even possible?

What are these doorbell things?  How do they play into it?

> Also if you see my follow up response Xe is ~33k execs per second with
> the current implementation on a 8 core (or maybe 8 thread) TGL which
> seems to fine to me.
>

33k exec/sec is about 500/frame which should be fine.  500 is a lot for a
single frame.  I typically tell game devs to shoot for dozens per frame.
The important thing is that it stays low even with hundreds of memory
objects bound.  (Xe should be just fine there.)

--Jason

> Matt
>
> > --Jason
> >
> >
> > (And yes this backend contention is there regardless of 1:1:1, it would
> > > require a different re-design to solve that. But it is just a question
> > > whether there are 144 contending threads, or just 6 with the thread per
> > > engine class scheme.)
> > >
> > > Then multiply all by 10 for a 4U server use case and you get 1440
> worker
> > > kthreads, yes 10 more CT locks, but contending on how many CPU cores?
> > > Just so they can grab a timeslice and maybe content on a mutex as the
> > > next step.
> > >
> > > This example is where it would hurt on large systems. Imagine only an
> > > even wider media transcode card...
> > >
> > > Second example is only a single engine class used (3d desktop?) but
> with
> > > a bunch of not-runnable jobs queued and waiting on a fence to signal.
> > > Implicit or explicit dependencies doesn't matter. Then the fence
> signals
> > > and call backs run. N work items get scheduled, but they all submit to
> > > the same HW engine. So we end up with:
> > >
> > >          /-- wi1 --\
> > >         / ..     .. \
> > >   cb --+---  wi.. ---+-- rq1 -- .. -- rqN
> > >         \ ..    ..  /
> > >          \-- wiN --/
> > >
> > >
> > > All that we have achieved is waking up N CPUs to contend on the same
> > > lock and effectively insert the job into the same single HW queue. I
> > > don't see any positives there.
> > >
> > > This example I think can particularly hurt small / low power devices
> > > because of needless waking up of many cores for no benefit. Granted, I
> > > don't have a good feel on how common this pattern is in practice.
> > >
> > > >
> > > >     That
> > > >     is the number which drives the maximum number of not-runnable
> jobs
> > > that
> > > >     can become runnable at once, and hence spawn that many work
> items,
> > > and
> > > >     in turn unbound worker threads.
> > > >
> > > >     Several problems there.
> > > >
> > > >     It is fundamentally pointless to have potentially that many more
> > > >     threads
> > > >     than the number of CPU cores - it simply creates a scheduling
> storm.
> > > >
> > > >     Unbound workers have no CPU / cache locality either and no
> connection
> > > >     with the CPU scheduler to optimize scheduling patterns. This may
> > > matter
> > > >     either on large systems or on small ones. Whereas the current
> design
> > > >     allows for scheduler to notice userspace CPU thread keeps waking
> up
> > > the
> > > >     same drm scheduler kernel thread, and so it can keep them on the
> same
> > > >     CPU, the unbound workers lose that ability and so 2nd CPU might
> be
> > > >     getting woken up from low sleep for every submission.
> > > >
> > > >     Hence, apart from being a bit of a impedance mismatch, the
> proposal
> > > has
> > > >     the potential to change performance and power patterns and both
> large
> > > >     and small machines.
> > > >
> > > >
> > > > Ok, thanks for explaining the issue you're seeing in more detail.
> Yes,
> > > > deferred kwork does appear to mismatch somewhat with what the
> scheduler
> > > > needs or at least how it's worked in the past.  How much impact will
> > > > that mismatch have?  Unclear.
> > > >
> > > >      >      >>> Secondly, it probably demands separate workers (not
> > > >     optional),
> > > >      >     otherwise
> > > >      >      >>> behaviour of shared workqueues has either the
> potential
> > > to
> > > >      >     explode number
> > > >      >      >>> kernel threads anyway, or add latency.
> > > >      >      >>>
> > > >      >      >>
> > > >      >      >> Right now the system_unbound_wq is used which does
> have a
> > > >     limit
> > > >      >     on the
> > > >      >      >> number of threads, right? I do have a FIXME to allow a
> > > >     worker to be
> > > >      >      >> passed in similar to TDR.
> > > >      >      >>
> > > >      >      >> WRT to latency, the 1:1 ratio could actually have
> lower
> > > >     latency
> > > >      >     as 2 GPU
> > > >      >      >> schedulers can be pushing jobs into the backend /
> > > cleaning up
> > > >      >     jobs in
> > > >      >      >> parallel.
> > > >      >      >>
> > > >      >      >
> > > >      >      > Thought of one more point here where why in Xe we
> > > >     absolutely want
> > > >      >     a 1 to
> > > >      >      > 1 ratio between entity and scheduler - the way we
> implement
> > > >      >     timeslicing
> > > >      >      > for preempt fences.
> > > >      >      >
> > > >      >      > Let me try to explain.
> > > >      >      >
> > > >      >      > Preempt fences are implemented via the generic
> messaging
> > > >      >     interface [1]
> > > >      >      > with suspend / resume messages. If a suspend messages
> is
> > > >     received to
> > > >      >      > soon after calling resume (this is per entity) we
> simply
> > > >     sleep in the
> > > >      >      > suspend call thus giving the entity a timeslice. This
> > > >     completely
> > > >      >     falls
> > > >      >      > apart with a many to 1 relationship as now a entity
> > > >     waiting for a
> > > >      >      > timeslice blocks the other entities. Could we work
> aroudn
> > > >     this,
> > > >      >     sure but
> > > >      >      > just another bunch of code we'd have to add in Xe.
> Being to
> > > >      >     freely sleep
> > > >      >      > in backend without affecting other entities is really,
> > > really
> > > >      >     nice IMO
> > > >      >      > and I bet Xe isn't the only driver that is going to
> feel
> > > >     this way.
> > > >      >      >
> > > >      >      > Last thing I'll say regardless of how anyone feels
> about
> > > >     Xe using
> > > >      >     a 1 to
> > > >      >      > 1 relationship this patch IMO makes sense as I hope we
> can
> > > all
> > > >      >     agree a
> > > >      >      > workqueue scales better than kthreads.
> > > >      >
> > > >      >     I don't know for sure what will scale better and for what
> use
> > > >     case,
> > > >      >     combination of CPU cores vs number of GPU engines to keep
> > > >     busy vs other
> > > >      >     system activity. But I wager someone is bound to ask for
> some
> > > >      >     numbers to
> > > >      >     make sure proposal is not negatively affecting any other
> > > drivers.
> > > >      >
> > > >      >
> > > >      > Then let them ask.  Waving your hands vaguely in the
> direction of
> > > >     the
> > > >      > rest of DRM and saying "Uh, someone (not me) might object" is
> > > >     profoundly
> > > >      > unhelpful.  Sure, someone might.  That's why it's on
> dri-devel.
> > > >     If you
> > > >      > think there's someone in particular who might have a useful
> > > >     opinion on
> > > >      > this, throw them in the CC so they don't miss the e-mail
> thread.
> > > >      >
> > > >      > Or are you asking for numbers?  If so, what numbers are you
> > > >     asking for?
> > > >
> > > >     It was a heads up to the Xe team in case people weren't
> appreciating
> > > >     how
> > > >     the proposed change has the potential influence power and
> performance
> > > >     across the board. And nothing in the follow up discussion made me
> > > think
> > > >     it was considered so I don't think it was redundant to raise it.
> > > >
> > > >     In my experience it is typical that such core changes come with
> some
> > > >     numbers. Which is in case of drm scheduler is tricky and probably
> > > >     requires explicitly asking everyone to test (rather than count on
> > > >     "don't
> > > >     miss the email thread"). Real products can fail to ship due ten
> mW
> > > here
> > > >     or there. Like suddenly an extra core prevented from getting into
> > > deep
> > > >     sleep.
> > > >
> > > >     If that was "profoundly unhelpful" so be it.
> > > >
> > > >
> > > > With your above explanation, it makes more sense what you're asking.
> > > > It's still not something Matt is likely to be able to provide on his
> > > > own.  We need to tag some other folks and ask them to test it out.
> We
> > > > could play around a bit with it on Xe but it's not exactly production
> > > > grade yet and is going to hit this differently from most.  Likely
> > > > candidates are probably AMD and Freedreno.
> > >
> > > Whoever is setup to check out power and performance would be good to
> > > give it a spin, yes.
> > >
> > > PS. I don't think I was asking Matt to test with other devices. To
> start
> > > with I think Xe is a team effort. I was asking for more background on
> > > the design decision since patch 4/20 does not say anything on that
> > > angle, nor later in the thread it was IMO sufficiently addressed.
> > >
> > > >      > Also, If we're talking about a design that might paint us
> into an
> > > >      > Intel-HW-specific hole, that would be one thing.  But we're
> not.
> > > >     We're
> > > >      > talking about switching which kernel threading/task mechanism
> to
> > > >     use for
> > > >      > what's really a very generic problem.  The core Xe design
> works
> > > >     without
> > > >      > this patch (just with more kthreads).  If we land this patch
> or
> > > >      > something like it and get it wrong and it causes a performance
> > > >     problem
> > > >      > for someone down the line, we can revisit it.
> > > >
> > > >     For some definition of "it works" - I really wouldn't suggest
> > > >     shipping a
> > > >     kthread per user context at any point.
> > > >
> > > >
> > > > You have yet to elaborate on why. What resources is it consuming
> that's
> > > > going to be a problem? Are you anticipating CPU affinity problems? Or
> > > > does it just seem wasteful?
> > >
> > > Well I don't know, commit message says the approach does not scale. :)
> > >
> > > > I think I largely agree that it's probably unnecessary/wasteful but
> > > > reducing the number of kthreads seems like a tractable problem to
> solve
> > > > regardless of where we put the gpu_scheduler object.  Is this the
> right
> > > > solution?  Maybe not.  It was also proposed at one point that we
> could
> > > > split the scheduler into two pieces: A scheduler which owns the
> kthread,
> > > > and a back-end which targets some HW ring thing where you can have
> > > > multiple back-ends per scheduler.  That's certainly more invasive
> from a
> > > > DRM scheduler internal API PoV but would solve the kthread problem
> in a
> > > > way that's more similar to what we have now.
> > > >
> > > >      >     In any case that's a low level question caused by the high
> > > >     level design
> > > >      >     decision. So I'd think first focus on the high level -
> which
> > > >     is the 1:1
> > > >      >     mapping of entity to scheduler instance proposal.
> > > >      >
> > > >      >     Fundamentally it will be up to the DRM maintainers and the
> > > >     community to
> > > >      >     bless your approach. And it is important to stress 1:1 is
> > > about
> > > >      >     userspace contexts, so I believe unlike any other current
> > > >     scheduler
> > > >      >     user. And also important to stress this effectively does
> not
> > > >     make Xe
> > > >      >     _really_ use the scheduler that much.
> > > >      >
> > > >      >
> > > >      > I don't think this makes Xe nearly as much of a one-off as you
> > > >     think it
> > > >      > does.  I've already told the Asahi team working on Apple M1/2
> > > >     hardware
> > > >      > to do it this way and it seems to be a pretty good mapping for
> > > >     them. I
> > > >      > believe this is roughly the plan for nouveau as well.  It's
> not
> > > >     the way
> > > >      > it currently works for anyone because most other groups aren't
> > > >     doing FW
> > > >      > scheduling yet.  In the world of FW scheduling and hardware
> > > >     designed to
> > > >      > support userspace direct-to-FW submit, I think the design
> makes
> > > >     perfect
> > > >      > sense (see below) and I expect we'll see more drivers move in
> this
> > > >      > direction as those drivers evolve.  (AMD is doing some
> customish
> > > >     thing
> > > >      > for how with gpu_scheduler on the front-end somehow. I've not
> dug
> > > >     into
> > > >      > those details.)
> > > >      >
> > > >      >     I can only offer my opinion, which is that the two options
> > > >     mentioned in
> > > >      >     this thread (either improve drm scheduler to cope with
> what is
> > > >      >     required,
> > > >      >     or split up the code so you can use just the parts of
> > > >     drm_sched which
> > > >      >     you want - which is frontend dependency tracking)
> shouldn't
> > > be so
> > > >      >     readily dismissed, given how I think the idea was for the
> new
> > > >     driver to
> > > >      >     work less in a silo and more in the community (not do
> kludges
> > > to
> > > >      >     workaround stuff because it is thought to be too hard to
> > > >     improve common
> > > >      >     code), but fundamentally, "goto previous paragraph" for
> what
> > > I am
> > > >      >     concerned.
> > > >      >
> > > >      >
> > > >      > Meta comment:  It appears as if you're falling into the
> standard
> > > >     i915
> > > >      > team trap of having an internal discussion about what the
> > > community
> > > >      > discussion might look like instead of actually having the
> > > community
> > > >      > discussion.  If you are seriously concerned about interactions
> > > with
> > > >      > other drivers or whether or setting common direction, the
> right
> > > >     way to
> > > >      > do that is to break a patch or two out into a separate RFC
> series
> > > >     and
> > > >      > tag a handful of driver maintainers.  Trying to predict the
> > > >     questions
> > > >      > other people might ask is pointless. Cc them and asking for
> their
> > > >     input
> > > >      > instead.
> > > >
> > > >     I don't follow you here. It's not an internal discussion - I am
> > > raising
> > > >     my concerns on the design publicly. I am supposed to write a
> patch to
> > > >     show something, but am allowed to comment on a RFC series?
> > > >
> > > >
> > > > I may have misread your tone a bit.  It felt a bit like too many
> > > > discussions I've had in the past where people are trying to predict
> what
> > > > others will say instead of just asking them.  Reading it again, I was
> > > > probably jumping to conclusions a bit.  Sorry about that.
> > >
> > > Okay no problem, thanks. In any case we don't have to keep discussing
> > > it, since I wrote one or two emails ago it is fundamentally on the
> > > maintainers and community to ack the approach. I only felt like RFC did
> > > not explain the potential downsides sufficiently so I wanted to probe
> > > that area a bit.
> > >
> > > >     It is "drm/sched: Convert drm scheduler to use a work queue
> rather
> > > than
> > > >     kthread" which should have Cc-ed _everyone_ who use drm
> scheduler.
> > > >
> > > >
> > > > Yeah, it probably should have.  I think that's mostly what I've been
> > > > trying to say.
> > > >
> > > >      >
> > > >      >     Regards,
> > > >      >
> > > >      >     Tvrtko
> > > >      >
> > > >      >     P.S. And as a related side note, there are more areas
> where
> > > >     drm_sched
> > > >      >     could be improved, like for instance priority handling.
> > > >      >     Take a look at msm_submitqueue_create /
> > > >     msm_gpu_convert_priority /
> > > >      >     get_sched_entity to see how msm works around the drm_sched
> > > >     hardcoded
> > > >      >     limit of available priority levels, in order to avoid
> having
> > > >     to leave a
> > > >      >     hw capability unused. I suspect msm would be happier if
> they
> > > >     could have
> > > >      >     all priority levels equal in terms of whether they apply
> only
> > > >     at the
> > > >      >     frontend level or completely throughout the pipeline.
> > > >      >
> > > >      >      > [1]
> > > >      >
> > > >
> https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> > > >     <
> https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> > > >
> > > >      >
> > > >       <
> > > https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1 <
> > > https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1>>
> > > >      >      >
> > > >      >      >>> What would be interesting to learn is whether the
> option
> > > of
> > > >      >     refactoring
> > > >      >      >>> drm_sched to deal with out of order completion was
> > > >     considered
> > > >      >     and what were
> > > >      >      >>> the conclusions.
> > > >      >      >>>
> > > >      >      >>
> > > >      >      >> I coded this up a while back when trying to convert
> the
> > > >     i915 to
> > > >      >     the DRM
> > > >      >      >> scheduler it isn't all that hard either. The free flow
> > > >     control
> > > >      >     on the
> > > >      >      >> ring (e.g. set job limit == SIZE OF RING / MAX JOB
> SIZE)
> > > is
> > > >      >     really what
> > > >      >      >> sold me on the this design.
> > > >      >
> > > >      >
> > > >      > You're not the only one to suggest supporting out-of-order
> > > >     completion.
> > > >      > However, it's tricky and breaks a lot of internal assumptions
> of
> > > the
> > > >      > scheduler. It also reduces functionality a bit because it can
> no
> > > >     longer
> > > >      > automatically rate-limit HW/FW queues which are often
> > > >     fixed-size.  (Ok,
> > > >      > yes, it probably could but it becomes a substantially harder
> > > >     problem.)
> > > >      >
> > > >      > It also seems like a worse mapping to me.  The goal here is to
> > > turn
> > > >      > submissions on a userspace-facing engine/queue into
> submissions
> > > >     to a FW
> > > >      > queue submissions, sorting out any dma_fence dependencies.
> Matt's
> > > >      > description of saying this is a 1:1 mapping between
> sched/entity
> > > >     doesn't
> > > >      > tell the whole story. It's a 1:1:1 mapping between xe_engine,
> > > >      > gpu_scheduler, and GuC FW engine.  Why make it a 1:something:1
> > > >     mapping?
> > > >      > Why is that better?
> > > >
> > > >     As I have stated before, what I think what would fit well for Xe
> is
> > > one
> > > >     drm_scheduler per engine class. In specific terms on our current
> > > >     hardware, one drm scheduler instance for render, compute,
> blitter,
> > > >     video
> > > >     and video enhance. Userspace contexts remain scheduler entities.
> > > >
> > > >
> > > > And this is where we fairly strongly disagree.  More in a bit.
> > > >
> > > >     That way you avoid the whole kthread/kworker story and you have
> it
> > > >     actually use the entity picking code in the scheduler, which may
> be
> > > >     useful when the backend is congested.
> > > >
> > > >
> > > > What back-end congestion are you referring to here?  Running out of
> FW
> > > > queue IDs?  Something else?
> > >
> > > CT channel, number of context ids.
> > >
> > > >
> > > >     Yes you have to solve the out of order problem so in my mind
> that is
> > > >     something to discuss. What the problem actually is (just TDR?),
> how
> > > >     tricky and why etc.
> > > >
> > > >     And yes you lose the handy LRCA ring buffer size management so
> you'd
> > > >     have to make those entities not runnable in some other way.
> > > >
> > > >     Regarding the argument you raise below - would any of that make
> the
> > > >     frontend / backend separation worse and why? Do you think it is
> less
> > > >     natural? If neither is true then all remains is that it appears
> extra
> > > >     work to support out of order completion of entities has been
> > > discounted
> > > >     in favour of an easy but IMO inelegant option.
> > > >
> > > >
> > > > Broadly speaking, the kernel needs to stop thinking about GPU
> scheduling
> > > > in terms of scheduling jobs and start thinking in terms of scheduling
> > > > contexts/engines.  There is still some need for scheduling individual
> > > > jobs but that is only for the purpose of delaying them as needed to
> > > > resolve dma_fence dependencies.  Once dependencies are resolved, they
> > > > get shoved onto the context/engine queue and from there the kernel
> only
> > > > really manages whole contexts/engines.  This is a major architectural
> > > > shift, entirely different from the way i915 scheduling works.  It's
> also
> > > > different from the historical usage of DRM scheduler which I think is
> > > > why this all looks a bit funny.
> > > >
> > > > To justify this architectural shift, let's look at where we're
> headed.
> > > > In the glorious future...
> > > >
> > > >   1. Userspace submits directly to firmware queues.  The kernel has
> no
> > > > visibility whatsoever into individual jobs.  At most it can
> pause/resume
> > > > FW contexts as needed to handle eviction and memory management.
> > > >
> > > >   2. Because of 1, apart from handing out the FW queue IDs at the
> > > > beginning, the kernel can't really juggle them that much.  Depending
> on
> > > > FW design, it may be able to pause a client, give its IDs to another,
> > > > and then resume it later when IDs free up.  What it's not doing is
> > > > juggling IDs on a job-by-job basis like i915 currently is.
> > > >
> > > >   3. Long-running compute jobs may not complete for days.  This means
> > > > that memory management needs to happen in terms of pause/resume of
> > > > entire contexts/engines using the memory rather than based on waiting
> > > > for individual jobs to complete or pausing individual jobs until the
> > > > memory is available.
> > > >
> > > >   4. Synchronization happens via userspace memory fences (UMF) and
> the
> > > > kernel is mostly unaware of most dependencies and when a
> context/engine
> > > > is or is not runnable.  Instead, it keeps as many of them minimally
> > > > active (memory is available, even if it's in system RAM) as possible
> and
> > > > lets the FW sort out dependencies.  (There may need to be some
> facility
> > > > for sleeping a context until a memory change similar to futex() or
> > > > poll() for userspace threads.  There are some details TBD.)
> > > >
> > > > Are there potential problems that will need to be solved here?
> Yes.  Is
> > > > it a good design?  Well, Microsoft has been living in this future for
> > > > half a decade or better and it's working quite well for them.  It's
> also
> > > > the way all modern game consoles work.  It really is just Linux
> that's
> > > > stuck with the same old job model we've had since the monumental
> shift
> > > > to DRI2.
> > > >
> > > > To that end, one of the core goals of the Xe project was to make the
> > > > driver internally behave as close to the above model as possible
> while
> > > > keeping the old-school job model as a very thin layer on top.  As the
> > > > broader ecosystem problems (window-system support for UMF, for
> instance)
> > > > are solved, that layer can be peeled back.  The core driver will
> already
> > > > be ready for it.
> > > >
> > > > To that end, the point of the DRM scheduler in Xe isn't to schedule
> > > > jobs.  It's to resolve syncobj and dma-buf implicit sync dependencies
> > > > and stuff jobs into their respective context/engine queue once
> they're
> > > > ready.  All the actual scheduling happens in firmware and any
> scheduling
> > > > the kernel does to deal with contention, oversubscriptions, too many
> > > > contexts, etc. is between contexts/engines, not individual jobs.
> Sure,
> > > > the individual job visibility is nice, but if we design around it,
> we'll
> > > > never get to the glorious future.
> > > >
> > > > I really need to turn the above (with a bit more detail) into a blog
> > > > post.... Maybe I'll do that this week.
> > > >
> > > > In any case, I hope that provides more insight into why Xe is
> designed
> > > > the way it is and why I'm pushing back so hard on trying to make it
> more
> > > > of a "classic" driver as far as scheduling is concerned.  Are there
> > > > potential problems here?  Yes, that's why Xe has been labeled a
> > > > prototype.  Are such radical changes necessary to get to said
> glorious
> > > > future?  Yes, I think they are.  Will it be worth it?  I believe so.
> > >
> > > Right, that's all solid I think. My takeaway is that frontend priority
> > > sorting and that stuff isn't needed and that is okay. And that there
> are
> > > multiple options to maybe improve drm scheduler, like the fore
> mentioned
> > > making it deal with out of order, or split into functional components,
> > > or split frontend/backend what you suggested. For most of them cost vs
> > > benefit is more or less not completely clear, neither how much effort
> > > was invested to look into them.
> > >
> > > One thing I missed from this explanation is how drm_scheduler per
> engine
> > > class interferes with the high level concepts. And I did not manage to
> > > pick up on what exactly is the TDR problem in that case. Maybe the two
> > > are one and the same.
> > >
> > > Bottom line is I still have the concern that conversion to kworkers has
> > > an opportunity to regress. Possibly more opportunity for some Xe use
> > > cases than to affect other vendors, since they would still be using per
> > > physical engine / queue scheduler instances.
> > >
> > > And to put my money where my mouth is I will try to put testing Xe
> > > inside the full blown ChromeOS environment in my team plans. It would
> > > probably also be beneficial if Xe team could take a look at real world
> > > behaviour of the extreme transcode use cases too. If the stack is ready
> > > for that and all. It would be better to know earlier rather than later
> > > if there is a fundamental issue.
> > >
> > > For the patch at hand, and the cover letter, it certainly feels it
> would
> > > benefit to record the past design discussion had with AMD folks, to
> > > explicitly copy other drivers, and to record the theoretical pros and
> > > cons of threads vs unbound workers as I have tried to highlight them.
> > >
> > > Regards,
> > >
> > > Tvrtko
> > >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/dri-devel/attachments/20230111/77330b48/attachment-0001.htm>