<div dir="ltr"><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jan 11, 2023 at 2:50 AM Tvrtko Ursulin <<a href="mailto:tvrtko.ursulin@linux.intel.com">tvrtko.ursulin@linux.intel.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
> On 10/01/2023 14:08, Jason Ekstrand wrote:
> > On Tue, Jan 10, 2023 at 5:28 AM Tvrtko Ursulin
> > <tvrtko.ursulin@linux.intel.com> wrote:
> >
> > On 09/01/2023 17:27, Jason Ekstrand wrote:
> >
> > [snip]
> >
> > >>> AFAICT it proposes to have 1:1 between *userspace* created
> > >>> contexts (per context _and_ engine) and drm_sched. I am not sure
> > >>> avoiding invasive changes to the shared code is in the spirit of
> > >>> the overall idea and instead the opportunity should be used to
> > >>> look at ways to refactor/improve drm_sched.
> > >
> > > Maybe? I'm not convinced that what Xe is doing is an abuse at all
> > > or really needs to drive a re-factor. (More on that later.)
> > > There's only one real issue which is that it fires off potentially
> > > a lot of kthreads. Even that's not that bad given that kthreads
> > > are pretty light and you're not likely to have more kthreads than
> > > userspace threads which are much heavier. Not ideal, but not the
> > > end of the world either. Definitely something we can/should
> > > optimize but if we went through with Xe without this patch, it
> > > would probably be mostly ok.
> ><br>
> > >> Yes, it is 1:1 *userspace* engines and drm_sched.<br>
> > >><br>
> > >> I'm not really prepared to make large changes to DRM<br>
> scheduler<br>
> > at the<br>
> > >> moment for Xe as they are not really required nor does Boris<br>
> > seem they<br>
> > >> will be required for his work either. I am interested to see<br>
> > what Boris<br>
> > >> comes up with.<br>
> > >><br>
> > >>> Even on the low level, the idea to replace drm_sched threads<br>
> > with workers<br>
> > >>> has a few problems.<br>
> > >>><br>
> > >>> To start with, the pattern of:<br>
> > >>><br>
> > >>> while (not_stopped) {<br>
> > >>> keep picking jobs<br>
> > >>> }<br>
> > >>><br>
> > >>> Feels fundamentally in disagreement with workers (while<br>
> > obviously fits<br>
> > >>> perfectly with the current kthread design).<br>
> > >><br>
> > >> The while loop breaks and worker exists if no jobs are ready.<br>
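
(As an illustrative aside, the exit-when-idle pattern being described
might look roughly like this; the toy_* names are hypothetical
stand-ins, not actual Xe or drm_sched code.)

#include <linux/workqueue.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct toy_job { struct list_head link; };

struct toy_sched {
	struct work_struct work;
	struct list_head ready;	/* jobs whose dependencies resolved */
	spinlock_t lock;
};

static void toy_run_job(struct toy_job *job) { /* push to the ring */ }

/* The work item keeps picking jobs while any are ready, then exits. */
static void toy_sched_work(struct work_struct *w)
{
	struct toy_sched *s = container_of(w, struct toy_sched, work);
	struct toy_job *job;

	for (;;) {
		spin_lock(&s->lock);
		job = list_first_entry_or_null(&s->ready, struct toy_job,
					       link);
		if (job)
			list_del(&job->link);
		spin_unlock(&s->lock);

		if (!job)
			return;	/* no jobs ready: the worker exits */
		toy_run_job(job);
	}
}

/* Whoever makes a job ready re-queues the item, e.g.:
 *	queue_work(system_unbound_wq, &s->work);
 */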
> > >
> > > I'm not very familiar with workqueues. What are you saying would
> > > fit better? One scheduling job per work item rather than one big
> > > work item which handles all available jobs?
> >
> > Yes and no. It indeed IMO does not fit to have a work item which is
> > potentially unbounded in runtime. But that conceptual mismatch is a
> > bit moot because it is a worst case / theoretical issue, and my
> > objection is due more to fundamental concerns.
> >
> > If we have to go back to the low level side of things, I've picked
> > this random spot to consolidate what I have already mentioned and
> > perhaps expand.
> >
> > To start with, let me pull out some thoughts from workqueue.rst:
> >
> > """
> > Generally, work items are not expected to hog a CPU and consume
> > many cycles. That means maintaining just enough concurrency to
> > prevent work processing from stalling should be optimal.
> > """
> >
> > For unbound queues:
> > """
> > The responsibility of regulating concurrency level is on the users.
> > """
> >
> > Given the unbound queues will be spawned on demand to service all
> > queued work items (more interesting when mixing up with the
> > system_unbound_wq), in the proposed design the number of
> > instantiated worker threads does not correspond to the number of
> > user threads (as you have elsewhere stated), but pessimistically to
> > the number of active user contexts.
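
(For illustration, the knob workqueue.rst is pointing at: an unbound
queue's max_active caps how many submit work items run at once,
instead of scaling with the number of active contexts. A sketch only;
the queue name and the cap of 4 are made up.)

#include <linux/workqueue.h>

static struct workqueue_struct *submit_wq;

static int example_submit_wq_init(void)
{
	/* WQ_UNBOUND queue whose max_active (4 here) bounds concurrency,
	 * rather than relying on system_unbound_wq's defaults. */
	submit_wq = alloc_workqueue("toy-submit", WQ_UNBOUND, 4);
	return submit_wq ? 0 : -ENOMEM;
}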
> >
> > Those are pretty much the same in practice. Rather, user threads is
> > typically an upper bound on the number of contexts. Yes, a single
> > user thread could have a bunch of contexts but basically nothing
> > does that except IGT. In real-world usage, it's at most one context
> > per user thread.
>
> Typically is the key here. But I am not sure it is good enough.
> Consider this example - Intel Flex 170:
>
>  * Delivers up to 36 streams of 1080p60 transcode throughput per card.
>  * When scaled to 10 cards in a 4U server configuration, it can
>    support up to 360 streams of HEVC/HEVC 1080p60 transcode
>    throughput.

I had a feeling it was going to be media.... 😅
> One transcode stream from my experience typically is 3-4 GPU contexts
> (the buffer travels from vcs -> rcs -> vcs, maybe vecs) used from a
> single CPU thread. 4 contexts * 36 streams = 144 active contexts.
> Multiply by 60fps = 8640 jobs submitted and completed per second.
>
> 144 active contexts in the proposed scheme possibly means 144 kernel
> worker threads spawned (driven by 36 transcode CPU threads). (I don't
> think the pools would scale down given all are constantly pinged at
> 60fps.)
>
> And then each of the 144 threads goes to grab the single GuC CT
> mutex. First the threads are made schedulable, then put to sleep as
> mutex contention is hit, then woken again as mutexes are getting
> released, rinse, repeat.

Why is every submission grabbing the GuC CT mutex? I've not read the
GuC back-end yet but I was under the impression that most run_job()
would be just shoving another packet into a ring buffer. If we have to
send the GuC a message on the control ring every single time we submit
a job, that's pretty horrible.

--Jason
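
(For reference, a sketch of the model the question assumes: run_job()
writing into a per-context ring and ringing a doorbell, with no global
CT message per submission. Entirely hypothetical; the real GuC
back-end may differ.)

#include <linux/io.h>
#include <linux/types.h>

struct toy_ring {
	u32 *vaddr;		/* CPU mapping of the ring buffer */
	u32 tail;		/* next write offset, in dwords */
	u32 size;		/* ring size in dwords, power of two */
	void __iomem *doorbell;
};

static void toy_submit(struct toy_ring *ring, const u32 *cmds, u32 ndw)
{
	u32 i;

	/* Space is guaranteed by flow control (job limit vs ring size). */
	for (i = 0; i < ndw; i++) {
		ring->vaddr[ring->tail] = cmds[i];
		ring->tail = (ring->tail + 1) & (ring->size - 1);
	}
	writel(ring->tail, ring->doorbell);	/* kick the firmware */
}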
> (And yes this backend contention is there regardless of 1:1:1; it
> would require a different re-design to solve that. But it is just a
> question of whether there are 144 contending threads, or just 6 with
> the thread per engine class scheme.)
>
> Then multiply all by 10 for a 4U server use case and you get 1440
> worker kthreads, yes 10 more CT locks, but contending on how many CPU
> cores? Just so they can grab a timeslice and maybe contend on a mutex
> as the next step.
>
> This example is where it would hurt on large systems. Imagine only an
> even wider media transcode card...
>
> The second example is only a single engine class used (3d desktop?)
> but with a bunch of not-runnable jobs queued and waiting on a fence
> to signal. Whether the dependencies are implicit or explicit doesn't
> matter. Then the fence signals and the callbacks run. N work items
> get scheduled, but they all submit to the same HW engine. So we end
> up with:
>
>           /-- wi1 --\
>          /    ..     \
>  cb ---+---- wi.. ----+--- rq1 -- .. -- rqN
>          \    ..     /
>           \-- wiN --/
>
> All that we have achieved is waking up N CPUs to contend on the same
> lock and effectively insert the job into the same single HW queue. I
> don't see any positives there.
>
> This example I think can particularly hurt small / low power devices
> because of needless waking up of many cores for no benefit. Granted,
> I don't have a good feel on how common this pattern is in practice.
> >
> > That is the number which drives the maximum number of not-runnable
> > jobs that can become runnable at once, and hence spawn that many
> > work items, and in turn unbound worker threads.
> >
> > Several problems there.
> >
> > It is fundamentally pointless to have potentially that many more
> > threads than the number of CPU cores - it simply creates a
> > scheduling storm.
> >
> > Unbound workers have no CPU / cache locality either and no
> > connection with the CPU scheduler to optimize scheduling patterns.
> > This may matter either on large systems or on small ones. Whereas
> > the current design allows the CPU scheduler to notice that a
> > userspace CPU thread keeps waking up the same drm scheduler kernel
> > thread, and so can keep them on the same CPU, the unbound workers
> > lose that ability and so a 2nd CPU might be getting woken up from
> > low-power sleep for every submission.
> >
> > Hence, apart from being a bit of an impedance mismatch, the
> > proposal has the potential to change performance and power patterns
> > on both large and small machines.
> >
> > Ok, thanks for explaining the issue you're seeing in more detail.
> > Yes, deferred kwork does appear to mismatch somewhat with what the
> > scheduler needs or at least how it's worked in the past. How much
> > impact will that mismatch have? Unclear.
> >
> > > >>> Secondly, it probably demands separate workers (not optional),
> > > >>> otherwise the behaviour of shared workqueues has either the
> > > >>> potential to explode the number of kernel threads anyway, or
> > > >>> to add latency.
> > > >>
> > > >> Right now the system_unbound_wq is used which does have a limit
> > > >> on the number of threads, right? I do have a FIXME to allow a
> > > >> worker to be passed in, similar to TDR.
> > > >>
> > > >> WRT latency, the 1:1 ratio could actually have lower latency as
> > > >> 2 GPU schedulers can be pushing jobs into the backend /
> > > >> cleaning up jobs in parallel.
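
(For context on the TDR comparison: drm_sched_init() already accepts a
driver-provided workqueue for the timeout path. Sketched from memory
of the ~6.1-era signature - check the actual header; the FIXME would
presumably add an analogous driver-chosen queue for submission.)

#include <drm/gpu_scheduler.h>

int toy_sched_setup(struct drm_gpu_scheduler *sched,
		    const struct drm_sched_backend_ops *ops,
		    struct workqueue_struct *tdr_wq, struct device *dev)
{
	return drm_sched_init(sched, ops,
			      64,	/* hw_submission limit */
			      0,	/* hang_limit */
			      HZ,	/* job timeout */
			      tdr_wq,	/* driver-chosen TDR workqueue */
			      NULL,	/* score */
			      "toy", dev);
}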
> > > >>
> > > >
> > > > Thought of one more point here on why in Xe we absolutely want a
> > > > 1 to 1 ratio between entity and scheduler - the way we implement
> > > > timeslicing for preempt fences.
> > > >
> > > > Let me try to explain.
> > > >
> > > > Preempt fences are implemented via the generic messaging
> > > > interface [1] with suspend / resume messages. If a suspend
> > > > message is received too soon after calling resume (this is per
> > > > entity) we simply sleep in the suspend call, thus giving the
> > > > entity a timeslice. This completely falls apart with a many to 1
> > > > relationship as now an entity waiting for a timeslice blocks the
> > > > other entities. Could we work around this? Sure, but it is just
> > > > another bunch of code we'd have to add in Xe. Being able to
> > > > freely sleep in the backend without affecting other entities is
> > > > really, really nice IMO and I bet Xe isn't the only driver that
> > > > is going to feel this way.
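
(A sketch of the timeslicing trick described above, with made-up names
and a made-up 10 ms slice: if suspend arrives too soon after resume,
sleep out the remainder before suspending - safe precisely because
this back-end serves a single entity.)

#include <linux/jiffies.h>
#include <linux/delay.h>

#define TOY_TIMESLICE_MS	10	/* hypothetical slice length */

struct toy_entity {
	unsigned long resume_time;	/* jiffies at the last resume */
};

static void toy_handle_suspend_msg(struct toy_entity *e)
{
	unsigned long end = e->resume_time +
			    msecs_to_jiffies(TOY_TIMESLICE_MS);

	/* Too soon after resume? Sleep out the rest of the timeslice.
	 * Only this entity's scheduler blocks, nobody else's. */
	if (time_before(jiffies, end))
		msleep(jiffies_to_msecs(end - jiffies));

	/* ...now actually deliver the suspend to the firmware. */
}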
> > > >
> > > > Last thing I'll say: regardless of how anyone feels about Xe
> > > > using a 1 to 1 relationship, this patch IMO makes sense, as I
> > > > hope we can all agree a workqueue scales better than kthreads.
> > >
> > > I don't know for sure what will scale better and for what use
> > > case, combination of CPU cores vs number of GPU engines to keep
> > > busy vs other system activity. But I wager someone is bound to
> > > ask for some numbers to make sure the proposal is not negatively
> > > affecting any other drivers.
> > >
> > > Then let them ask. Waving your hands vaguely in the direction of
> > > the rest of DRM and saying "Uh, someone (not me) might object" is
> > > profoundly unhelpful. Sure, someone might. That's why it's on
> > > dri-devel. If you think there's someone in particular who might
> > > have a useful opinion on this, throw them in the CC so they don't
> > > miss the e-mail thread.
> > >
> > > Or are you asking for numbers? If so, what numbers are you asking
> > > for?
> >
> > It was a heads up to the Xe team in case people weren't
> > appreciating how the proposed change has the potential to influence
> > power and performance across the board. And nothing in the follow
> > up discussion made me think it was considered, so I don't think it
> > was redundant to raise it.
> >
> > In my experience it is typical that such core changes come with
> > some numbers. Which in the case of the drm scheduler is tricky and
> > probably requires explicitly asking everyone to test (rather than
> > counting on "don't miss the email thread"). Real products can fail
> > to ship due to ten mW here or there. Like suddenly an extra core
> > being prevented from getting into deep sleep.
> >
> > If that was "profoundly unhelpful" so be it.
> >
> > With your above explanation, it makes more sense what you're
> > asking. It's still not something Matt is likely to be able to
> > provide on his own. We need to tag some other folks and ask them to
> > test it out. We could play around a bit with it on Xe but it's not
> > exactly production grade yet and is going to hit this differently
> > from most. Likely candidates are probably AMD and Freedreno.
>
> Whoever is set up to check out power and performance would be good to
> give it a spin, yes.
>
> PS. I don't think I was asking Matt to test with other devices. To
> start with, I think Xe is a team effort. I was asking for more
> background on the design decision, since patch 4/20 does not say
> anything on that angle, nor was it IMO sufficiently addressed later
> in the thread.
>
> > > Also, if we're talking about a design that might paint us into an
> > > Intel-HW-specific hole, that would be one thing. But we're not.
> > > We're talking about switching which kernel threading/task
> > > mechanism to use for what's really a very generic problem. The
> > > core Xe design works without this patch (just with more
> > > kthreads). If we land this patch or something like it and get it
> > > wrong and it causes a performance problem for someone down the
> > > line, we can revisit it.
> >
> > For some definition of "it works" - I really wouldn't suggest
> > shipping a kthread per user context at any point.
> >
> > You have yet to elaborate on why. What resources is it consuming
> > that's going to be a problem? Are you anticipating CPU affinity
> > problems? Or does it just seem wasteful?
>
> Well I don't know, the commit message says the approach does not
> scale. :)
>
> > I think I largely agree that it's probably unnecessary/wasteful but
> > reducing the number of kthreads seems like a tractable problem to
> > solve regardless of where we put the gpu_scheduler object. Is this
> > the right solution? Maybe not. It was also proposed at one point
> > that we could split the scheduler into two pieces: a scheduler
> > which owns the kthread, and a back-end which targets some HW ring
> > thing where you can have multiple back-ends per scheduler. That's
> > certainly more invasive from a DRM scheduler internal API PoV but
> > would solve the kthread problem in a way that's more similar to
> > what we have now.
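
(The shape of that split, sketched purely for illustration with
hypothetical types: one front-end owning the thread and dependency
tracking, fanning out to multiple per-ring back-ends.)

#include <linux/sched.h>

struct toy_backend_ops;			/* submit/abort for one HW ring */

struct toy_sched_backend {
	const struct toy_backend_ops *ops;
	void *ring;			/* the HW/FW ring it targets */
};

struct toy_sched_frontend {
	struct task_struct *thread;	/* the single scheduler kthread */
	struct toy_sched_backend **backends;
	unsigned int num_backends;	/* many rings per scheduler */
};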
> >
> > > In any case that's a low level question caused by the high level
> > > design decision. So I'd think first focus on the high level -
> > > which is the 1:1 mapping of entity to scheduler instance
> > > proposal.
> > >
> > > Fundamentally it will be up to the DRM maintainers and the
> > > community to bless your approach. And it is important to stress
> > > that 1:1 is about userspace contexts, so I believe it is unlike
> > > any other current scheduler user. And it is also important to
> > > stress that this effectively does not make Xe _really_ use the
> > > scheduler that much.
> > >
> > > I don't think this makes Xe nearly as much of a one-off as you
> > > think it does. I've already told the Asahi team working on Apple
> > > M1/2 hardware to do it this way and it seems to be a pretty good
> > > mapping for them. I believe this is roughly the plan for nouveau
> > > as well. It's not the way it currently works for anyone because
> > > most other groups aren't doing FW scheduling yet. In the world of
> > > FW scheduling and hardware designed to support userspace
> > > direct-to-FW submit, I think the design makes perfect sense (see
> > > below) and I expect we'll see more drivers move in this direction
> > > as those drivers evolve. (AMD is doing some customish thing with
> > > gpu_scheduler on the front-end somehow. I've not dug into those
> > > details.)
> > >
> > > I can only offer my opinion, which is that the two options
> > > mentioned in this thread (either improve the drm scheduler to
> > > cope with what is required, or split up the code so you can use
> > > just the parts of drm_sched which you want - which is frontend
> > > dependency tracking) shouldn't be so readily dismissed, given how
> > > I think the idea was for the new driver to work less in a silo
> > > and more in the community (not do kludges to work around stuff
> > > because it is thought to be too hard to improve the common code),
> > > but fundamentally, "goto previous paragraph" as far as I am
> > > concerned.
> > >
> > > Meta comment: It appears as if you're falling into the standard
> > > i915 team trap of having an internal discussion about what the
> > > community discussion might look like instead of actually having
> > > the community discussion. If you are seriously concerned about
> > > interactions with other drivers or with setting common direction,
> > > the right way to do that is to break a patch or two out into a
> > > separate RFC series and tag a handful of driver maintainers.
> > > Trying to predict the questions other people might ask is
> > > pointless. Cc them and ask for their input instead.
> >
> > I don't follow you here. It's not an internal discussion - I am
> > raising my concerns on the design publicly. Am I supposed to write
> > a patch to show something, but not allowed to comment on an RFC
> > series?
> >
> > I may have misread your tone a bit. It felt a bit like too many
> > discussions I've had in the past where people are trying to predict
> > what others will say instead of just asking them. Reading it again,
> > I was probably jumping to conclusions a bit. Sorry about that.
>
> Okay no problem, thanks. In any case we don't have to keep discussing
> it; as I wrote one or two emails ago, it is fundamentally on the
> maintainers and community to ack the approach. I only felt the RFC
> did not explain the potential downsides sufficiently, so I wanted to
> probe that area a bit.
<br>
> It is "drm/sched: Convert drm scheduler to use a work queue rather than<br>
> kthread" which should have Cc-ed _everyone_ who use drm scheduler.<br>
> <br>
> <br>
> Yeah, it probably should have. I think that's mostly what I've been <br>
> trying to say.<br>
> <br>
> ><br>
> > Regards,<br>
> ><br>
> > Tvrtko<br>
> ><br>
> > P.S. And as a related side note, there are more areas where<br>
> drm_sched<br>
> > could be improved, like for instance priority handling.<br>
> > Take a look at msm_submitqueue_create /<br>
> msm_gpu_convert_priority /<br>
> > get_sched_entity to see how msm works around the drm_sched<br>
> hardcoded<br>
> > limit of available priority levels, in order to avoid having<br>
> to leave a<br>
> > hw capability unused. I suspect msm would be happier if they<br>
> could have<br>
> > all priority levels equal in terms of whether they apply only<br>
> at the<br>
> > frontend level or completely throughout the pipeline.<br>
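
(Paraphrasing the msm workaround from memory - consult
msm_gpu_convert_priority() for the real code: a single userspace
priority index is unfolded into a ring number plus one of drm_sched's
few priority levels.)

#include <linux/errno.h>

#define NR_SCHED_PRIORITIES	3	/* stand-in for drm_sched's range */

static int toy_convert_priority(unsigned int nr_rings, unsigned int prio,
				unsigned int *ring_nr, int *sched_prio)
{
	unsigned int rn = prio / NR_SCHED_PRIORITIES;
	unsigned int sp = prio % NR_SCHED_PRIORITIES;

	if (rn >= nr_rings)
		return -EINVAL;

	*ring_nr = rn;
	/* drm_sched's priority order is inverted relative to msm's. */
	*sched_prio = NR_SCHED_PRIORITIES - 1 - sp;
	return 0;
}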
> > > >
> > > > [1] https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> > > >
> > > >
> > > >>> What would be interesting to learn is whether the option of
> > > >>> refactoring drm_sched to deal with out of order completion
> > > >>> was considered and what the conclusions were.
> > > >>
> > > >> I coded this up a while back when trying to convert the i915
> > > >> to the DRM scheduler; it isn't all that hard either. The free
> > > >> flow control on the ring (e.g. set job limit == SIZE OF RING /
> > > >> MAX JOB SIZE) is really what sold me on this design.
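
(Worked through with made-up numbers: a 16 KiB ring and a 256-byte
worst-case job give a job limit of 64, so a scheduler respecting that
limit can never overflow the ring and no per-job space accounting is
needed.)

#define TOY_RING_BYTES		(16 * 1024)	/* hypothetical ring size */
#define TOY_MAX_JOB_BYTES	256		/* worst-case job footprint */
#define TOY_JOB_LIMIT		(TOY_RING_BYTES / TOY_MAX_JOB_BYTES) /* 64 */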
> > >
> > > You're not the only one to suggest supporting out-of-order
> > > completion. However, it's tricky and breaks a lot of internal
> > > assumptions of the scheduler. It also reduces functionality a bit
> > > because it can no longer automatically rate-limit HW/FW queues
> > > which are often fixed-size. (Ok, yes, it probably could but it
> > > becomes a substantially harder problem.)
> > >
> > > It also seems like a worse mapping to me. The goal here is to
> > > turn submissions on a userspace-facing engine/queue into
> > > submissions to a FW queue, sorting out any dma_fence
> > > dependencies. Matt's description of saying this is a 1:1 mapping
> > > between sched/entity doesn't tell the whole story. It's a 1:1:1
> > > mapping between xe_engine, gpu_scheduler, and GuC FW engine. Why
> > > make it a 1:something:1 mapping? Why is that better?
> >
> > As I have stated before, what I think would fit well for Xe is one
> > drm_scheduler per engine class. In specific terms, on our current
> > hardware: one drm scheduler instance for render, compute, blitter,
> > video and video enhance. Userspace contexts remain scheduler
> > entities.
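
(The shape of that alternative, sketched with hypothetical names: five
scheduler instances per device, one per engine class, with each
userspace context staying a plain drm_sched_entity on its class's
scheduler.)

#include <drm/gpu_scheduler.h>

enum toy_engine_class {
	TOY_CLASS_RENDER, TOY_CLASS_COMPUTE, TOY_CLASS_COPY,
	TOY_CLASS_VIDEO, TOY_CLASS_VIDEO_ENHANCE, TOY_NUM_CLASSES,
};

struct toy_device {
	struct drm_gpu_scheduler sched[TOY_NUM_CLASSES];
};

static int toy_context_init(struct toy_device *tdev,
			    struct drm_sched_entity *entity,
			    enum toy_engine_class class)
{
	struct drm_gpu_scheduler *sched = &tdev->sched[class];

	/* Context = entity; one scheduler instance per engine class. */
	return drm_sched_entity_init(entity, DRM_SCHED_PRIORITY_NORMAL,
				     &sched, 1, NULL);
}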
> >
> > And this is where we fairly strongly disagree. More in a bit.
> >
> > That way you avoid the whole kthread/kworker story and you have it
> > actually use the entity picking code in the scheduler, which may be
> > useful when the backend is congested.
> >
> > What back-end congestion are you referring to here? Running out of
> > FW queue IDs? Something else?
>
> The CT channel, and the number of context ids.
>
> >
> > Yes, you have to solve the out of order problem, so in my mind that
> > is something to discuss: what the problem actually is (just TDR?),
> > how tricky it is and why, etc.
> >
> > And yes, you lose the handy LRCA ring buffer size management, so
> > you'd have to make those entities not runnable in some other way.
> >
> > Regarding the argument you raise below - would any of that make the
> > frontend / backend separation worse, and why? Do you think it is
> > less natural? If neither is true then all that remains is that the
> > extra work to support out of order completion of entities appears
> > to have been discounted in favour of an easy but IMO inelegant
> > option.
> >
> > Broadly speaking, the kernel needs to stop thinking about GPU
> > scheduling in terms of scheduling jobs and start thinking in terms
> > of scheduling contexts/engines. There is still some need for
> > scheduling individual jobs but that is only for the purpose of
> > delaying them as needed to resolve dma_fence dependencies. Once
> > dependencies are resolved, they get shoved onto the context/engine
> > queue and from there the kernel only really manages whole
> > contexts/engines. This is a major architectural shift, entirely
> > different from the way i915 scheduling works. It's also different
> > from the historical usage of DRM scheduler which I think is why
> > this all looks a bit funny.
> >
> > To justify this architectural shift, let's look at where we're
> > headed. In the glorious future...
> >
> >  1. Userspace submits directly to firmware queues. The kernel has
> >     no visibility whatsoever into individual jobs. At most it can
> >     pause/resume FW contexts as needed to handle eviction and
> >     memory management.
> >
> >  2. Because of 1, apart from handing out the FW queue IDs at the
> >     beginning, the kernel can't really juggle them that much.
> >     Depending on FW design, it may be able to pause a client, give
> >     its IDs to another, and then resume it later when IDs free up.
> >     What it's not doing is juggling IDs on a job-by-job basis like
> >     i915 currently is.
> >
> >  3. Long-running compute jobs may not complete for days. This
> >     means that memory management needs to happen in terms of
> >     pause/resume of entire contexts/engines using the memory rather
> >     than based on waiting for individual jobs to complete or
> >     pausing individual jobs until the memory is available.
> >
> >  4. Synchronization happens via userspace memory fences (UMF) and
> >     the kernel is mostly unaware of most dependencies and when a
> >     context/engine is or is not runnable. Instead, it keeps as many
> >     of them minimally active (memory is available, even if it's in
> >     system RAM) as possible and lets the FW sort out dependencies.
> >     (There may need to be some facility for sleeping a context
> >     until a memory change, similar to futex() or poll() for
> >     userspace threads. There are some details TBD.)
> >
> > Are there potential problems that will need to be solved here? Yes.
> > Is it a good design? Well, Microsoft has been living in this future
> > for half a decade or better and it's working quite well for them.
> > It's also the way all modern game consoles work. It really is just
> > Linux that's stuck with the same old job model we've had since the
> > monumental shift to DRI2.
> >
> > To that end, one of the core goals of the Xe project was to make
> > the driver internally behave as close to the above model as
> > possible while keeping the old-school job model as a very thin
> > layer on top. As the broader ecosystem problems (window-system
> > support for UMF, for instance) are solved, that layer can be peeled
> > back. The core driver will already be ready for it.
> >
> > To that end, the point of the DRM scheduler in Xe isn't to schedule
> > jobs. It's to resolve syncobj and dma-buf implicit sync
> > dependencies and stuff jobs into their respective context/engine
> > queue once they're ready. All the actual scheduling happens in
> > firmware and any scheduling the kernel does to deal with
> > contention, oversubscription, too many contexts, etc. is between
> > contexts/engines, not individual jobs. Sure, the individual job
> > visibility is nice, but if we design around it, we'll never get to
> > the glorious future.
> >
> > I really need to turn the above (with a bit more detail) into a
> > blog post.... Maybe I'll do that this week.
> >
> > In any case, I hope that provides more insight into why Xe is
> > designed the way it is and why I'm pushing back so hard on trying
> > to make it more of a "classic" driver as far as scheduling is
> > concerned. Are there potential problems here? Yes, that's why Xe
> > has been labeled a prototype. Are such radical changes necessary to
> > get to said glorious future? Yes, I think they are. Will it be
> > worth it? I believe so.
>
> Right, that's all solid I think. My takeaway is that frontend
> priority sorting and that stuff isn't needed and that is okay. And
> that there are multiple options to maybe improve the drm scheduler,
> like the aforementioned making it deal with out of order completion,
> or splitting it into functional components, or splitting the
> frontend/backend as you suggested. For most of them the cost vs
> benefit is not completely clear, nor is it clear how much effort was
> invested to look into them.
>
> One thing I missed from this explanation is how drm_scheduler per
> engine class interferes with the high level concepts. And I did not
> manage to pick up on what exactly the TDR problem is in that case.
> Maybe the two are one and the same.
>
> Bottom line is I still have the concern that the conversion to
> kworkers has an opportunity to regress. Possibly more opportunity for
> some Xe use cases than to affect other vendors, since they would
> still be using per physical engine / queue scheduler instances.
>
> And to put my money where my mouth is, I will try to put testing Xe
> inside the full blown ChromeOS environment into my team's plans. It
> would probably also be beneficial if the Xe team could take a look at
> the real world behaviour of the extreme transcode use cases too. If
> the stack is ready for that and all. It would be better to know
> earlier rather than later if there is a fundamental issue.
>
> For the patch at hand, and the cover letter, it certainly feels it
> would benefit to record the past design discussions had with AMD
> folks, to explicitly copy other drivers, and to record the
> theoretical pros and cons of threads vs unbound workers as I have
> tried to highlight them.
>
> Regards,
>
> Tvrtko