[Intel-gfx] [PATCH 36/56] drm/i915: Fair low-latency scheduling

Chris Wilson chris at chris-wilson.co.uk
Thu Jan 7 16:45:31 UTC 2021


Quoting Matthew Brost (2021-01-07 16:05:07)
> On Tue, Dec 29, 2020 at 12:01:25PM +0000, Chris Wilson wrote:
> > The first "scheduler" was a topological sorting of requests into
> > priority order. The execution order was deterministic, the earliest
> > submitted, highest priority request would be executed first. Priority
> > inheritance ensured that inversions were kept at bay, and allowed us to
> > dynamically boost priorities (e.g. for interactive pageflips).
> > 
> > The minimalistic timeslicing scheme was an attempt to introduce fairness
> > between long running requests, by evicting the active request at the end
> > of a timeslice and moving it to the back of its priority queue (while
> > ensuring that dependencies were kept in order). For short running
> > requests from many clients of equal priority, the scheme is still very
> > much FIFO submission ordering, and as unfair as before.
> > 
> > To impose fairness, we need an external metric that ensures that clients
> > are interspersed, so we don't execute one long chain from client A before
> > executing any of client B. This could be imposed by the clients
> > themselves by using fences based on an external clock; that is, they only
> > submit work for a "frame" at frame-intervals, instead of submitting as
> > much work as they are able to. The standard SwapBuffers approach is akin
> > to double buffering, where, as one frame is being executed, the next is
> > being submitted, such that there is always a maximum of two frames per
> > client in the pipeline and so ideally maintains consistent input-output
> > latency. Even this scheme exhibits unfairness under load as a single
> > client will execute two frames back to back before the next, and with
> > enough clients, deadlines will be missed.
> > 
> > The idea introduced by BFS/MuQSS is that fairness is introduced by
> > metering with an external clock. Every request, when it becomes ready to
> > execute, is assigned a virtual deadline, and execution order is then
> > determined by earliest deadline. Priority is used as a hint, rather than
> > strict ordering, where high priority requests have earlier deadlines,
> > but not necessarily earlier than outstanding work. Thus work is executed
> > in order of 'readiness', with timeslicing to demote long running work.
> > 
> > The Achilles' heel of this scheduler is its strong preference for
> > low-latency and favouring of new queues. Whereas it was easy to dominate
> > the old scheduler by flooding it with many requests over a short period
> > of time, the new scheduler can be dominated by a 'synchronous' client
> > that waits for each of its requests to complete before submitting the
> > next. As such a client has no history, it is always considered
> > ready-to-run and receives an earlier deadline than the long running
> > requests. This is compensated for by refreshing the current execution's
> > deadline and by disallowing preemption for timeslice shuffling.
> > 
> > To check the impact on throughput (often the downfall of latency
> > sensitive schedulers), we used gem_wsim to simulate various transcode
> > workloads with different load balancers, and varying the number of
> > competing [heterogeneous] clients.
> > 
> > +delta%------------------------------------------------------------------+
> > |                                a                                       |
> > |                                a                                       |
> > |                                aa                                      |
> > |                                aa                                      |
> > |                                aa                                      |
> > |                                aa                                      |
> > |                               aaa                                      |
> > |                              aaaa                                      |
> > |                           a  aaaaa                                     |
> > |                           a aaaaaa                                     |
> > |a              aa   a      aaaaaaaaaa aa               a               a|
> > |                                A_|                                     |
> > +------------------------------------------------------------------------+
> >    N          Min           Max        Median           Avg        Stddev
> >  108   -23.982194     28.421527  -0.077474828  -0.072650418    0.16179718
> > 
> > The impact was on average 0.1% under contention due to the change in
> > context execution order and number of context switches. The biggest
> > swings are due to the execution ordering favouring one client or another,
> > so there may be room for improvement there.
> > 
> 
> I haven't dug into this series deeply, but it does seem plausible for
> execlist submission. However, this new scheduler seems completely broken
> for GuC submission.

Underneath it's the same topological sort, just with a different key.
At worst, that key can be the plain priority again. But that's pretty
much immaterial in terms of deciding the order in which requests are
transferred from one queue to another. Execution order wants to be based
on resource prioritisation (beyond request prioritisation); ideally we
would keep our fine-grained scheduling control.
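
To make the "same sort, different key" point concrete, here is a rough
sketch. All sketch_* names are purely illustrative, not the actual i915
structures, and the deadline computation is deliberately simplified:

#include <stdbool.h>
#include <stdint.h>

struct sketch_request {
	int prio;		/* priority, now only a hint */
	uint64_t readiness;	/* when the request became ready (ns) */
	uint64_t budget;	/* estimated runtime for this request (ns) */
};

/* Higher priority shrinks the slack, giving an earlier virtual deadline. */
static uint64_t sketch_deadline(const struct sketch_request *rq)
{
	uint64_t slack = rq->budget / (1u + (rq->prio > 0 ? rq->prio : 0));

	return rq->readiness + slack;
}

/*
 * Key comparison used when moving requests between queues. With
 * use_deadlines == false it degenerates to the old strict priority
 * ordering, which is all a firmware scheduler needs to be handed.
 * (On timeslice expiry the active request's readiness would be bumped
 * to "now", refreshing its deadline so a synchronous client cannot
 * perpetually jump ahead of it.)
 */
static bool sketch_runs_before(const struct sketch_request *a,
			       const struct sketch_request *b,
			       bool use_deadlines)
{
	if (!use_deadlines)
		return a->prio > b->prio;

	return sketch_deadline(a) < sketch_deadline(b);
}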

I'm not convinced by the state of the fw's scheduler...

> The scheduler really isn't used that much with GuC
> submission, but when it is, the old one works perfectly. The vfunc for
> the scheduler is deleted in patch #25. I don't think that is a good idea,
> as it appears we are trending towards two separate schedulers.

You still use the i915_request_set_priority() entry point as the
caller into the backend bypass.
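
Roughly what I mean, as a sketch only and reusing struct sketch_request
from above (names are illustrative, not the actual backend interface):
one common entry point for priority changes, with the backend deciding
what the hint means.

/* Sketch only: common entry point, backend-specific reaction. */
struct sketch_backend_ops {
	/* e.g. recompute a deadline, or forward the raw value to the fw */
	void (*priority_changed)(struct sketch_request *rq, int prio);
};

static void sketch_set_priority(struct sketch_request *rq, int prio,
				const struct sketch_backend_ops *ops)
{
	if (prio <= rq->prio)
		return;		/* in this sketch, only boosts propagate */

	rq->prio = prio;
	if (ops->priority_changed)
		ops->priority_changed(rq, prio);
}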

Isn't the current guc.fw a second closed driver for the same HW anyway?

> Lastly, this series seems to be on the wrong list; I believe it should be
> posted elsewhere first.

It's the right place for upstreaming bug fixes on stable platforms. And
after a year of being here, that's a moot point.
-Chris

