[Intel-gfx] [PATCH 26/28] drm/i915: Fair low-latency scheduling

Tue Jun 16 10:12:55 UTC 2020

Quoting Thomas Hellström (Intel) (2020-06-16 10:07:28)
> Hi, Chris,
> 
> Some comments and questions:
> 
> On 6/8/20 12:21 AM, Chris Wilson wrote:
> > The first "scheduler" was a topographical sorting of requests into
> > priority order. The execution order was deterministic, the earliest
> > submitted, highest priority request would be executed first. Priority
> > inherited ensured that inversions were kept at bay, and allowed us to
> > dynamically boost priorities (e.g. for interactive pageflips).
> >
> > The minimalistic timeslicing scheme was an attempt to introduce fairness
> > between long running requests, by evicting the active request at the end
> > of a timeslice and moving it to the back of its priority queue (while
> > ensuring that dependencies were kept in order). For short running
> > requests from many clients of equal priority, the scheme is still very
> > much FIFO submission ordering, and as unfair as before.
> >
> > To impose fairness, we need an external metric that ensures that clients
> > are interpersed, we don't execute one long chain from client A before
> > executing any of client B. This could be imposed by the clients by using
> > a fences based on an external clock, that is they only submit work for a
> > "frame" at frame-interval, instead of submitting as much work as they
> > are able to. The standard SwapBuffers approach is akin to double
> > bufferring, where as one frame is being executed, the next is being
> > submitted, such that there is always a maximum of two frames per client
> > in the pipeline. Even this scheme exhibits unfairness under load as a
> > single client will execute two frames back to back before the next, and
> > with enough clients, deadlines will be missed.
> >
> > The idea introduced by BFS/MuQSS is that fairness is introduced by
> > metering with an external clock. Every request, when it becomes ready to
> > execute is assigned a virtual deadline, and execution order is then
> > determined by earliest deadline. Priority is used as a hint, rather than
> > strict ordering, where high priority requests have earlier deadlines,
> > but not necessarily earlier than outstanding work. Thus work is executed
> > in order of 'readiness', with timeslicing to demote long running work.
> >
> > The Achille's heel of this scheduler is its strong preference for
> > low-latency and favouring of new queues. Whereas it was easy to dominate
> > the old scheduler by flooding it with many requests over a short period
> > of time, the new scheduler can be dominated by a 'synchronous' client
> > that waits for each of its requests to complete before submitting the
> > next. As such a client has no history, it is always considered
> > ready-to-run and receives an earlier deadline than the long running
> > requests.
> >
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  12 +-
> >   .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
> >   drivers/gpu/drm/i915/gt/intel_engine_pm.c     |   4 +-
> >   drivers/gpu/drm/i915/gt/intel_engine_types.h  |  24 --
> >   drivers/gpu/drm/i915/gt/intel_lrc.c           | 328 +++++++-----------
> >   drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   5 +-
> >   drivers/gpu/drm/i915/gt/selftest_lrc.c        |  43 ++-
> >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   6 +-
> >   drivers/gpu/drm/i915/i915_priolist_types.h    |   7 +-
> >   drivers/gpu/drm/i915/i915_request.h           |   4 +-
> >   drivers/gpu/drm/i915/i915_scheduler.c         | 322 ++++++++++++-----
> >   drivers/gpu/drm/i915/i915_scheduler.h         |  22 +-
> >   drivers/gpu/drm/i915/i915_scheduler_types.h   |  17 +
> >   .../drm/i915/selftests/i915_mock_selftests.h  |   1 +
> >   drivers/gpu/drm/i915/selftests/i915_request.c |   1 +
> >   .../gpu/drm/i915/selftests/i915_scheduler.c   |  49 +++
> >   16 files changed, 484 insertions(+), 362 deletions(-)
> >   create mode 100644 drivers/gpu/drm/i915/selftests/i915_scheduler.c
> 
> Do we have timings to back this change up? Would it make sense to have a 
> configurable scheduler choice?

Yes, there's igt/benchmarks/gem_wsim to show the impact on scheduling
decisions for various workloads. (You can guess what the impact of
choosing a different execution order and forcing more context switches
will be... About -1% to throughput with multiple clients) And
igt/tests/gem_exec_schedule to test basic properties, with a bunch of new
fairness tests to try and decide if this is the right thing. Under
saturated conditions, there is no contest, a fair scheduler produces
consistent results, and the vdeadlines allow for realtime-response under
load.

> > @@ -1096,22 +1099,30 @@ __unwind_incomplete_requests(struct intel_engine_cs *engine)
> >   {
> >       struct i915_request *rq, *rn, *active = NULL;
> >       struct list_head *uninitialized_var(pl);
> > -     int prio = I915_PRIORITY_INVALID;
> > +     u64 deadline = I915_DEADLINE_NEVER;
> >   
> >       lockdep_assert_held(&engine->active.lock);
> >   
> >       list_for_each_entry_safe_reverse(rq, rn,
> >                                        &engine->active.requests,
> >                                        sched.link) {
> > -             if (i915_request_completed(rq))
> > +             if (i915_request_completed(rq)) {
> > +                     list_del_init(&rq->sched.link);
> >                       continue; /* XXX */
> > +             }
> 
> Is this an unrelated change? If so separate patch?

It's not totally unrelated :)

> > @@ -2162,14 +2140,13 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
> >                       __unwind_incomplete_requests(engine);
> >   
> >                       last = NULL;
> > -             } else if (need_timeslice(engine, last, ve) &&
> > -                        timeslice_expired(execlists, last)) {
> > +             } else if (timeslice_expired(engine, last)) {
> >                       ENGINE_TRACE(engine,
> > -                                  "expired last=%llx:%lld, prio=%d, hint=%d, yield?=%s\n",
> > -                                  last->fence.context,
> > -                                  last->fence.seqno,
> > -                                  last->sched.attr.priority,
> > -                                  execlists->queue_priority_hint,
> > +                                  "expired:%s last=%llx:%llu, deadline=%llu, now=%llu, yield?=%s\n",
> > +                                  yesno(timer_expired(&execlists->timer)),
> > +                                  last->fence.context, last->fence.seqno,
> > +                                  rq_deadline(last),
> > +                                  i915_sched_to_ticks(ktime_get()),
> >                                    yesno(timeslice_yield(execlists, last)));
> 
> There are multiple introductions of ktime_get() in the patch. Perhaps 
> use monotonic clock source like ktime_get_raw()? Also immediately 
> convert to ns.

ktime_get() is monotonic. The only difference is that tkr_mono has an
wall-offset that tkr_raw does not. [I'm sure there's a good reason.] The
choice is really whether ktime_get_(mono|raw)_fast_ns() is sufficient for
our needs.

I do like the idea of having the deadline being some recognisable
timestamp, as it makes it easier to play with mixing in real, albeit
soft, deadlines.

> > @@ -2837,10 +2788,7 @@ static void __execlists_unhold(struct i915_request *rq)
> >               GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
> >   
> >               i915_request_clear_hold(rq);
> > -             list_move_tail(&rq->sched.link,
> > -                            i915_sched_lookup_priolist(rq->engine,
> > -                                                       rq_prio(rq)));
> > -             set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> > +             submit |= intel_engine_queue_request(rq->engine, rq);
> 
> As new to this codebase, I immediately wonder whether that bitwise or is 
> intentional and whether you got the short-circuiting right. It looks 
> correct to me.

bool submit, not many bits :)

> > diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> > index 118ab6650d1f..23594e712292 100644
> > --- a/drivers/gpu/drm/i915/i915_request.h
> > +++ b/drivers/gpu/drm/i915/i915_request.h
> > @@ -561,7 +561,7 @@ static inline void i915_request_clear_hold(struct i915_request *rq)
> >   }
> >   
> >   static inline struct intel_timeline *
> > -i915_request_timeline(struct i915_request *rq)
> > +i915_request_timeline(const struct i915_request *rq)
> >   {
> >       /* Valid only while the request is being constructed (or retired). */
> >       return rcu_dereference_protected(rq->timeline,
> > @@ -576,7 +576,7 @@ i915_request_gem_context(struct i915_request *rq)
> >   }
> >   
> >   static inline struct intel_timeline *
> > -i915_request_active_timeline(struct i915_request *rq)
> > +i915_request_active_timeline(const struct i915_request *rq)
> 
> Are these unrelated? Separate patch?

They were used at one point, when I had a const request.

> >   {
> >       /*
> >        * When in use during submission, we are protected by a guarantee that
> > diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
> > index 4c189b81cc62..30bcb6f9d99f 100644
> > --- a/drivers/gpu/drm/i915/i915_scheduler.c
> > +++ b/drivers/gpu/drm/i915/i915_scheduler.c
> > @@ -20,6 +20,11 @@ static struct i915_global_scheduler {
> >   static DEFINE_SPINLOCK(ipi_lock);
> >   static LIST_HEAD(ipi_list);
> >   
> > +static inline u64 rq_deadline(const struct i915_request *rq)
> > +{
> > +     return READ_ONCE(rq->sched.deadline);
> > +}
> > +
> 
> Does this need a release barrier paired with an acquire barrier in 
> __i915_request_set_deadline below?

No, the state can be inconsistent. If it changes as we are processing
the previous value, there will be another reschedule. Within
set_deadline, rq->sched.deadline is under the engine->active.lock, it is
just that rq_deadline() is used to peek before we take the lock, as well
as shorthand within the critical section.

> > +static bool __i915_request_set_deadline(struct i915_request *rq, u64 deadline)
> > +{
> > +     struct intel_engine_cs *engine = rq->engine;
> > +     struct i915_request *rn;
> > +     struct list_head *plist;
> > +     LIST_HEAD(dfs);
> > +
> > +     lockdep_assert_held(&engine->active.lock);
> > +     list_add(&rq->sched.dfs, &dfs);
> > +
> > +     list_for_each_entry(rq, &dfs, sched.dfs) {
> > +             struct i915_dependency *p;
> > +
> > +             GEM_BUG_ON(rq->engine != engine);
> > +
> > +             for_each_signaler(p, rq) {
> > +                     struct i915_request *s =
> > +                             container_of(p->signaler, typeof(*s), sched);
> > +
> > +                     GEM_BUG_ON(s == rq);
> > +
> > +                     if (rq_deadline(s) <= deadline)
> > +                             continue;
> > +
> > +                     if (i915_request_completed(s))
> > +                             continue;
> > +
> > +                     if (s->engine != rq->engine) {
> > +                             spin_lock(&ipi_lock);
> > +                             if (deadline < p->ipi_deadline) {
> > +                                     p->ipi_deadline = deadline;
> > +                                     list_move(&p->ipi_link, &ipi_list);
> > +                                     irq_work_queue(&ipi_work);
> > +                             }
> > +                             spin_unlock(&ipi_lock);
> > +                             continue;
> > +                     }
> > +
> > +                     list_move_tail(&s->sched.dfs, &dfs);
> > +             }
> > +     }
> > +
> > +     plist = i915_sched_lookup_priolist(engine, deadline);
> > +
> > +     /* Fifo and depth-first replacement ensure our deps execute first */
> > +     list_for_each_entry_safe_reverse(rq, rn, &dfs, sched.dfs) {
> > +             GEM_BUG_ON(rq->engine != engine);
> > +             GEM_BUG_ON(deadline > rq_deadline(rq));
> > +
> > +             INIT_LIST_HEAD(&rq->sched.dfs);
> > +             WRITE_ONCE(rq->sched.deadline, deadline);
> 
> An smp barrier needed?

No. It is locked by engine->active.lock, with a couple of peeks before
the lock that do not require serialisation with other changes.

> > +static u64 prio_slice(int prio)
> >   {
> > -     const struct i915_request *inflight;
> > +     u64 slice;
> > +     int sf;
> >   
> >       /*
> > -      * We only need to kick the tasklet once for the high priority
> > -      * new context we add into the queue.
> > +      * With a 1ms scheduling quantum:
> > +      *
> > +      *   MAX USER:  ~32us deadline
> > +      *   0:         ~16ms deadline
> > +      *   MIN_USER: 1000ms deadline
> >        */
> > -     if (prio <= engine->execlists.queue_priority_hint)
> > -             return;
> >   
> > -     rcu_read_lock();
> > +     if (prio >= __I915_PRIORITY_KERNEL__)
> > +             return INT_MAX - prio;
> >   
> > -     /* Nothing currently active? We're overdue for a submission! */
> > -     inflight = execlists_active(&engine->execlists);
> > -     if (!inflight)
> > -             goto unlock;
> > +     slice = __I915_PRIORITY_KERNEL__ - prio;
> > +     if (prio >= 0)
> > +             sf = 20 - 6;
> > +     else
> > +             sf = 20 - 1;
> > +
> > +     return slice << sf;
> > +}
> > +
> 
> Is this the same deadline calculation as used in the BFS? Could you 
> perhaps add a pointer to some documentation?

It is a heuristic. The scale factor in BFS is designed for a smaller
range and is not effective for passing our existing priority ordering
tests.

The challenge is to pick something that is fair that roughly matches
usage. It basically says that if client A submits 3 requests, then
client B, C will be able to run before the later requests of client A so
long as they are submitted within 16ms. Currently we get AAABC,
the vdeadlines turn that into ABCAA. So we would ideally like the quota
for each client to reflect their needs, so if client A needed all 3
requests within 16ms, it would have a vdeadline closer to 5ms (and so it
would compete for the GPU against other clients). Now with this less
strict priority system we can let normal userspace bump their
priorities, or we can use the average context runtime to try and adjust
priorities on the fly (i.e. do not used an unbias quota). But I suspect
removing any fairness will skew the scheduler once more.
-Chris