[Intel-gfx] [PATCH 26/28] drm/i915: Fair low-latency scheduling
Thomas Hellström (Intel)
thomas_os at shipmail.org
Tue Jun 16 09:07:28 UTC 2020
Hi, Chris,
Some comments and questions:
On 6/8/20 12:21 AM, Chris Wilson wrote:
> The first "scheduler" was a topological sorting of requests into
> priority order. The execution order was deterministic, the earliest
> submitted, highest priority request would be executed first. Priority
> inheritance ensured that inversions were kept at bay, and allowed us to
> dynamically boost priorities (e.g. for interactive pageflips).
>
> The minimalistic timeslicing scheme was an attempt to introduce fairness
> between long running requests, by evicting the active request at the end
> of a timeslice and moving it to the back of its priority queue (while
> ensuring that dependencies were kept in order). For short running
> requests from many clients of equal priority, the scheme is still very
> much FIFO submission ordering, and as unfair as before.
>
> To impose fairness, we need an external metric that ensures that clients
> are interspersed, so we don't execute one long chain from client A before
> executing any of client B. This could be imposed by the clients by using
> fences based on an external clock, that is they only submit work for a
> "frame" at frame-interval, instead of submitting as much work as they
> are able to. The standard SwapBuffers approach is akin to double
> buffering, where, as one frame is being executed, the next is being
> submitted, such that there is always a maximum of two frames per client
> in the pipeline. Even this scheme exhibits unfairness under load as a
> single client will execute two frames back to back before the next, and
> with enough clients, deadlines will be missed.
>
> The idea introduced by BFS/MuQSS is that fairness is introduced by
> metering with an external clock. Every request, when it becomes ready to
> execute is assigned a virtual deadline, and execution order is then
> determined by earliest deadline. Priority is used as a hint, rather than
> strict ordering, where high priority requests have earlier deadlines,
> but not necessarily earlier than outstanding work. Thus work is executed
> in order of 'readiness', with timeslicing to demote long running work.
>
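As an aside for readers following along, the deadline scheme described above can be sketched as follows. All names and the units are illustrative, not the i915 code:

```c
/* Minimal sketch of virtual-deadline ordering: every request is
 * stamped when it becomes ready, and execution order is then simply
 * earliest-deadline-first. Illustrative names, not the i915 code. */
#include <assert.h>
#include <stdint.h>

struct sketch_request {
	uint64_t deadline;	/* virtual deadline, ns */
};

/* When a request becomes ready it is stamped with now + slice(prio);
 * a higher priority only shrinks the slice, it does not jump the queue. */
static uint64_t sketch_assign_deadline(uint64_t now_ns, uint64_t slice_ns)
{
	return now_ns + slice_ns;
}

/* Earliest deadline runs first. */
static int sketch_runs_first(const struct sketch_request *a,
			     const struct sketch_request *b)
{
	return a->deadline <= b->deadline;
}
```

Note how a high-priority request that becomes ready late still lands behind work whose deadlines have already passed, which is the "hint, rather than strict ordering" behaviour described above.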
> The Achilles' heel of this scheduler is its strong preference for
> low-latency and favouring of new queues. Whereas it was easy to dominate
> the old scheduler by flooding it with many requests over a short period
> of time, the new scheduler can be dominated by a 'synchronous' client
> that waits for each of its requests to complete before submitting the
> next. As such a client has no history, it is always considered
> ready-to-run and receives an earlier deadline than the long running
> requests.
>
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> ---
> drivers/gpu/drm/i915/gt/intel_engine_cs.c | 12 +-
> .../gpu/drm/i915/gt/intel_engine_heartbeat.c | 1 +
> drivers/gpu/drm/i915/gt/intel_engine_pm.c | 4 +-
> drivers/gpu/drm/i915/gt/intel_engine_types.h | 24 --
> drivers/gpu/drm/i915/gt/intel_lrc.c | 328 +++++++-----------
> drivers/gpu/drm/i915/gt/selftest_hangcheck.c | 5 +-
> drivers/gpu/drm/i915/gt/selftest_lrc.c | 43 ++-
> .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 6 +-
> drivers/gpu/drm/i915/i915_priolist_types.h | 7 +-
> drivers/gpu/drm/i915/i915_request.h | 4 +-
> drivers/gpu/drm/i915/i915_scheduler.c | 322 ++++++++++++-----
> drivers/gpu/drm/i915/i915_scheduler.h | 22 +-
> drivers/gpu/drm/i915/i915_scheduler_types.h | 17 +
> .../drm/i915/selftests/i915_mock_selftests.h | 1 +
> drivers/gpu/drm/i915/selftests/i915_request.c | 1 +
> .../gpu/drm/i915/selftests/i915_scheduler.c | 49 +++
> 16 files changed, 484 insertions(+), 362 deletions(-)
> create mode 100644 drivers/gpu/drm/i915/selftests/i915_scheduler.c
Do we have timings to back this change up? Would it make sense to have a
configurable scheduler choice?
> @@ -1096,22 +1099,30 @@ __unwind_incomplete_requests(struct intel_engine_cs *engine)
> {
> struct i915_request *rq, *rn, *active = NULL;
> struct list_head *uninitialized_var(pl);
> - int prio = I915_PRIORITY_INVALID;
> + u64 deadline = I915_DEADLINE_NEVER;
>
> lockdep_assert_held(&engine->active.lock);
>
> list_for_each_entry_safe_reverse(rq, rn,
> &engine->active.requests,
> sched.link) {
> - if (i915_request_completed(rq))
> + if (i915_request_completed(rq)) {
> + list_del_init(&rq->sched.link);
> continue; /* XXX */
> + }
Is this an unrelated change? If so, separate patch?
...
> @@ -2162,14 +2140,13 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
> __unwind_incomplete_requests(engine);
>
> last = NULL;
> - } else if (need_timeslice(engine, last, ve) &&
> - timeslice_expired(execlists, last)) {
> + } else if (timeslice_expired(engine, last)) {
> ENGINE_TRACE(engine,
> - "expired last=%llx:%lld, prio=%d, hint=%d, yield?=%s\n",
> - last->fence.context,
> - last->fence.seqno,
> - last->sched.attr.priority,
> - execlists->queue_priority_hint,
> + "expired:%s last=%llx:%llu, deadline=%llu, now=%llu, yield?=%s\n",
> + yesno(timer_expired(&execlists->timer)),
> + last->fence.context, last->fence.seqno,
> + rq_deadline(last),
> + i915_sched_to_ticks(ktime_get()),
> yesno(timeslice_yield(execlists, last)));
There are multiple introductions of ktime_get() in the patch. Perhaps
use a raw monotonic clock source, not subject to NTP adjustment, like
ktime_get_raw()? Also, immediately convert to ns.
...
> @@ -2837,10 +2788,7 @@ static void __execlists_unhold(struct i915_request *rq)
> GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
>
> i915_request_clear_hold(rq);
> - list_move_tail(&rq->sched.link,
> - i915_sched_lookup_priolist(rq->engine,
> - rq_prio(rq)));
> - set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> + submit |= intel_engine_queue_request(rq->engine, rq);
As someone new to this codebase, I immediately wonder whether that
bitwise OR is intentional and whether the short-circuiting behaviour is
right. It looks correct to me.
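For other readers wondering the same: the point is that |= always evaluates its right-hand side, unlike || which short-circuits, so the queueing side effect runs for every request. A standalone illustration (queue_request() is a made-up stand-in for intel_engine_queue_request()):

```c
#include <assert.h>
#include <stdbool.h>

static int calls;	/* counts side-effect invocations */

/* Stand-in with a side effect, like queueing a request and reporting
 * whether a submission kick is needed. */
static bool queue_request(void)
{
	calls++;
	return false;
}

static bool unhold_step(bool submit)
{
	/* |= is a bitwise OR: there is no short-circuit, so
	 * queue_request() is evaluated even when submit is already
	 * true. "submit = submit || queue_request()" would skip it. */
	submit |= queue_request();
	return submit;
}
```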
...
> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> index 118ab6650d1f..23594e712292 100644
> --- a/drivers/gpu/drm/i915/i915_request.h
> +++ b/drivers/gpu/drm/i915/i915_request.h
> @@ -561,7 +561,7 @@ static inline void i915_request_clear_hold(struct i915_request *rq)
> }
>
> static inline struct intel_timeline *
> -i915_request_timeline(struct i915_request *rq)
> +i915_request_timeline(const struct i915_request *rq)
> {
> /* Valid only while the request is being constructed (or retired). */
> return rcu_dereference_protected(rq->timeline,
> @@ -576,7 +576,7 @@ i915_request_gem_context(struct i915_request *rq)
> }
>
> static inline struct intel_timeline *
> -i915_request_active_timeline(struct i915_request *rq)
> +i915_request_active_timeline(const struct i915_request *rq)
Are these unrelated? Separate patch?
> {
> /*
> * When in use during submission, we are protected by a guarantee that
> diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
> index 4c189b81cc62..30bcb6f9d99f 100644
> --- a/drivers/gpu/drm/i915/i915_scheduler.c
> +++ b/drivers/gpu/drm/i915/i915_scheduler.c
> @@ -20,6 +20,11 @@ static struct i915_global_scheduler {
> static DEFINE_SPINLOCK(ipi_lock);
> static LIST_HEAD(ipi_list);
>
> +static inline u64 rq_deadline(const struct i915_request *rq)
> +{
> + return READ_ONCE(rq->sched.deadline);
> +}
> +
Does this read need an acquire barrier, paired with a release barrier
on the WRITE_ONCE() in __i915_request_set_deadline below?
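The pairing being asked about, sketched in C11 atomics. In kernel terms this would be smp_load_acquire()/smp_store_release() rather than plain READ_ONCE()/WRITE_ONCE(); this is a sketch of the pattern, not the i915 code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t deadline_var;

/* Writer: release ordering makes writes done before this store (e.g.
 * the priolist updates) visible to any reader that observes it. */
static void sketch_set_deadline(uint64_t d)
{
	atomic_store_explicit(&deadline_var, d, memory_order_release);
}

/* Reader: acquire ordering pairs with the release store above; a plain
 * relaxed load would only guarantee the value, not the ordering. */
static uint64_t sketch_rq_deadline(void)
{
	return atomic_load_explicit(&deadline_var, memory_order_acquire);
}
```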
> +
> +static bool __i915_request_set_deadline(struct i915_request *rq, u64 deadline)
> +{
> + struct intel_engine_cs *engine = rq->engine;
> + struct i915_request *rn;
> + struct list_head *plist;
> + LIST_HEAD(dfs);
> +
> + lockdep_assert_held(&engine->active.lock);
> + list_add(&rq->sched.dfs, &dfs);
> +
> + list_for_each_entry(rq, &dfs, sched.dfs) {
> + struct i915_dependency *p;
> +
> + GEM_BUG_ON(rq->engine != engine);
> +
> + for_each_signaler(p, rq) {
> + struct i915_request *s =
> + container_of(p->signaler, typeof(*s), sched);
> +
> + GEM_BUG_ON(s == rq);
> +
> + if (rq_deadline(s) <= deadline)
> + continue;
> +
> + if (i915_request_completed(s))
> + continue;
> +
> + if (s->engine != rq->engine) {
> + spin_lock(&ipi_lock);
> + if (deadline < p->ipi_deadline) {
> + p->ipi_deadline = deadline;
> + list_move(&p->ipi_link, &ipi_list);
> + irq_work_queue(&ipi_work);
> + }
> + spin_unlock(&ipi_lock);
> + continue;
> + }
> +
> + list_move_tail(&s->sched.dfs, &dfs);
> + }
> + }
> +
> + plist = i915_sched_lookup_priolist(engine, deadline);
> +
> + /* Fifo and depth-first replacement ensure our deps execute first */
> + list_for_each_entry_safe_reverse(rq, rn, &dfs, sched.dfs) {
> + GEM_BUG_ON(rq->engine != engine);
> + GEM_BUG_ON(deadline > rq_deadline(rq));
> +
> + INIT_LIST_HEAD(&rq->sched.dfs);
> + WRITE_ONCE(rq->sched.deadline, deadline);
Is an smp release barrier needed here, to pair with readers of
rq->sched.deadline?
...
> +static u64 prio_slice(int prio)
> {
> - const struct i915_request *inflight;
> + u64 slice;
> + int sf;
>
> /*
> - * We only need to kick the tasklet once for the high priority
> - * new context we add into the queue.
> + * With a 1ms scheduling quantum:
> + *
> + * MAX USER: ~32us deadline
> + * 0: ~16ms deadline
> + * MIN_USER: 1000ms deadline
> */
> - if (prio <= engine->execlists.queue_priority_hint)
> - return;
>
> - rcu_read_lock();
> + if (prio >= __I915_PRIORITY_KERNEL__)
> + return INT_MAX - prio;
>
> - /* Nothing currently active? We're overdue for a submission! */
> - inflight = execlists_active(&engine->execlists);
> - if (!inflight)
> - goto unlock;
> + slice = __I915_PRIORITY_KERNEL__ - prio;
> + if (prio >= 0)
> + sf = 20 - 6;
> + else
> + sf = 20 - 1;
> +
> + return slice << sf;
> +}
> +
Is this the same deadline calculation as used in BFS? Could you
perhaps add a pointer to some documentation?
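For what it's worth, the quoted ~32us / ~16ms / 1000ms figures do work out if the slice is in nanoseconds and __I915_PRIORITY_KERNEL__ sits just above the +/-1023 user priority range. The constant below is a guess to make the arithmetic checkable, not the real definition:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Assumed stand-in for __I915_PRIORITY_KERNEL__; the real value lives
 * elsewhere in the patch. 1025 makes the quoted figures come out with
 * a +/-1023 user priority range. */
#define SKETCH_KERNEL_PRIO 1025

static uint64_t sketch_prio_slice(int prio)
{
	uint64_t slice;
	int sf;

	if (prio >= SKETCH_KERNEL_PRIO)
		return INT_MAX - prio;

	slice = SKETCH_KERNEL_PRIO - prio;
	if (prio >= 0)
		sf = 20 - 6;	/* shift by 14: short slices for user+ */
	else
		sf = 20 - 1;	/* shift by 19: heavily stretched slices */

	return slice << sf;
}
```

With those assumptions, sketch_prio_slice(1023) = 32768 ns (~32us), sketch_prio_slice(0) = 16793600 ns (~16.8ms) and sketch_prio_slice(-1023) = 1073741824 ns (~1074ms), matching the comment in the patch.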
/Thomas