[Intel-gfx] [PATCH 26/28] drm/i915: Fair low-latency scheduling

Thomas Hellström (Intel) thomas_os at shipmail.org
Tue Jun 16 12:11:25 UTC 2020


Hi,

On 6/16/20 12:12 PM, Chris Wilson wrote:
> Quoting Thomas Hellström (Intel) (2020-06-16 10:07:28)
>> Hi, Chris,
>>
>> Some comments and questions:
>>
>> On 6/8/20 12:21 AM, Chris Wilson wrote:
>>> The first "scheduler" was a topographical sorting of requests into
>>> priority order. The execution order was deterministic, the earliest
>>> submitted, highest priority request would be executed first. Priority
>>> inheritance ensured that inversions were kept at bay, and allowed us to
>>> dynamically boost priorities (e.g. for interactive pageflips).
>>>
>>> The minimalistic timeslicing scheme was an attempt to introduce fairness
>>> between long running requests, by evicting the active request at the end
>>> of a timeslice and moving it to the back of its priority queue (while
>>> ensuring that dependencies were kept in order). For short running
>>> requests from many clients of equal priority, the scheme is still very
>>> much FIFO submission ordering, and as unfair as before.
>>>
>>> To impose fairness, we need an external metric that ensures that clients
>>> are interspersed, so that we don't execute one long chain from client A
>>> before executing any of client B. This could be imposed by the clients
>>> using fences based on an external clock; that is, they only submit work
>>> for a "frame" at frame-interval, instead of submitting as much work as
>>> they are able to. The standard SwapBuffers approach is akin to double
>>> buffering, where, as one frame is being executed, the next is being
>>> submitted, such that there is always a maximum of two frames per client
>>> in the pipeline. Even this scheme exhibits unfairness under load, as a
>>> single client will execute two frames back to back before the next, and
>>> with enough clients, deadlines will be missed.
>>>
>>> The idea introduced by BFS/MuQSS is that fairness is introduced by
>>> metering with an external clock. Every request, when it becomes ready to
>>> execute, is assigned a virtual deadline, and execution order is then
>>> determined by earliest deadline. Priority is used as a hint, rather than
>>> strict ordering, where high priority requests have earlier deadlines,
>>> but not necessarily earlier than outstanding work. Thus work is executed
>>> in order of 'readiness', with timeslicing to demote long running work.
>>>
>>> The Achilles' heel of this scheduler is its strong preference for
>>> low-latency and favouring of new queues. Whereas it was easy to dominate
>>> the old scheduler by flooding it with many requests over a short period
>>> of time, the new scheduler can be dominated by a 'synchronous' client
>>> that waits for each of its requests to complete before submitting the
>>> next. As such a client has no history, it is always considered
>>> ready-to-run and receives an earlier deadline than the long running
>>> requests.
>>>
>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>> ---
>>>    drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  12 +-
>>>    .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
>>>    drivers/gpu/drm/i915/gt/intel_engine_pm.c     |   4 +-
>>>    drivers/gpu/drm/i915/gt/intel_engine_types.h  |  24 --
>>>    drivers/gpu/drm/i915/gt/intel_lrc.c           | 328 +++++++-----------
>>>    drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   5 +-
>>>    drivers/gpu/drm/i915/gt/selftest_lrc.c        |  43 ++-
>>>    .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   6 +-
>>>    drivers/gpu/drm/i915/i915_priolist_types.h    |   7 +-
>>>    drivers/gpu/drm/i915/i915_request.h           |   4 +-
>>>    drivers/gpu/drm/i915/i915_scheduler.c         | 322 ++++++++++++-----
>>>    drivers/gpu/drm/i915/i915_scheduler.h         |  22 +-
>>>    drivers/gpu/drm/i915/i915_scheduler_types.h   |  17 +
>>>    .../drm/i915/selftests/i915_mock_selftests.h  |   1 +
>>>    drivers/gpu/drm/i915/selftests/i915_request.c |   1 +
>>>    .../gpu/drm/i915/selftests/i915_scheduler.c   |  49 +++
>>>    16 files changed, 484 insertions(+), 362 deletions(-)
>>>    create mode 100644 drivers/gpu/drm/i915/selftests/i915_scheduler.c
>> Do we have timings to back this change up? Would it make sense to have a
>> configurable scheduler choice?
> Yes, there's igt/benchmarks/gem_wsim to show the impact on scheduling
> decisions for various workloads. (You can guess what the impact of
> choosing a different execution order and forcing more context switches
> will be... About -1% to throughput with multiple clients) And
> igt/tests/gem_exec_schedule to test basic properties, with a bunch of new
> fairness tests to try and decide if this is the right thing. Under
> saturated conditions, there is no contest: a fair scheduler produces
> consistent results, and the vdeadlines allow for realtime response under
> load.

Yeah, it's not really to convince me, but to provide a reference for the 
future.

Perhaps add the gem_wsim timings to the commit message?

>
>> There are multiple introductions of ktime_get() in the patch. Perhaps
>> use monotonic clock source like ktime_get_raw()? Also immediately
>> convert to ns.
> ktime_get() is monotonic. The only difference is that tkr_mono has a
> wall-offset that tkr_raw does not. [I'm sure there's a good reason.] The
> choice is really whether ktime_get_(mono|raw)_fast_ns() is sufficient for
> our needs.

Hmm. Yes, you're right. I was thinking about the NTP adjustments. But 
given the requirement below, they might be unimportant.
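
For the record, what I had in mind with "immediately convert to ns" was
something along these lines. Purely an illustration: the struct and field
are made up, and which of the accessors is the right trade-off is exactly
the open question above.

/* Hypothetical example only; "submitted_ns" is not a field in the patch,
 * it just shows stamping a timestamp in nanoseconds at submission.
 */
#include <linux/timekeeping.h>

struct example_request {
        u64 submitted_ns;
};

static void example_mark_submitted(struct example_request *rq)
{
        /* CLOCK_MONOTONIC, includes NTP frequency corrections: */
        rq->submitted_ns = ktime_get_ns();

        /* CLOCK_MONOTONIC_RAW, free of NTP adjustments: */
        /* rq->submitted_ns = ktime_get_raw_ns(); */

        /* or the NMI-safe fast variants mentioned above: */
        /* rq->submitted_ns = ktime_get_mono_fast_ns(); */
        /* rq->submitted_ns = ktime_get_raw_fast_ns(); */
}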

>
> I do like the idea of having the deadline being some recognisable
> timestamp, as it makes it easier to play with mixing in real, albeit
> soft, deadlines.


>
>>> @@ -2837,10 +2788,7 @@ static void __execlists_unhold(struct i915_request *rq)
>>>                GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
>>>    
>>>                i915_request_clear_hold(rq);
>>> -             list_move_tail(&rq->sched.link,
>>> -                            i915_sched_lookup_priolist(rq->engine,
>>> -                                                       rq_prio(rq)));
>>> -             set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
>>> +             submit |= intel_engine_queue_request(rq->engine, rq);
>> As someone new to this codebase, I immediately wonder whether that bitwise
>> OR is intentional and whether you got the short-circuiting right. It looks
>> correct to me.
> bool submit, not many bits :)

Yes, the code is correct. My question was more about whether it is 
accepted practice, considering that a future reader may think it was a 
mistake and change it anyway...
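
To spell the concern out for anyone reading along later: |= on a bool
always evaluates the right-hand side, whereas || would short-circuit and
stop queueing further requests once submit became true. A contrived
userspace sketch of the pattern (names invented, nothing to do with the
actual i915 code):

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for intel_engine_queue_request(): pretend only request 2
 * wants to kick the tasklet.
 */
static bool queue_one(int i)
{
        printf("queued request %d\n", i);
        return i == 2;
}

int main(void)
{
        bool submit = false;
        int i;

        for (i = 0; i < 4; i++)
                submit |= queue_one(i); /* all four calls happen */

        /* "submit = submit || queue_one(i)" would stop calling
         * queue_one() once submit was true.
         */
        printf("submit=%d\n", submit);
        return 0;
}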

>
>>>    {
>>>        /*
>>>         * When in use during submission, we are protected by a guarantee that
>>> diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
>>> index 4c189b81cc62..30bcb6f9d99f 100644
>>> --- a/drivers/gpu/drm/i915/i915_scheduler.c
>>> +++ b/drivers/gpu/drm/i915/i915_scheduler.c
>>> @@ -20,6 +20,11 @@ static struct i915_global_scheduler {
>>>    static DEFINE_SPINLOCK(ipi_lock);
>>>    static LIST_HEAD(ipi_list);
>>>    
>>> +static inline u64 rq_deadline(const struct i915_request *rq)
>>> +{
>>> +     return READ_ONCE(rq->sched.deadline);
>>> +}
>>> +
>> Does this need a release barrier paired with an acquire barrier in
>> __i915_request_set_deadline below?
> No, the state can be inconsistent. If it changes as we are processing
> the previous value, there will be another reschedule. Within
> set_deadline, rq->sched.deadline is under the engine->active.lock; it is
> just that rq_deadline() is used to peek before we take the lock, as well
> as shorthand within the critical section.
>   

OK, understood.
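
Jotting down my understanding of the pattern for future reference, as a
simplified sketch with invented names (not the actual i915 code):

#include <linux/spinlock.h>
#include <linux/compiler.h>

struct sched_node {
        spinlock_t lock;
        u64 deadline;
};

static inline u64 peek_deadline(const struct sched_node *node)
{
        /* May race with a concurrent update; a stale value only means
         * we take the lock (or reschedule) once more than needed.
         */
        return READ_ONCE(node->deadline);
}

static void maybe_set_deadline(struct sched_node *node, u64 deadline)
{
        /* Cheap unlocked peek first ... */
        if (peek_deadline(node) <= deadline)
                return;

        /* ... then the authoritative check and update under the lock. */
        spin_lock(&node->lock);
        if (node->deadline > deadline)
                WRITE_ONCE(node->deadline, deadline);
        spin_unlock(&node->lock);
}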


>>> +static u64 prio_slice(int prio)
>>>    {
>>> -     const struct i915_request *inflight;
>>> +     u64 slice;
>>> +     int sf;
>>>    
>>>        /*
>>> -      * We only need to kick the tasklet once for the high priority
>>> -      * new context we add into the queue.
>>> +      * With a 1ms scheduling quantum:
>>> +      *
>>> +      *   MAX USER:  ~32us deadline
>>> +      *   0:         ~16ms deadline
>>> +      *   MIN_USER: 1000ms deadline
>>>         */
>>> -     if (prio <= engine->execlists.queue_priority_hint)
>>> -             return;
>>>    
>>> -     rcu_read_lock();
>>> +     if (prio >= __I915_PRIORITY_KERNEL__)
>>> +             return INT_MAX - prio;
>>>    
>>> -     /* Nothing currently active? We're overdue for a submission! */
>>> -     inflight = execlists_active(&engine->execlists);
>>> -     if (!inflight)
>>> -             goto unlock;
>>> +     slice = __I915_PRIORITY_KERNEL__ - prio;
>>> +     if (prio >= 0)
>>> +             sf = 20 - 6;
>>> +     else
>>> +             sf = 20 - 1;
>>> +
>>> +     return slice << sf;
>>> +}
>>> +
>> Is this the same deadline calculation as used in the BFS? Could you
>> perhaps add a pointer to some documentation?
> It is a heuristic. The scale factor in BFS is designed for a smaller
> range and is not effective for passing our existing priority ordering
> tests.
>
> The challenge is to pick something that is fair and roughly matches
> usage. It basically says that if client A submits 3 requests, then
> clients B and C will be able to run before the later requests of client A
> so long as they are submitted within 16ms. Currently we get AAABC;
> the vdeadlines turn that into ABCAA. So we would ideally like the quota
> for each client to reflect their needs, so if client A needed all 3
> requests within 16ms, it would have a vdeadline closer to 5ms (and so it
> would compete for the GPU against other clients). Now with this less
> strict priority system we can let normal userspace bump their
> priorities, or we can use the average context runtime to try and adjust
> priorities on the fly (i.e. do not use an unbiased quota). But I suspect
> removing any fairness will skew the scheduler once more.

OK, also for future reference, it would be good to have at least a 
subset of this documented somewhere!
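
For whoever ends up writing it, here is my reading of prio_slice() as a
small standalone program. The priority-origin constant is an assumption
on my part (the real __I915_PRIORITY_KERNEL__ comes from the priority
headers), so treat the absolute numbers as approximate and only the
shape of the curve as meaningful:

#include <stdio.h>
#include <stdint.h>
#include <limits.h>

#define PRIORITY_KERNEL 1024    /* assumed stand-in for __I915_PRIORITY_KERNEL__ */

static uint64_t prio_slice(int prio)
{
        uint64_t slice;
        int sf;

        if (prio >= PRIORITY_KERNEL)
                return INT_MAX - prio;

        slice = PRIORITY_KERNEL - prio;
        sf = prio >= 0 ? 20 - 6 : 20 - 1;

        return slice << sf;     /* nanoseconds */
}

int main(void)
{
        const int prios[] = { 1023, 512, 0, -512, -1023 };
        unsigned int i;

        for (i = 0; i < sizeof(prios) / sizeof(prios[0]); i++)
                printf("prio %5d -> slice %12llu ns (%8.3f ms)\n",
                       prios[i],
                       (unsigned long long)prio_slice(prios[i]),
                       prio_slice(prios[i]) / 1e6);

        return 0;
}

If I read the heuristic right, that spread is also why the AAABC example
above turns into ABCAA: B's and C's first requests receive deadlines no
later than A's second.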

Thanks,

Thomas



> -Chris

