[Intel-gfx] [PATCH 26/28] drm/i915: Fair low-latency scheduling
Chris Wilson
chris at chris-wilson.co.uk
Tue Jun 16 10:54:58 UTC 2020
Quoting Thomas Hellström (Intel) (2020-06-16 10:07:28)
> Hi, Chris,
>
> Some comments and questions:
>
> On 6/8/20 12:21 AM, Chris Wilson wrote:
> > The first "scheduler" was a topographical sorting of requests into
> > priority order. The execution order was deterministic, the earliest
> > submitted, highest priority request would be executed first. Priority
> > inherited ensured that inversions were kept at bay, and allowed us to
> > dynamically boost priorities (e.g. for interactive pageflips).
> >
> > The minimalistic timeslicing scheme was an attempt to introduce
> > fairness between long-running requests, by evicting the active
> > request at the end of a timeslice and moving it to the back of its
> > priority queue (while ensuring that dependencies were kept in order).
> > For short-running requests from many clients of equal priority, the
> > scheme is still very much FIFO submission ordering, and as unfair as
> > before.
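
The demotion step can be sketched in a few lines of standalone C. This
is only a toy model; the request structure and queue helpers below are
invented for illustration and are not the i915 implementation:

/* Toy model: a single priority level's FIFO with timeslice demotion. */
#include <stdio.h>

struct request {
        int id;
        int remaining;          /* work left, measured in timeslices */
        struct request *next;
};

static struct request *head, **tail = &head;

static void enqueue(struct request *rq)
{
        rq->next = NULL;
        *tail = rq;
        tail = &rq->next;
}

static struct request *dequeue(void)
{
        struct request *rq = head;

        if (rq) {
                head = rq->next;
                if (!head)
                        tail = &head;
        }
        return rq;
}

int main(void)
{
        struct request rqs[] = {
                { .id = 0, .remaining = 3 },
                { .id = 1, .remaining = 1 },
                { .id = 2, .remaining = 2 },
        };
        struct request *rq;
        int i;

        for (i = 0; i < 3; i++)
                enqueue(&rqs[i]);

        /* Each tick, run the head for one timeslice; if it still has
         * work left, demote it to the back of its queue. */
        while ((rq = dequeue())) {
                printf("run request %d\n", rq->id);
                if (--rq->remaining)
                        enqueue(rq);
        }
        return 0;
}

Run it and the three "clients" interleave round-robin (0 1 2 0 2 0)
rather than request 0 draining completely before the others start.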
> >
> > To impose fairness, we need an external metric that ensures that
> > clients are interspersed, so that we do not execute one long chain
> > from client A before executing any of client B. This could be imposed
> > by the clients using fences based on an external clock; that is, they
> > only submit work for a "frame" at the frame interval, instead of
> > submitting as much work as they are able to. The standard SwapBuffers
> > approach is akin to double buffering, where, as one frame is being
> > executed, the next is being submitted, such that there is always a
> > maximum of two frames per client in the pipeline. Even this scheme
> > exhibits unfairness under load, as a single client will execute two
> > frames back to back before the next, and with enough clients,
> > deadlines will be missed.
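
As an illustration of that double-buffer throttle, a client limiting
itself to two frames in flight might look like the sketch below.
submit_frame() and wait_fence() are hypothetical stand-ins for the real
fence APIs, not functions from this patch:

/* Toy model: never keep more than two frames in the pipeline. */
#include <stdio.h>

#define MAX_IN_FLIGHT 2

static int submit_frame(int frame)
{
        printf("submit frame %d\n", frame);
        return frame;   /* pretend the fence is just the frame number */
}

static void wait_fence(int fence)
{
        printf("wait for frame %d to complete\n", fence);
}

int main(void)
{
        int fence[MAX_IN_FLIGHT];
        int frame;

        for (frame = 0; frame < 6; frame++) {
                /* Before submitting frame N, wait for frame N-2, so at
                 * most two frames per client are ever queued. */
                if (frame >= MAX_IN_FLIGHT)
                        wait_fence(fence[frame % MAX_IN_FLIGHT]);
                fence[frame % MAX_IN_FLIGHT] = submit_frame(frame);
        }
        return 0;
}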
> >
> > The idea introduced by BFS/MuQSS is that fairness is introduced by
> > metering with an external clock. Every request, when it becomes ready
> > to execute, is assigned a virtual deadline, and execution order is
> > then determined by the earliest deadline. Priority is used as a hint,
> > rather than strict ordering: high priority requests have earlier
> > deadlines, but not necessarily earlier than outstanding work. Thus
> > work is executed in order of 'readiness', with timeslicing to demote
> > long-running work.
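
A toy model of the deadline idea, assuming a made-up priority-to-budget
mapping (prio_to_budget() is invented for the example and is not the
mapping used by the patch):

/* Toy model: earliest virtual deadline first, priority as a hint. */
#include <stdio.h>

struct request {
        int id;
        int prio;
        long deadline;
        int done;
};

/* Higher priority means a smaller latency budget, and so an earlier
 * deadline, but not necessarily earlier than outstanding work. */
static long prio_to_budget(int prio)
{
        return 1000 >> prio;    /* hypothetical mapping */
}

static void mark_ready(struct request *rq, long now)
{
        rq->deadline = now + prio_to_budget(rq->prio);
}

static struct request *pick_edf(struct request *rqs, int n)
{
        struct request *best = NULL;
        int i;

        for (i = 0; i < n; i++) {
                if (rqs[i].done)
                        continue;
                if (!best || rqs[i].deadline < best->deadline)
                        best = &rqs[i];
        }
        return best;
}

int main(void)
{
        struct request rqs[] = {
                { .id = 0, .prio = 0 }, /* old, low priority */
                { .id = 1, .prio = 2 }, /* new, high priority */
        };
        struct request *rq;

        mark_ready(&rqs[0], 0);         /* deadline 0 + 1000 = 1000 */
        mark_ready(&rqs[1], 900);       /* deadline 900 + 250 = 1150 */

        while ((rq = pick_edf(rqs, 2))) {
                printf("execute request %d (deadline %ld)\n",
                       rq->id, rq->deadline);
                rq->done = 1;
        }
        return 0;
}

Here the old low-priority request still runs first: its earlier
deadline beats the newer high-priority one, which is the "hint, not
strict ordering" behaviour described above.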
> >
> > The Achilles' heel of this scheduler is its strong preference for
> > low-latency and favouring of new queues. Whereas it was easy to dominate
> > the old scheduler by flooding it with many requests over a short period
> > of time, the new scheduler can be dominated by a 'synchronous' client
> > that waits for each of its requests to complete before submitting the
> > next. As such a client has no history, it is always considered
> > ready-to-run and receives an earlier deadline than the long running
> > requests.
> >
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > ---
> > drivers/gpu/drm/i915/gt/intel_engine_cs.c | 12 +-
> > .../gpu/drm/i915/gt/intel_engine_heartbeat.c | 1 +
> > drivers/gpu/drm/i915/gt/intel_engine_pm.c | 4 +-
> > drivers/gpu/drm/i915/gt/intel_engine_types.h | 24 --
> > drivers/gpu/drm/i915/gt/intel_lrc.c | 328 +++++++-----------
> > drivers/gpu/drm/i915/gt/selftest_hangcheck.c | 5 +-
> > drivers/gpu/drm/i915/gt/selftest_lrc.c | 43 ++-
> > .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 6 +-
> > drivers/gpu/drm/i915/i915_priolist_types.h | 7 +-
> > drivers/gpu/drm/i915/i915_request.h | 4 +-
> > drivers/gpu/drm/i915/i915_scheduler.c | 322 ++++++++++++-----
> > drivers/gpu/drm/i915/i915_scheduler.h | 22 +-
> > drivers/gpu/drm/i915/i915_scheduler_types.h | 17 +
> > .../drm/i915/selftests/i915_mock_selftests.h | 1 +
> > drivers/gpu/drm/i915/selftests/i915_request.c | 1 +
> > .../gpu/drm/i915/selftests/i915_scheduler.c | 49 +++
> > 16 files changed, 484 insertions(+), 362 deletions(-)
> > create mode 100644 drivers/gpu/drm/i915/selftests/i915_scheduler.c
>
> Do we have timings to back this change up? Would it make sense to have a
> configurable scheduler choice?
gem_wsim workloads with different load balancers, varying the number of
clients; the figures are % variation from the previous patch.
+mB--------------------------------------------------------------------+
| a |
| cda |
| c.a |
| ..aa |
| ..---. |
| -.--+-. |
| .c.-.-+++. b |
| b bb.d-c-+--+++.aab aa b b |
|b b b b b. b ..---+++-+++++....a. b. b b b b b b|
| A| |
| |___AM____| |
| |A__| |
| |MA_| |
+----------------------------------------------------------------------+
Clients   N      Min      Max   Median          Avg       Stddev
      1  63    -8.2      5.4   -0.045     -0.02375   0.094722134
      2  63   -15.96    19.28  -0.64      -1.05        2.2428076
      4  63    -5.11     2.95  -1.15      -1.0683333   0.72382651
      8  63    -5.63     1.85  -0.905     -0.87122449  0.73390971
The wildest swings there appear to be a result of interrupt latency;
the roughly -1% impact comes from the changed execution order and the
extra context switching.
-Chris