[Intel-gfx] [PATCH 10/13] drm/i915: Load balancing across a virtual engine

Chris Wilson chris at chris-wilson.co.uk
Mon Mar 11 13:43:21 UTC 2019


Quoting Tvrtko Ursulin (2019-03-11 12:47:30)
> 
> On 08/03/2019 14:12, Chris Wilson wrote:
> > Having allowed the user to define the set of engines that they want
> > to use, we go one step further and allow them to bind those engines
> > into a single virtual instance. Submitting a batch to the virtual engine
> > will then forward it to any one of the set in a manner that best
> > distributes the load.  The virtual engine has a single timeline across all
> > engines (it operates as a single queue), so it is not able to concurrently
> > run batches across multiple engines by itself; that is left up to the user
> > to submit multiple concurrent batches to multiple queues. Multiple users
> > will be load balanced across the system.
> > 
> > The mechanism used for load balancing in this patch is a late greedy
> > balancer. When a request is ready for execution, it is added to each
> > engine's queue, and when an engine is ready for its next request it
> > claims it from the virtual engine. The first engine to do so, wins, i.e.
> > the request is executed at the earliest opportunity (idle moment) in the
> > system.
> > 
> > As not all HW is created equal, the user is still able to skip the
> > virtual engine and execute the batch on a specific engine, all within the
> > same queue. It will then be executed in order on the correct engine,
> > with execution on other virtual engines being moved away due to the load
> > detection.
> > 
> > A couple of areas for potential improvement left!
> > 
> > - The virtual engine always takes priority over equal-priority tasks.
> > This is mostly broken up by applying FQ_CODEL rules for prioritising
> > new clients, and hopefully the virtual and real engines are then not
> > congested (i.e. all work is via virtual engines, or all work is to the
> > real engine).
> > 
> > - We require the breadcrumb irq around every virtual engine request. For
> > normal engines, we eliminate the need for the slow round trip via
> > interrupt by using the submit fence and queueing in order. For virtual
> > engines, we have to allow any job to transfer to a new ring, and cannot
> > coalesce the submissions, so require the completion fence instead,
> > forcing the persistent use of interrupts.
> > 
> > - We only drip feed single requests through each virtual engine and onto
> > the physical engines, even if there was enough work to fill all ELSP,
> > leaving small stalls with an idle CS event at the end of every request.
> > Could we be greedy and fill both slots? Being lazy is virtuous for load
> > distribution on less-than-full workloads though.
> > 
> > Other areas of improvement are more general, such as reducing lock
> > contention, reducing dispatch overhead, looking at direct submission
> > rather than bouncing around tasklets etc.
> > 
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_gem.h            |   5 +
> >   drivers/gpu/drm/i915/i915_gem_context.c    | 153 +++++-
> >   drivers/gpu/drm/i915/i915_scheduler.c      |  17 +-
> >   drivers/gpu/drm/i915/i915_timeline_types.h |   1 +
> >   drivers/gpu/drm/i915/intel_engine_types.h  |   8 +
> >   drivers/gpu/drm/i915/intel_lrc.c           | 521 ++++++++++++++++++++-
> >   drivers/gpu/drm/i915/intel_lrc.h           |  11 +
> >   drivers/gpu/drm/i915/selftests/intel_lrc.c | 165 +++++++
> >   include/uapi/drm/i915_drm.h                |  30 ++
> >   9 files changed, 895 insertions(+), 16 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
> > index 74a2ddc1b52f..dbcea6e29d48 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.h
> > +++ b/drivers/gpu/drm/i915/i915_gem.h
> > @@ -91,4 +91,9 @@ static inline bool __tasklet_is_enabled(const struct tasklet_struct *t)
> >       return !atomic_read(&t->count);
> >   }
> >   
> > +static inline bool __tasklet_is_scheduled(struct tasklet_struct *t)
> > +{
> > +     return test_bit(TASKLET_STATE_SCHED, &t->state);
> > +}
> > +
> >   #endif /* __I915_GEM_H__ */
> > diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> > index a581c01ffff1..13b79980f7f3 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> > @@ -86,12 +86,16 @@
> >    */
> >   
> >   #include <linux/log2.h>
> > +#include <linux/nospec.h>
> > +
> >   #include <drm/i915_drm.h>
> > +
> >   #include "i915_drv.h"
> >   #include "i915_globals.h"
> >   #include "i915_trace.h"
> >   #include "i915_user_extensions.h"
> >   #include "intel_lrc_reg.h"
> > +#include "intel_lrc.h"
> >   #include "intel_workarounds.h"
> >   
> >   #define ALL_L3_SLICES(dev) (1 << NUM_L3_SLICES(dev)) - 1
> > @@ -238,6 +242,20 @@ static void release_hw_id(struct i915_gem_context *ctx)
> >       mutex_unlock(&i915->contexts.mutex);
> >   }
> >   
> > +static void free_engines(struct intel_engine_cs **engines, int count)
> > +{
> > +     int i;
> > +
> > +     if (!engines)
> > +             return;
> > +
> > +     /* We own the veng we created; regular engines are ignored */
> > +     for (i = 0; i < count; i++)
> > +             intel_virtual_engine_destroy(engines[i]);
> > +
> > +     kfree(engines);
> > +}
> > +
> >   static void i915_gem_context_free(struct i915_gem_context *ctx)
> >   {
> >       struct intel_context *it, *n;
> > @@ -248,8 +266,7 @@ static void i915_gem_context_free(struct i915_gem_context *ctx)
> >   
> >       release_hw_id(ctx);
> >       i915_ppgtt_put(ctx->ppgtt);
> > -
> > -     kfree(ctx->engines);
> > +     free_engines(ctx->engines, ctx->nengine);
> >   
> >       rbtree_postorder_for_each_entry_safe(it, n, &ctx->hw_contexts, node)
> >               it->ops->destroy(it);
> > @@ -1359,13 +1376,116 @@ static int set_sseu(struct i915_gem_context *ctx,
> >       return 0;
> >   };
> >   
> > +static int check_user_mbz16(u16 __user *user)
> > +{
> > +     u16 mbz;
> > +
> > +     if (get_user(mbz, user))
> > +             return -EFAULT;
> > +
> > +     return mbz ? -EINVAL : 0;
> > +}
> > +
> > +static int check_user_mbz32(u32 __user *user)
> > +{
> > +     u32 mbz;
> > +
> > +     if (get_user(mbz, user))
> > +             return -EFAULT;
> > +
> > +     return mbz ? -EINVAL : 0;
> > +}
> > +
> > +static int check_user_mbz64(u64 __user *user)
> > +{
> > +     u64 mbz;
> > +
> > +     if (get_user(mbz, user))
> > +             return -EFAULT;
> > +
> > +     return mbz ? -EINVAL : 0;
> > +}
> 
> Could generate the three with a macro but it would be marginal 
> improvement if any.
> 
> > +
> >   struct set_engines {
> >       struct i915_gem_context *ctx;
> >       struct intel_engine_cs **engines;
> >       unsigned int nengine;
> >   };
> >   
> > +static int
> > +set_engines__load_balance(struct i915_user_extension __user *base, void *data)
> > +{
> > +     struct i915_context_engines_load_balance __user *ext =
> > +             container_of_user(base, typeof(*ext), base);
> > +     const struct set_engines *set = data;
> > +     struct intel_engine_cs *ve;
> > +     unsigned int n;
> > +     u64 mask;
> > +     u16 idx;
> > +     int err;
> > +
> > +     if (!HAS_EXECLISTS(set->ctx->i915))
> > +             return -ENODEV;
> > +
> > +     if (USES_GUC_SUBMISSION(set->ctx->i915))
> > +             return -ENODEV; /* not implemented yet */
> > +
> > +     if (get_user(idx, &ext->engine_index))
> > +             return -EFAULT;
> > +
> > +     if (idx >= set->nengine)
> > +             return -EINVAL;
> > +
> > +     idx = array_index_nospec(idx, set->nengine);
> > +     if (set->engines[idx])
> > +             return -EEXIST;
> > +
> > +     err = check_user_mbz16(&ext->mbz16);
> > +     if (err)
> > +             return err;
> > +
> > +     err = check_user_mbz32(&ext->flags);
> > +     if (err)
> > +             return err;
> > +
> > +     for (n = 0; n < ARRAY_SIZE(ext->mbz64); n++) {
> > +             err = check_user_mbz64(&ext->mbz64[n]);
> > +             if (err)
> > +                     return err;
> > +     }
> > +
> > +     if (get_user(mask, &ext->engines_mask))
> > +             return -EFAULT;
> > +
> > +     mask &= GENMASK_ULL(set->nengine - 1, 0) & ~BIT_ULL(idx);
> > +     if (!mask)
> > +             return -EINVAL;
> > +
> > +     if (is_power_of_2(mask)) {
> > +             ve = set->engines[__ffs64(mask)];
> > +     } else {
> > +             struct intel_engine_cs *stack[64];
> > +             int bit;
> > +
> > +             n = 0;
> > +             for_each_set_bit(bit, (unsigned long *)&mask, set->nengine)
> > +                     stack[n++] = set->engines[bit];
> > +
> > +             ve = intel_execlists_create_virtual(set->ctx, stack, n);
> > +     }
> > +     if (IS_ERR(ve))
> > +             return PTR_ERR(ve);
> > +
> > +     if (cmpxchg(&set->engines[idx], NULL, ve)) {
> > +             intel_virtual_engine_destroy(ve);
> > +             return -EEXIST;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> >   static const i915_user_extension_fn set_engines__extensions[] = {
> > +     [I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE] = set_engines__load_balance,
> >   };
> >   
> >   static int
> > @@ -1426,13 +1546,13 @@ set_engines(struct i915_gem_context *ctx,
> >                                          ARRAY_SIZE(set_engines__extensions),
> >                                          &set);
> >       if (err) {
> > -             kfree(set.engines);
> > +             free_engines(set.engines, set.nengine);
> >               return err;
> >       }
> >   
> >   out:
> >       mutex_lock(&ctx->i915->drm.struct_mutex);
> > -     kfree(ctx->engines);
> > +     free_engines(ctx->engines, ctx->nengine);
> >       ctx->engines = set.engines;
> >       ctx->nengine = set.nengine;
> >       mutex_unlock(&ctx->i915->drm.struct_mutex);
> > @@ -1637,6 +1757,7 @@ static int clone_engines(struct i915_gem_context *dst,
> >                        struct i915_gem_context *src)
> >   {
> >       struct intel_engine_cs **engines;
> > +     int i;
> >   
> >       engines = kmemdup(src->engines,
> >                         sizeof(*src->engines) * src->nengine,
> > @@ -1644,6 +1765,30 @@ static int clone_engines(struct i915_gem_context *dst,
> >       if (!engines)
> >               return -ENOMEM;
> >   
> > +     /*
> > +      * Virtual engines are singletons; they can only exist
> > +      * inside a single context, because they embed their
> > +      * HW context... As each virtual context implies a single
> > +      * timeline (each engine can only dequeue a single request
> > +      * at any time), it would be surprising for two contexts
> > +      * to use the same engine. So let's create a copy of
> > +      * the virtual engine instead.
> > +      */
> > +     for (i = 0; i < src->nengine; i++) {
> > +             struct intel_engine_cs *engine = engines[i];
> > +
> > +             if (!intel_engine_is_virtual(engine))
> > +                     continue;
> > +
> > +             engine = intel_execlists_clone_virtual(dst, engine);
> > +             if (IS_ERR(engine)) {
> > +                     free_engines(engines, i);
> > +                     return PTR_ERR(engine);
> > +             }
> > +
> > +             engines[i] = engine;
> > +     }
> > +
> >       dst->engines = engines;
> >       dst->nengine = src->nengine;
> >       return 0;
> > diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
> > index e0f609d01564..bb9819dbe313 100644
> > --- a/drivers/gpu/drm/i915/i915_scheduler.c
> > +++ b/drivers/gpu/drm/i915/i915_scheduler.c
> > @@ -247,17 +247,25 @@ sched_lock_engine(const struct i915_sched_node *node,
> >                 struct intel_engine_cs *locked,
> >                 struct sched_cache *cache)
> >   {
> > -     struct intel_engine_cs *engine = node_to_request(node)->engine;
> > +     const struct i915_request *rq = node_to_request(node);
> > +     struct intel_engine_cs *engine;
> >   
> >       GEM_BUG_ON(!locked);
> >   
> > -     if (engine != locked) {
> > +     /*
> > +      * Virtual engines complicate acquiring the engine timeline lock,
> > +      * as their rq->engine pointer is not stable until under that
> > +      * engine lock. The simple ploy we use is to take the lock then
> > +      * check that the rq still belongs to the newly locked engine.
> > +      */
> > +     while (locked != (engine = READ_ONCE(rq->engine))) {
> >               spin_unlock(&locked->timeline.lock);
> >               memset(cache, 0, sizeof(*cache));
> >               spin_lock(&engine->timeline.lock);
> > +             locked = engine;
> >       }
> >   
> > -     return engine;
> > +     return locked;
> 
> engine == locked at this point, right?

Yess.
> 
> >   }
> >   
> >   static bool inflight(const struct i915_request *rq,
> > @@ -370,8 +378,11 @@ static void __i915_schedule(struct i915_request *rq,
> >               if (prio <= node->attr.priority || node_signaled(node))
> >                       continue;
> >   
> > +             GEM_BUG_ON(node_to_request(node)->engine != engine);
> > +
> >               node->attr.priority = prio;
> >               if (!list_empty(&node->link)) {
> > +                     GEM_BUG_ON(intel_engine_is_virtual(engine));
> >                       if (!cache.priolist)
> >                               cache.priolist =
> >                                       i915_sched_lookup_priolist(engine,
> > diff --git a/drivers/gpu/drm/i915/i915_timeline_types.h b/drivers/gpu/drm/i915/i915_timeline_types.h
> > index 8ff146dc05ba..5e445f145eb1 100644
> > --- a/drivers/gpu/drm/i915/i915_timeline_types.h
> > +++ b/drivers/gpu/drm/i915/i915_timeline_types.h
> > @@ -25,6 +25,7 @@ struct i915_timeline {
> >       spinlock_t lock;
> >   #define TIMELINE_CLIENT 0 /* default subclass */
> >   #define TIMELINE_ENGINE 1
> > +#define TIMELINE_VIRTUAL 2
> >       struct mutex mutex; /* protects the flow of requests */
> >   
> >       unsigned int pin_count;
> > diff --git a/drivers/gpu/drm/i915/intel_engine_types.h b/drivers/gpu/drm/i915/intel_engine_types.h
> > index b0aa1f0d4e47..d54d2a1840cc 100644
> > --- a/drivers/gpu/drm/i915/intel_engine_types.h
> > +++ b/drivers/gpu/drm/i915/intel_engine_types.h
> > @@ -216,6 +216,7 @@ struct intel_engine_execlists {
> >        * @queue: queue of requests, in priority lists
> >        */
> >       struct rb_root_cached queue;
> > +     struct rb_root_cached virtual;
> >   
> >       /**
> >        * @csb_write: control register for Context Switch buffer
> > @@ -421,6 +422,7 @@ struct intel_engine_cs {
> >   #define I915_ENGINE_SUPPORTS_STATS   BIT(1)
> >   #define I915_ENGINE_HAS_PREEMPTION   BIT(2)
> >   #define I915_ENGINE_HAS_SEMAPHORES   BIT(3)
> > +#define I915_ENGINE_IS_VIRTUAL       BIT(4)
> >       unsigned int flags;
> >   
> >       /*
> > @@ -504,6 +506,12 @@ intel_engine_has_semaphores(const struct intel_engine_cs *engine)
> >       return engine->flags & I915_ENGINE_HAS_SEMAPHORES;
> >   }
> >   
> > +static inline bool
> > +intel_engine_is_virtual(const struct intel_engine_cs *engine)
> > +{
> > +     return engine->flags & I915_ENGINE_IS_VIRTUAL;
> > +}
> > +
> >   #define instdone_slice_mask(dev_priv__) \
> >       (IS_GEN(dev_priv__, 7) ? \
> >        1 : RUNTIME_INFO(dev_priv__)->sseu.slice_mask)
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > index 7b938eaff9c5..0c97e8f30223 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -166,6 +166,28 @@
> >   
> >   #define ACTIVE_PRIORITY (I915_PRIORITY_NEWCLIENT | I915_PRIORITY_NOSEMAPHORE)
> >   
> > +struct virtual_engine {
> > +     struct intel_engine_cs base;
> > +
> > +     struct intel_context context;
> > +     struct kref kref;
> > +     struct rcu_head rcu;
> > +
> > +     struct i915_request *request;
> > +     struct ve_node {
> > +             struct rb_node rb;
> > +             int prio;
> > +     } nodes[I915_NUM_ENGINES];
> 
> Please comment the fields, at least in the above block.

The request, and elements of someone else's priority tree? They really are
not that interesting.

> 
> > +
> > +     unsigned int count;
> > +     struct intel_engine_cs *siblings[0];
> > +};
> > +
> > +static struct virtual_engine *to_virtual_engine(struct intel_engine_cs *engine)
> > +{
> > +     return container_of(engine, struct virtual_engine, base);
> > +}
> > +
> >   static int execlists_context_deferred_alloc(struct intel_context *ce,
> >                                           struct intel_engine_cs *engine);
> >   static void execlists_init_reg_state(u32 *reg_state,
> > @@ -235,7 +257,8 @@ static int queue_prio(const struct intel_engine_execlists *execlists)
> >   }
> >   
> >   static inline bool need_preempt(const struct intel_engine_cs *engine,
> > -                             const struct i915_request *rq)
> > +                             const struct i915_request *rq,
> > +                             struct rb_node *rb)
> >   {
> >       int last_prio;
> >   
> > @@ -270,6 +293,22 @@ static inline bool need_preempt(const struct intel_engine_cs *engine,
> >           rq_prio(list_next_entry(rq, link)) > last_prio)
> >               return true;
> >   
> > +     if (rb) { /* XXX virtual precedence */
> > +             struct virtual_engine *ve =
> > +                     rb_entry(rb, typeof(*ve), nodes[engine->id].rb);
> > +             bool preempt = false;
> > +
> > +             if (engine == ve->siblings[0]) { /* only preempt one sibling */
> 
> Why always siblings[0] ?

It's the one most likely to be currently active; why do a preempt-to-idle
cycle on them all? We're explicitly making the engine idle in order to put
our request in the queue, so doing that everywhere just feels like a waste
-- the marginal risk that sibling[0] has more important work than the veng,
and that we thereby miss the opportunity to preempt sibling[1], meh.

Now, with preempt-to-busy, we may favour the alternate strategy, since
we will consume the preemption before our other siblings can evaluate
this request.

> > @@ -396,14 +437,23 @@ __unwind_incomplete_requests(struct intel_engine_cs *engine)
> >   
> >               GEM_BUG_ON(rq->hw_context->active);
> >   
> > -             GEM_BUG_ON(rq_prio(rq) == I915_PRIORITY_INVALID);
> > -             if (rq_prio(rq) != prio) {
> > -                     prio = rq_prio(rq);
> > -                     pl = i915_sched_lookup_priolist(engine, prio);
> > -             }
> > -             GEM_BUG_ON(RB_EMPTY_ROOT(&engine->execlists.queue.rb_root));
> > +             owner = rq->hw_context->engine;
> > +             if (likely(owner == engine)) {
> > +                     GEM_BUG_ON(rq_prio(rq) == I915_PRIORITY_INVALID);
> > +                     if (rq_prio(rq) != prio) {
> > +                             prio = rq_prio(rq);
> > +                             pl = i915_sched_lookup_priolist(engine, prio);
> > +                     }
> > +                     GEM_BUG_ON(RB_EMPTY_ROOT(&engine->execlists.queue.rb_root));
> > +
> > +                     list_add(&rq->sched.link, pl);
> > +             } else {
> > +                     if (__i915_request_has_started(rq))
> > +                             rq->sched.attr.priority |= ACTIVE_PRIORITY;
> >   
> > -             list_add(&rq->sched.link, pl);
> > +                     rq->engine = owner;
> > +                     owner->submit_request(rq);
> > +             }
> 
> What's happening here - put some comment in please.

We're unwinding the incomplete request. What's confusing? Previously we
open-coded the resubmit, but obviously we can't do that across a veng if
we want to be able to support switching between siblings.

> > +     for (rb = rb_first_cached(&execlists->virtual); rb; ) {
> > +             struct virtual_engine *ve =
> > +                     rb_entry(rb, typeof(*ve), nodes[engine->id].rb);
> > +             struct i915_request *rq = READ_ONCE(ve->request);
> > +             struct intel_engine_cs *active;
> > +
> > +             if (!rq) { /* already taken by another sibling */
> > +                     rb_erase_cached(rb, &execlists->virtual);
> > +                     RB_CLEAR_NODE(rb);
> > +                     rb = rb_first_cached(&execlists->virtual);
> > +                     continue;
> > +             }
> 
> Probably a good place to comment how each physical engine sees all veng 
> requests and needs to unlink if someone else dequeued.
> 
> This relies on setting ve->request to NULL propagating to all CPUs as 
> soon as cleared I think. Becuase all tasklets will be under different 
> engine->timeline.lock, and the clear is under the VE->timeline.lock but 
> this isn't.
> 
> If one CPU does not see the clear, it will skip removing the entry from 
> the rbtree. But then it will take the VE->timeline.lock a bit down and 
> fix up. Okay I think.

Exactly. Optimistic reads under READ_ONCE, and we restart if the locked
confirmation fails.
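Distilled, the two passes look like this (a condensed sketch of the hunk
above plus the locked check further down, not standalone code):

	/* First pass: unlocked peek; a stale non-NULL is harmless. */
	if (!READ_ONCE(ve->request)) {
		/* Already claimed by a sibling; prune our rbtree node. */
		rb_erase_cached(rb, &execlists->virtual);
		RB_CLEAR_NODE(rb);
		continue;
	}

	/* Second pass: confirm under the veng timeline lock. */
	spin_lock(&ve->base.timeline.lock);
	rq = ve->request;
	if (unlikely(!rq)) {
		/* Lost the race after all; unlink and move on. */
		spin_unlock(&ve->base.timeline.lock);
		rb_erase_cached(rb, &execlists->virtual);
		RB_CLEAR_NODE(rb);
		continue;
	}

	/* Still ours; the claim proceeds under the lock. */
	ve->request = NULL;
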

> > +
> > +             active = READ_ONCE(ve->context.active);
> > +             if (active && active != engine) {
> > +                     rb = rb_next(rb);
> > +                     continue;
> > +             }
> 
> What's happening here (a comment please)?

This needs a comment? I thought this would be clear; if the context is
currently still resident on another engine, we can't modify it as it will
be overwritten by that engine's context save.
> 
> > +
> > +             break;
> > +     }
> > +
> >       if (last) {
> >               /*
> >                * Don't resubmit or switch until all outstanding
> > @@ -718,7 +834,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
> >               if (!execlists_is_active(execlists, EXECLISTS_ACTIVE_HWACK))
> >                       return;
> >   
> > -             if (need_preempt(engine, last)) {
> > +             if (need_preempt(engine, last, rb)) {
> >                       inject_preempt_context(engine);
> >                       return;
> >               }
> > @@ -758,6 +874,72 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
> >               last->tail = last->wa_tail;
> >       }
> >   
> > +     while (rb) { /* XXX virtual is always taking precedence */
> > +             struct virtual_engine *ve =
> > +                     rb_entry(rb, typeof(*ve), nodes[engine->id].rb);
> > +             struct i915_request *rq;
> > +
> > +             spin_lock(&ve->base.timeline.lock);
> 
> This is under the physical engine timeline lock, so isn't nested 
> annotation needed?

That's why the ve and the physical engine timelines have different lock
subclasses.
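To spell that out (an illustrative sketch, not necessarily the exact line
in the patch): the veng timeline lock is tagged with its own lockdep
subclass at init, so taking it while already holding a physical engine's
TIMELINE_ENGINE lock is not reported as recursive locking:

	/*
	 * Sketch: give the virtual timeline lock the TIMELINE_VIRTUAL
	 * subclass so lockdep can distinguish it from the physical
	 * engine's TIMELINE_ENGINE lock when the two nest inside
	 * execlists_dequeue().
	 */
	lockdep_set_subclass(&ve->base.timeline.lock, TIMELINE_VIRTUAL);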

> > +
> > +             rq = ve->request;
> > +             if (unlikely(!rq)) { /* lost the race to a sibling */
> > +                     spin_unlock(&ve->base.timeline.lock);
> > +                     rb_erase_cached(rb, &execlists->virtual);
> > +                     RB_CLEAR_NODE(rb);
> > +                     rb = rb_first_cached(&execlists->virtual);
> > +                     continue;
> > +             }
> > +
> > +             if (rq_prio(rq) >= queue_prio(execlists)) {
> > +                     if (last && !can_merge_rq(last, rq)) {
> > +                             spin_unlock(&ve->base.timeline.lock);
> > +                             return; /* leave this rq for another engine */
> > +                     }
> > +
> > +                     GEM_BUG_ON(rq->engine != &ve->base);
> > +                     ve->request = NULL;
> > +                     ve->base.execlists.queue_priority_hint = INT_MIN;
> 
> Why set to INT_MIN? Can't there be queued requests after this one?

No. We could reduce the overhead of the veng by allowing multiple requests
in the pipeline, but that comes at the cost of poorer load spreading and
poorer utilisation with multiple clients. So we stick with the late greedy
scheme, and only decide where to put each request at the point it is ready
to run.

> > +             if (!RB_EMPTY_NODE(node))
> > +                     rb_erase_cached(node, &sibling->execlists.virtual);
> 
> There can only be one queued request?

Yes, Highlander.

> > +
> > +             spin_unlock_irq(&sibling->timeline.lock);
> > +     }
> > +     GEM_BUG_ON(__tasklet_is_scheduled(&ve->base.execlists.tasklet));
> 
> Why would this not fire since virtual_engine_free can get called from 
> set_param at any time?
> 
> > +
> > +     if (ve->context.state)
> > +             __execlists_context_fini(&ve->context);
> 
> And here why it can't be in use?

Because we take a ref while active.

> > +static void virtual_engine_initial_hint(struct virtual_engine *ve)
> > +{
> > +     int swp;
> > +
> > +     /*
> > +      * Pick a random sibling on starting to help spread the load around.
> > +      *
> > +      * New contexts are typically created with exactly the same order
> > +      * of siblings, and often started in batches. Due to the way we iterate
> > +      * the array of sibling when submitting requests, sibling[0] is
> > +      * the array of siblings when submitting requests, sibling[0] is
> > +      * randomised across the system, we also help spread the load by the
> > +      * first engine we inspect being different each time.
> > +      *
> > +      * NB This does not force us to execute on this engine, it will just
> > +      * typically be the first we inspect for submission.
> > +      */
> > +     swp = prandom_u32_max(ve->count);
> > +     if (!swp)
> > +             return;
> 
> Was random better than round robin? Although yeah, it is local to each 
> engine map so global rr or random pick would be a more complicated 
> implementation best left for later if needed.

It makes no difference at all. It just looks cool, and I felt the comments
about there being an implicit bias based on sibling index were useful, but
overall that bias is lost in the noise.

> > +     local_irq_disable();
> > +     for (n = 0; READ_ONCE(ve->request) && n < ve->count; n++) {
> > +             struct intel_engine_cs *sibling = ve->siblings[n];
> > +             struct ve_node * const node = &ve->nodes[sibling->id];
> > +             struct rb_node **parent, *rb;
> > +             bool first;
> > +
> > +             spin_lock(&sibling->timeline.lock);
> > +
> > +             if (!RB_EMPTY_NODE(&node->rb)) {
> > +                     first = rb_first_cached(&sibling->execlists.virtual) == &node->rb;
> > +                     if (prio == node->prio || (prio > node->prio && first))
> > +                             goto submit_engine;
> > +
> > +                     rb_erase_cached(&node->rb, &sibling->execlists.virtual);
> > +             }
> 
> What does this block do?

Try to avoid rebalancing the tree if we can reuse the slot from last time,
i.e. if that sibling hasn't yet noticed that the previous request ran to
completion and so hasn't erased our node.

> > +
> > +             rb = NULL;
> > +             first = true;
> > +             parent = &sibling->execlists.virtual.rb_root.rb_node;
> > +             while (*parent) {
> > +                     struct ve_node *other;
> > +
> > +                     rb = *parent;
> > +                     other = rb_entry(rb, typeof(*other), rb);
> > +                     if (prio > other->prio) {
> > +                             parent = &rb->rb_left;
> > +                     } else {
> > +                             parent = &rb->rb_right;
> > +                             first = false;
> > +                     }
> > +             }
> > +
> > +             rb_link_node(&node->rb, rb, parent);
> > +             rb_insert_color_cached(&node->rb,
> > +                                    &sibling->execlists.virtual,
> > +                                    first);
> > +
> > +submit_engine:
> > +             GEM_BUG_ON(RB_EMPTY_NODE(&node->rb));
> > +             node->prio = prio;
> > +             if (first && prio > sibling->execlists.queue_priority_hint) {
> > +                     sibling->execlists.queue_priority_hint = prio;
> > +                     tasklet_hi_schedule(&sibling->execlists.tasklet);
> > +             }
> > +
> > +             spin_unlock(&sibling->timeline.lock);
> > +     }
> > +     local_irq_enable();
> > +}
> > +
> > +static void virtual_submit_request(struct i915_request *request)
> > +{
> > +     struct virtual_engine *ve = to_virtual_engine(request->engine);
> > +
> > +     GEM_BUG_ON(ve->base.submit_request != virtual_submit_request);
> > +
> > +     GEM_BUG_ON(ve->request);
> > +     ve->base.execlists.queue_priority_hint = rq_prio(request);
> > +     WRITE_ONCE(ve->request, request);
> > +
> > +     tasklet_schedule(&ve->base.execlists.tasklet);
> 
> Not tasklet_hi_schedule like we otherwise do?

I didn't feel like this was as high a priority. But a tasklet already
scheduled at normal priority won't be promoted when tasklet_hi_schedule()
is later called on it from the interrupt, so probably not wise after all.

> > +     for (n = 0; n < count; n++) {
> > +             struct intel_engine_cs *sibling = siblings[n];
> > +
> > +             GEM_BUG_ON(!is_power_of_2(sibling->mask));
> > +             if (sibling->mask & ve->base.mask)
> > +                     continue;
> 
> Eliminate duplicates? Need or want?

Eliminating duplicates, because a duplicate has no impact on the load
balancing decisions and just adds to the arrays.

Having 2 vcs0 and 1 vcs1 does not make selecting vcs0 twice as likely.
Which engine is used is always decided on a first come, first served basis
(discounting the same hysteresis for pinned context images).

> > +
> > +             if (sibling->execlists.tasklet.func != execlists_submission_tasklet) {
> > +                     err = -ENODEV;
> > +                     goto err_put;
> 
> Is this intended to prevent making virtual engines from virtual engines 
> or more? Would it make sense to put an explicit is_virtual check for 
> clarity? (Since in the extension processing it doesn't check - or maybe 
> it would be good to check there early?)

It really depends on the lock subclass.

This is the extension processing. This is where we validate that the
combination of engines chosen makes sense...

> > +             }
> > +
> > +             GEM_BUG_ON(RB_EMPTY_NODE(&ve->nodes[sibling->id].rb));
> > +             RB_CLEAR_NODE(&ve->nodes[sibling->id].rb);
> > +
> > +             ve->siblings[ve->count++] = sibling;
> > +             ve->base.mask |= sibling->mask;
> > +
> 
> /* Allow only same engine class. */
> 
> > +             if (ve->base.class != OTHER_CLASS) {
> > +                     if (ve->base.class != sibling->class) {
> > +                             err = -EINVAL;
> > +                             goto err_put;
> > +                     }
> > +                     continue;
> > +             }
> > +
> > +             ve->base.class = sibling->class;
> > +             snprintf(ve->base.name, sizeof(ve->base.name),
> > +                      "v%dx%d", ve->base.class, count);
> 
> Do we want to go for unique names? Like instead of count have a unique 
> monotonically increasing counter at the end? Or maybe 
> virt<unique>:<class>:<sibling-count>? Might need increasing the engine 
> name buffer.

Not really, as at the moment, the veng name only escapes via the
selftests. Especially not a meaningless uuid, when you have the ctx
name/pid around.

We could add it to debugfs/contexts_info, but there it should be clear
that each is limited to a single context.

So I haven't a clear use for a better name, and propose to not worry
until we have something to disambiguate.

> > +             ve->base.context_size = sibling->context_size;
> > +
> > +             ve->base.emit_bb_start = sibling->emit_bb_start;
> > +             ve->base.emit_flush = sibling->emit_flush;
> > +             ve->base.emit_init_breadcrumb = sibling->emit_init_breadcrumb;
> > +             ve->base.emit_fini_breadcrumb = sibling->emit_fini_breadcrumb;
> > +             ve->base.emit_fini_breadcrumb_dw =
> > +                     sibling->emit_fini_breadcrumb_dw;
> > +     }
> > +
> > +     /* gracefully replace a degenerate virtual engine */
> > +     if (is_power_of_2(ve->base.mask)) {
> > +             struct intel_engine_cs *actual = ve->siblings[0];
> > +             virtual_engine_free(&ve->kref);
> > +             return actual;
> > +     }
> 
> Why the term degenerate?

"In mathematics, a degenerate case is a limiting case in which an element
of a class of objects is qualitatively different from the rest of the
class and hence belongs to another, usually simpler, class."

The question though is whether such a replacement is truly transparent.
And if it isn't, then we can't do it. (It would be a pity, as a veng adds
substantially to the overhead of request processing.)

Per-ctx-engine settings (such as sseu) probably should allow
differentiation between veng(rcs0) and rcs0. If that is desired, I think
it would be simpler to allow each instance of rcs0 in the engines[] to
have its own logical state (struct intel_context). Hmm.

> Also, is it possible at this stage? Higher level code will avoid 
> creating a veng with only one engine.

We have more complete checks here.

> > +struct i915_context_engines_load_balance {
> > +     struct i915_user_extension base;
> > +
> > +     __u16 engine_index;
> 
> Missing doc here.
> 
> Any need for mbz16 or could make the index u32?

You want how many engines per veng?! Nightmare scaling, and it would
almost certainly need NUMA-awareness, like a true scheduler.

I was thinking the spare bits could come in handy later, and a u8 would
require more mbz slots.
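For reference, with the uapi as it stands in this revision, userspace
would fill in the extension along these lines (an illustrative sketch
only; the surrounding I915_CONTEXT_PARAM_ENGINES setup from the earlier
patch is omitted, and the exact layout is still subject to change):

	struct i915_context_engines_load_balance balance = {
		.base.name = I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE,
		/* which slot of the context's engines[] becomes virtual */
		.engine_index = 0,
		/* bitmask of other engines[] slots to balance across */
		.engines_mask = (1ull << 1) | (1ull << 2),
	};
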
-Chris

