[Intel-gfx] [PATCH 28/32] drm/i915: Load balancing across a virtual engine

Chris Wilson chris at chris-wilson.co.uk
Wed Apr 17 13:51:16 UTC 2019


Quoting Tvrtko Ursulin (2019-04-17 12:26:04)
> 
> On 17/04/2019 08:56, Chris Wilson wrote:
> > diff --git a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
> > index e19f84b006cc..f900f0680647 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
> > @@ -290,8 +290,12 @@ bool i915_request_enable_breadcrumb(struct i915_request *rq)
> >                               break;
> >               }
> >               list_add(&rq->signal_link, pos);
> > -             if (pos == &ce->signals) /* catch transitions from empty list */
> > +             if (pos == &ce->signals) { /* catch transitions from empty */
> >                       list_move_tail(&ce->signal_link, &b->signalers);
> > +             } else if (ce->engine != rq->engine) { /* virtualised */
> > +                     list_move_tail(&ce->signal_link, &b->signalers);
> > +                     intel_engine_queue_breadcrumbs(rq->engine);
> 
> Is there significance in the check not being based on engine->flags & VIRTUAL?

Imo, it feels like a more generalised statement that we are transferring
this request onto a new backend -- I wanted to say something like
rq->engine != last_breadcrumb_engine (and I think this is as close as we
can get, and a very good simulacrum).
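
Something along these lines, if the context tracked its last backend
(last_breadcrumb_engine is hypothetical, nothing in the patch records
it; ce->engine != rq->engine is the nearest test we have):

	if (rq->engine != ce->last_breadcrumb_engine) {
		/*
		 * The request was transferred onto a new backend:
		 * re-home the context on the new engine's signalers
		 * list and kick that engine's breadcrumb worker.
		 */
		list_move_tail(&ce->signal_link, &b->signalers);
		intel_engine_queue_breadcrumbs(rq->engine);
		ce->last_breadcrumb_engine = rq->engine;
	}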

> Actually, the signaling can get enabled either on the virtual or real 
> engine, depending on timing. I don't see that irq_enable/disable vfuncs 
> will be present on the veng though. So how does that work?

rq->engine is always a real engine here, as we are under the rq->lock
and have the I915_FENCE_FLAG_ACTIVE bit set.
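
To spell out the invariant (the helper is mine, not in the patch, and
intel_engine_is_virtual() stands for whatever test identifies the veng
frontend):

static void assert_breadcrumb_engine(struct i915_request *rq)
{
	lockdep_assert_held(&rq->lock);

	/* Only requests already picked up by submission reach here... */
	GEM_BUG_ON(!test_bit(I915_FENCE_FLAG_ACTIVE, &rq->fence.flags));

	/*
	 * ...and by that point rq->engine has been rewritten from the
	 * virtual frontend to the physical sibling executing it.
	 */
	GEM_BUG_ON(intel_engine_is_virtual(rq->engine));
}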

> Maybe there is a clue in the presence of intel_engine_queue_breadcrumbs, 
> but I don't see it.
> 
> > +     for (rb = rb_first_cached(&execlists->virtual); rb; ) {
> > +             struct virtual_engine *ve =
> > +                     rb_entry(rb, typeof(*ve), nodes[engine->id].rb);
> > +             struct i915_request *rq = READ_ONCE(ve->request);
> > +
> > +             if (!rq) { /* lazily cleanup after another engine handled rq */
> > +                     rb_erase_cached(rb, &execlists->virtual);
> > +                     RB_CLEAR_NODE(rb);
> > +                     rb = rb_first_cached(&execlists->virtual);
> > +                     continue;
> > +             }
> > +
> > +             if (!virtual_matches(ve, rq, engine)) {
> > +                     rb = rb_next(rb);
> > +                     continue;
> > +             }
> 
> Can this create a problem where, since the queue is supposed to be in 
> submission order, it skips some requests here if the owning veng is 
> already executing something, and so inverts the dependency chain?
> 
> Before semaphores I guess it wasn't possible, since only runnable 
> requests would be in the queue. But with semaphores I am not sure. The 
> paranoid thing would be to not dequeue any other veng request if the 
> engine is busy with some veng. But maybe that would serialize things 
> too much. Or is that too paranoid?

Even with semaphores, the other request cannot be put into the tree
until its signaling request has started. If preempted, we should be
adjusting the priorities such that the preempted request is put back
before its children.
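
Roughly the unwind we rely on, condensed from
__unwind_incomplete_requests() (error paths and bookkeeping elided):

	struct i915_request *rq, *rn;
	struct list_head *pl = NULL;
	int prio = I915_PRIORITY_INVALID;

	/*
	 * Walk the in-flight requests backwards, unsubmitting each and
	 * requeueing it at its original priority, so the preempted
	 * request goes back ahead of any child that was only made
	 * runnable by its semaphore.
	 */
	list_for_each_entry_safe_reverse(rq, rn,
					 &engine->timeline.requests,
					 link) {
		if (i915_request_completed(rq))
			break;

		__i915_request_unsubmit(rq);

		if (rq_prio(rq) != prio) {
			prio = rq_prio(rq);
			pl = i915_sched_lookup_priolist(engine, prio);
		}
		list_add(&rq->sched.link, pl);
	}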

> > +     while (rb) { /* XXX virtual is always taking precedence */
> > +             struct virtual_engine *ve =
> > +                     rb_entry(rb, typeof(*ve), nodes[engine->id].rb);
> > +             struct i915_request *rq;
> > +
> > +             spin_lock(&ve->base.timeline.lock);
> > +
> > +             rq = ve->request;
> > +             if (unlikely(!rq)) { /* lost the race to a sibling */
> > +                     spin_unlock(&ve->base.timeline.lock);
> > +                     rb_erase_cached(rb, &execlists->virtual);
> > +                     RB_CLEAR_NODE(rb);
> > +                     rb = rb_first_cached(&execlists->virtual);
> > +                     continue;
> > +             }
> > +
> > +             GEM_BUG_ON(rq != ve->request);
> > +             GEM_BUG_ON(rq->engine != &ve->base);
> > +             GEM_BUG_ON(rq->hw_context != &ve->context);
> > +
> > +             if (rq_prio(rq) >= queue_prio(execlists)) {
> > +                     if (!virtual_matches(ve, rq, engine)) {
> > +                             spin_unlock(&ve->base.timeline.lock);
> > +                             rb = rb_next(rb);
> > +                             continue;
> 
> Is this needed? The first virtual_matches loop skips all requests 
> which shouldn't be dequeued and leaves rb pointing to the top-priority 
> one which can be.

Now that we have the lock, we need to check that we actually own
ve->request. If we decide the first candidate wasn't good enough, but
the second is still a better choice than the normal queue, we need to
confirm that we can submit it on this engine.
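
So the shape is an optimistic peek followed by a locked re-check,
roughly (submit_on() is a stand-in for the real dequeue):

	rq = READ_ONCE(ve->request);	/* lockless peek: only a hint */
	if (!rq || !virtual_matches(ve, rq, engine))
		return;

	spin_lock(&ve->base.timeline.lock);
	rq = ve->request;		/* re-read: now stable */
	if (rq && virtual_matches(ve, rq, engine)) {
		/*
		 * Ownership and eligibility both confirmed; the request
		 * is moved onto this engine before the lock is dropped,
		 * so no sibling can claim it as well.
		 */
		submit_on(engine, rq);
	}
	spin_unlock(&ve->base.timeline.lock);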

> > +static void virtual_context_destroy(struct kref *kref)
> > +{
> > +     struct virtual_engine *ve =
> > +             container_of(kref, typeof(*ve), context.ref);
> > +     unsigned int n;
> > +
> > +     GEM_BUG_ON(ve->request);
> > +     GEM_BUG_ON(ve->context.active);
> > +
> > +     for (n = 0; n < ve->num_siblings; n++) {
> > +             struct intel_engine_cs *sibling = ve->siblings[n];
> > +             struct rb_node *node = &ve->nodes[sibling->id].rb;
> > +
> > +             if (RB_EMPTY_NODE(node))
> > +                     continue;
> > +
> > +             spin_lock_irq(&sibling->timeline.lock);
> > +
> > +             if (!RB_EMPTY_NODE(node))
> > +                     rb_erase_cached(node, &sibling->execlists.virtual);
> 
> A consequence of rq duplication across physical engines? Leave a comment?

A consequence of the asynchronous cleanup inside execlists_dequeue():
any sibling may erase its node once ve->request has been consumed, so
the unlocked RB_EMPTY_NODE() check is only an optimisation and has to
be repeated under the sibling's timeline lock.
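
Annotated, the destroy path above is the usual double-checked erase:

	if (RB_EMPTY_NODE(node))	/* unlocked: fast path only */
		continue;

	spin_lock_irq(&sibling->timeline.lock);

	/*
	 * Authoritative test: execlists_dequeue() on this sibling may
	 * have erased the node between our unlocked check and taking
	 * the lock, so look again before erasing.
	 */
	if (!RB_EMPTY_NODE(node))
		rb_erase_cached(node, &sibling->execlists.virtual);

	spin_unlock_irq(&sibling->timeline.lock);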

> > +/*
> > + * i915_context_engines_load_balance:
> > + *
> > + * Enable load balancing across this set of engines.
> > + *
> > + * Into the I915_EXEC_DEFAULT slot [0], a virtual engine is created that when
> > + * used will proxy the execbuffer request onto one of the set of engines
> > + * in such a way as to distribute the load evenly across the set.
> > + *
> > + * The set of engines must be compatible (e.g. the same HW class) as they
> > + * will share the same logical GPU context and ring.
> > + *
> > + * To intermix rendering with the virtual engine and direct rendering onto
> > + * the backing engines (bypassing the load balancing proxy), the context must
> > + * be defined to use a single timeline for all engines.
> > + */
> > +struct i915_context_engines_load_balance {
> > +     struct i915_user_extension base;
> > +
> > +     __u16 engine_index;
> > +     __u16 num_siblings;
> > +     __u32 flags; /* all undefined flags must be zero */
> > +
> > +     __u64 mbz64[1]; /* reserved for future use; must be zero */
> 
> Why an array if only one? Should we add a few more just in case?

Hopefully that's sarcasm. :-p
-Chris

