[Intel-gfx] [PATCH 14/40] drm/i915: Combine multiple internal plists into the same i915_priolist bucket

Tue Sep 25 07:55:20 UTC 2018

Quoting Tvrtko Ursulin (2018-09-24 11:25:37)
> 
> On 19/09/2018 20:55, Chris Wilson wrote:
> > As we are about to allow ourselves to slightly bump the user priority
> > into a few different sublevels, packthose internal priority lists
> > into the same i915_priolist to keep the rbtree compact and avoid having
> > to allocate the default user priority even after the internal bumping.
> > The downside to having an requests[] rather than a node per active list,
> > is that we then have to walk over the empty higher priority lists. To
> > compensate, we track the active buckets and use a small bitmap to skip
> > over any inactive ones.
> > 
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > ---
> >   drivers/gpu/drm/i915/intel_engine_cs.c      |  6 +-
> >   drivers/gpu/drm/i915/intel_guc_submission.c | 12 ++-
> >   drivers/gpu/drm/i915/intel_lrc.c            | 87 ++++++++++++++-------
> >   drivers/gpu/drm/i915/intel_ringbuffer.h     | 13 ++-
> >   4 files changed, 80 insertions(+), 38 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
> > index 217ed3ee1cab..83f2f7774c1f 100644
> > --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> > +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> > @@ -1534,10 +1534,10 @@ void intel_engine_dump(struct intel_engine_cs *engine,
> >       count = 0;
> >       drm_printf(m, "\t\tQueue priority: %d\n", execlists->queue_priority);
> >       for (rb = rb_first_cached(&execlists->queue); rb; rb = rb_next(rb)) {
> > -             struct i915_priolist *p =
> > -                     rb_entry(rb, typeof(*p), node);
> > +             struct i915_priolist *p = rb_entry(rb, typeof(*p), node);
> > +             int i;
> >   
> > -             list_for_each_entry(rq, &p->requests, sched.link) {
> > +             priolist_for_each_request(rq, p, i) {
> >                       if (count++ < MAX_REQUESTS_TO_SHOW - 1)
> >                               print_request(m, rq, "\t\tQ ");
> >                       else
> > diff --git a/drivers/gpu/drm/i915/intel_guc_submission.c b/drivers/gpu/drm/i915/intel_guc_submission.c
> > index 6f693ef62c64..8531bd917ec3 100644
> > --- a/drivers/gpu/drm/i915/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/intel_guc_submission.c
> > @@ -726,30 +726,28 @@ static bool __guc_dequeue(struct intel_engine_cs *engine)
> >       while ((rb = rb_first_cached(&execlists->queue))) {
> >               struct i915_priolist *p = to_priolist(rb);
> >               struct i915_request *rq, *rn;
> > +             int i;
> >   
> > -             list_for_each_entry_safe(rq, rn, &p->requests, sched.link) {
> > +             priolist_for_each_request_consume(rq, rn, p, i) {
> 
> Hm consumed clears the bitmask every time, but we are not certain yet we 
> will consume the request.

We clear it after the loop, so when we take the early break the bitmask
is unaffected.

> >                       if (last && rq->hw_context != last->hw_context) {
> > -                             if (port == last_port) {
> > -                                     __list_del_many(&p->requests,
> > -                                                     &rq->sched.link);
> > +                             if (port == last_port)
> >                                       goto done;
> > -                             }
> >   
> >                               if (submit)
> >                                       port_assign(port, last);
> >                               port++;
> >                       }
> >   
> > -                     INIT_LIST_HEAD(&rq->sched.link);
> > +                     list_del_init(&rq->sched.link);
> >   
> >                       __i915_request_submit(rq);
> >                       trace_i915_request_in(rq, port_index(port, execlists));
> > +
> >                       last = rq;
> >                       submit = true;
> >               }
> >   
> >               rb_erase_cached(&p->node, &execlists->queue);
> > -             INIT_LIST_HEAD(&p->requests);
> >               if (p->priority != I915_PRIORITY_NORMAL)
> >                       kmem_cache_free(engine->i915->priorities, p);
> >       }
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > index e8de250c3413..aeae82b5223c 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -259,14 +259,49 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
> >       ce->lrc_desc = desc;
> >   }
> >   
> > -static struct i915_priolist *
> > +static void assert_priolists(struct intel_engine_execlists * const execlists,
> > +                          int queue_priority)
> > +{
> > +     struct rb_node *rb;
> > +     int last_prio, i;
> > +
> > +     if (!IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
> > +             return;
> > +
> > +     GEM_BUG_ON(rb_first_cached(&execlists->queue) !=
> > +                rb_first(&execlists->queue.rb_root));
> > +
> > +     last_prio = (queue_priority >> I915_USER_PRIORITY_SHIFT) + 1;
>  >
> > +     for (rb = rb_first_cached(&execlists->queue); rb; rb = rb_next(rb)) {
> > +             struct i915_priolist *p = to_priolist(rb);
> > +
> > +             GEM_BUG_ON(p->priority >= last_prio);
> > +             last_prio = p->priority;
> > +
> > +             GEM_BUG_ON(!p->used);
> > +             for (i = 0; i < ARRAY_SIZE(p->requests); i++) {
> > +                     if (list_empty(&p->requests[i]))
> > +                             continue;
> 
> Asserting that bitmask slot is not set if list is empty is not interesting?

We can have the bit set with an empty list after request promotion. That
seemed like the reasonable tradeoff rather than have a more complicated
check after each list_move to decide if the old bit needs to be cleared.

> > +
> > +                     GEM_BUG_ON(!(p->used & BIT(i)));
> > +             }
> > +     }
> 
> In general, won't this be a tiny bit too expensive on debug build and 
> with priority/rq stress tests? To walk absolutely all priority levels 
> times bucket count on every lookup and dequeue. Or maybe we don't have 
> any tests at the moment which instantiate that many levels.

Worth it though. Whenever we abuse the system, our ability to catch
errors depends on our sanitychecks. I thought this was worth having the
reassurance that it worked :)

> > +}
> > +
> > +static struct list_head *
> >   lookup_priolist(struct intel_engine_cs *engine, int prio)
> >   {
> >       struct intel_engine_execlists * const execlists = &engine->execlists;
> >       struct i915_priolist *p;
> >       struct rb_node **parent, *rb;
> >       bool first = true;
> > +     int idx, i;
> > +
> > +     assert_priolists(execlists, INT_MAX);
> >   
> > +     /* buckets sorted from highest [in slot 0] to lowest priority */
> > +     idx = I915_PRIORITY_COUNT - (prio & ~I915_PRIORITY_MASK) - 1;
> 
> 0 - (prio & ~0) - 1 = -prio - 1, hm..?
> 
> Hm no.. I915_PRIORITY_COUNT with zero reserved internal bits is actually 
> 1.  So 1 - (prio & 0x0) - 1 = 0, ok phew..
> 
> Unintuitive for the count to be one with no reserved bits? Although it 
> is only a temporary stage..

Would it be preferrable to use "~MASK - (prio & ~MASK)"?

Iirc, mask is only used for internal bit. I followed the pattern of
page_mask but I think it would simply a bit to use INTERNAL_MASK.
-Chris