[Intel-gfx] [PATCH 10/11] drm/i915: Use HW semaphores for inter-engine synchronisation on gen8+
Chris Wilson
chris at chris-wilson.co.uk
Thu Jan 31 13:39:50 UTC 2019
Quoting Tvrtko Ursulin (2019-01-31 13:19:31)
>
> On 30/01/2019 02:19, Chris Wilson wrote:
> > Having introduced per-context seqno, we now have a means to identify
> > progress across the system without fear of rollback, as befell the
> > global_seqno. That is, we can program a MI_SEMAPHORE_WAIT operation in
> > advance of submission, safe in the knowledge that our target seqno and
> > address are stable.
> >
> > However, since we are telling the GPU to busy-spin on the target address
> > until it matches the signaling seqno, we only want to do so when we are
> > sure that the busy-spin will complete quickly. To achieve this, we only
> > submit the request to HW once the signaler is itself executing (modulo
> > preemption causing us to wait longer), and we only do so for default and
> > above priority requests (so that idle priority tasks never themselves
> > hog the GPU waiting for others).
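For reference, the priority gating described above happens at the point
where one request awaits another. A minimal sketch of that call-site,
assuming the helper names used elsewhere in the tree of that era
(intel_engine_has_semaphores, I915_PRIORITY_NORMAL) -- treat the exact
form as an assumption, not a quote from the patch:

        /*
         * Sketch (assumed form): same-engine waits stay as software
         * fences; cross-engine waits use the HW semaphore only for
         * default-and-above priority contexts, everyone else falls
         * back to the interrupt-driven dma-fence await.
         */
        if (to->engine == from->engine) {
                ret = i915_sw_fence_await_sw_fence_gfp(&to->submit,
                                                       &from->submit,
                                                       I915_FENCE_GFP);
        } else if (intel_engine_has_semaphores(to->engine) &&
                   to->gem_context->sched.priority >= I915_PRIORITY_NORMAL) {
                ret = emit_semaphore_wait(to, from, I915_FENCE_GFP);
        } else {
                ret = i915_sw_fence_await_dma_fence(&to->submit,
                                                    &from->fence, 0,
                                                    I915_FENCE_GFP);
        }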
>
> It could be milliseconds though. I think apart from media-bench saying
> this is faster, we would need to look at performance per Watt as well.
All throughput measurements are substantially improved, as you would
expect, and inter-engine latency is decreased. I would hope it would
powergate/rc6 the EUs while the CS was spinning, but I don't know :)
> RING_SEMA_WAIT_POLL is a potential tunable as well. Not that I have an
> idea how to tune it.
>
> Eventually, do we dare adding this without a runtime switch? (There, I
> mentioned the taboo.)
Yes :p
> What about signal mode and handling this via context switches?
That's 99% of the timeslicing scheduler right there -- the handling of
deferred work with the complication of it impacting other engines.
> > +static int
> > +emit_semaphore_wait(struct i915_request *to,
> > +                    struct i915_request *from,
> > +                    gfp_t gfp)
> > +{
> > +        u32 *cs;
> > +        int err;
> > +
> > +        GEM_BUG_ON(!from->timeline->has_initial_breadcrumb);
> > +
> > +        err = i915_timeline_read_lock(from->timeline, to);
> > +        if (err)
> > +                return err;
> > +
> > +        /*
> > +         * If we know our signaling request has started, we know that it
> > +         * must, at least, have passed its initial breadcrumb and that its
> > +         * seqno can only increase, therefore any change in its breadcrumb
> > +         * must indicate completion. By using a "not equal to start"
> > +         * compare we avoid the murky issue of how to handle seqno
> > +         * wraparound in an async environment (short answer, we must stop
> > +         * the world whenever any context wraps!) as the likelihood of
> > +         * missing one request and then seeing the same start value for a
> > +         * new request is 1 in 2^31, and even then we know that the new
> > +         * request has started and is in progress, so we are sure it will
> > +         * complete soon enough (nothing to worry about).
> > +         */
> > +        if (i915_request_started(from)) {
> > +                cs = intel_ring_begin(to, 4);
> > +                if (IS_ERR(cs))
> > +                        return PTR_ERR(cs);
> > +
> > +                *cs++ = MI_SEMAPHORE_WAIT |
> > +                        MI_SEMAPHORE_GLOBAL_GTT |
> > +                        MI_SEMAPHORE_POLL |
> > +                        MI_SEMAPHORE_SAD_NEQ_SDD;
> > +                *cs++ = from->fence.seqno - 1;
> > +                *cs++ = from->timeline->hwsp_offset;
> > +                *cs++ = 0;
> > +
> > +                intel_ring_advance(to, cs);
> > +        } else {
> > +                err = i915_request_await_execution(to, from, gfp);
> > +                if (err)
> > +                        return err;
> > +
> > +                cs = intel_ring_begin(to, 4);
> > +                if (IS_ERR(cs))
> > +                        return PTR_ERR(cs);
> > +
> > +                /*
> > +                 * Using greater-than-or-equal here means we have to worry
> > +                 * about seqno wraparound. To side step that issue, we swap
> > +                 * the timeline HWSP upon wrapping, so that anyone listening
> > +                 * for the old (pre-wrap) values does not see the much
> > +                 * smaller (post-wrap) values they were not expecting (and
> > +                 * so wait forever).
> > +                 */
> > +                *cs++ = MI_SEMAPHORE_WAIT |
> > +                        MI_SEMAPHORE_GLOBAL_GTT |
> > +                        MI_SEMAPHORE_POLL |
> > +                        MI_SEMAPHORE_SAD_GTE_SDD;
> > +                *cs++ = from->fence.seqno;
> > +                *cs++ = from->timeline->hwsp_offset;
> > +                *cs++ = 0;
> > +
> > +                intel_ring_advance(to, cs);
> > +        }
>
> Would it not work to have a single path which emits the wait on NEQ
> from->fence.seqno - 1, with just the i915_request_await_execution made
> conditional on i915_request_started?
>
> In the !started case, having added the await, we would know the
> semaphore wait would not run until after the dependency has started, and
> NEQ would be true when it completes. The same as the started path above.
We may have previously submitted the signaler deep within a very long
queue on its engine, in which case we cannot determine its position and
could sample the HWSP a long time before the signaler even begins. Even
if we launch both requests on their different engines at the same time,
we could still sample before the signaler has written its initial
breadcrumb.
I should remove the current NEQ path; it predates my committing to
handling the HWSP across wraparounds, and is now just needless
complication.
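That would leave something like the sketch below, i.e. the existing GTE
branch promoted to be the only path (a sketch under that assumption,
reusing only the names from the hunk above):

        /*
         * Sketch: with the NEQ path dropped, always order against the
         * signaler's execution, then busy-wait on its HWSP with a GTE
         * compare; swapping the HWSP on wraparound keeps the comparison
         * safe across seqno wrap.
         */
        err = i915_timeline_read_lock(from->timeline, to);
        if (err)
                return err;

        err = i915_request_await_execution(to, from, gfp);
        if (err)
                return err;

        cs = intel_ring_begin(to, 4);
        if (IS_ERR(cs))
                return PTR_ERR(cs);

        *cs++ = MI_SEMAPHORE_WAIT |
                MI_SEMAPHORE_GLOBAL_GTT |
                MI_SEMAPHORE_POLL |
                MI_SEMAPHORE_SAD_GTE_SDD;
        *cs++ = from->fence.seqno;
        *cs++ = from->timeline->hwsp_offset;
        *cs++ = 0;

        intel_ring_advance(to, cs);
        return 0;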
-Chris