[Intel-gfx] [PATCH 10/11] drm/i915: Use HW semaphores for inter-engine synchronisation on gen8+
Chris Wilson
chris at chris-wilson.co.uk
Thu Jan 31 13:39:50 UTC 2019
Quoting Tvrtko Ursulin (2019-01-31 13:19:31)
>
> On 30/01/2019 02:19, Chris Wilson wrote:
> > Having introduced per-context seqno, we now have a means to identify
> > progress across the system without fear of rollback, as befell the
> > global_seqno. That is, we can program a MI_SEMAPHORE_WAIT operation in
> > advance of submission, safe in the knowledge that our target seqno and
> > address are stable.
> >
> > However, since we are telling the GPU to busy-spin on the target address
> > until it matches the signaling seqno, we only want to do so when we are
> > sure that the busy-spin will complete quickly. To achieve this, we only
> > submit the request to HW once the signaler is itself executing (modulo
> > preemption causing us to wait longer), and we only do so for default and
> > above priority requests (so that idle priority tasks never themselves
> > hog the GPU waiting for others).
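For reference, the priority gating described above happens at the point
where one request awaits another. A minimal sketch of that call-site,
assuming the helper names used elsewhere in the tree of that era
(intel_engine_has_semaphores, I915_PRIORITY_NORMAL) -- treat the exact
form as an assumption, not a quote from the patch:

        /*
         * Sketch (assumed form): same-engine waits stay as software
         * fences; cross-engine waits use the HW semaphore only for
         * default-and-above priority contexts, everyone else falls
         * back to the interrupt-driven dma-fence await.
         */
        if (to->engine == from->engine) {
                ret = i915_sw_fence_await_sw_fence_gfp(&to->submit,
                                                       &from->submit,
                                                       I915_FENCE_GFP);
        } else if (intel_engine_has_semaphores(to->engine) &&
                   to->gem_context->sched.priority >= I915_PRIORITY_NORMAL) {
                ret = emit_semaphore_wait(to, from, I915_FENCE_GFP);
        } else {
                ret = i915_sw_fence_await_dma_fence(&to->submit,
                                                    &from->fence, 0,
                                                    I915_FENCE_GFP);
        }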
>
> It could be milliseconds though. I think apart from media-bench saying
> this is faster, we would need to look at performance per Watt as well.
All throughput measurements are substantially improved, as you would
expect, and inter-engine latency is decreased. I would hope it would
powergate/rc6 the EUs while the CS was spinning, but I don't know :)
> RING_SEMA_WAIT_POLL is a potential tunable as well. Not that I have an
> idea how to tune it.
>
> Eventually, do we dare adding this without a runtime switch? (There, I
> mentioned the taboo.)
Yes :p
> What about signal mode and handling this via context switches?
That's 99% of the timeslicing scheduler right there -- the handling of
deferred work with the complication of it impacting other engines.
> > +static int
> > +emit_semaphore_wait(struct i915_request *to,
> > +                    struct i915_request *from,
> > +                    gfp_t gfp)
> > +{
> > +        u32 *cs;
> > +        int err;
> > +
> > +        GEM_BUG_ON(!from->timeline->has_initial_breadcrumb);
> > +
> > +        err = i915_timeline_read_lock(from->timeline, to);
> > +        if (err)
> > +                return err;
> > +
> > +        /*
> > +         * If we know our signaling request has started, we know that it
> > +         * must, at least, have passed its initial breadcrumb and that its
> > +         * seqno can only increase, therefore any change in its breadcrumb
> > +         * must indicate completion. By using a "not equal to start"
> > +         * compare we avoid the murky issue of how to handle seqno
> > +         * wraparound in an async environment (short answer, we must stop
> > +         * the world whenever any context wraps!) as the likelihood of
> > +         * missing one request and then seeing the same start value for a
> > +         * new request is 1 in 2^31, and even then we know that the new
> > +         * request has started and is in progress, so we are sure it will
> > +         * complete soon enough (nothing to worry about).
> > +         */
> > +        if (i915_request_started(from)) {
> > +                cs = intel_ring_begin(to, 4);
> > +                if (IS_ERR(cs))
> > +                        return PTR_ERR(cs);
> > +
> > +                *cs++ = MI_SEMAPHORE_WAIT |
> > +                        MI_SEMAPHORE_GLOBAL_GTT |
> > +                        MI_SEMAPHORE_POLL |
> > +                        MI_SEMAPHORE_SAD_NEQ_SDD;
> > +                *cs++ = from->fence.seqno - 1;
> > +                *cs++ = from->timeline->hwsp_offset;
> > +                *cs++ = 0;
> > +
> > +                intel_ring_advance(to, cs);
> > +        } else {
> > +                err = i915_request_await_execution(to, from, gfp);
> > +                if (err)
> > +                        return err;
> > +
> > +                cs = intel_ring_begin(to, 4);
> > +                if (IS_ERR(cs))
> > +                        return PTR_ERR(cs);
> > +
> > +                /*
> > +                 * Using greater-than-or-equal here means we have to worry
> > +                 * about seqno wraparound. To side step that issue, we swap
> > +                 * the timeline HWSP upon wrapping, so that anyone listening
> > +                 * for the old (pre-wrap) values does not see the much
> > +                 * smaller (post-wrap) values they were not expecting (and
> > +                 * so wait forever).
> > +                 */
> > +                *cs++ = MI_SEMAPHORE_WAIT |
> > +                        MI_SEMAPHORE_GLOBAL_GTT |
> > +                        MI_SEMAPHORE_POLL |
> > +                        MI_SEMAPHORE_SAD_GTE_SDD;
> > +                *cs++ = from->fence.seqno;
> > +                *cs++ = from->timeline->hwsp_offset;
> > +                *cs++ = 0;
> > +
> > +                intel_ring_advance(to, cs);
> > +        }
>
> Would it not work to have a single path which emits the wait on NEQ
> from->fence.seqno - 1, with just the i915_request_await_execution made
> conditional on i915_request_started?
>
> In the !started case, having added the await, we would know the
> semaphore wait would not run until after the dependency has started, and
> NEQ would be true when it completes. The same as the started path above.
We may have previously submitted the signaler deep within a very long
queue on its engine, in which case we cannot determine its position and
could sample the HWSP a long time before the signaler even begins. Even
if we launch both requests on their different engines at the same time,
we could still sample before the signaler has written its initial
breadcrumb.
I should remove the current NEQ path; it predates my committing to
handling the HWSP across wraparounds, and is now just needless
complication.
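That would leave something like the sketch below, i.e. the existing GTE
branch promoted to be the only path (a sketch under that assumption,
reusing only the names from the hunk above):

        /*
         * Sketch: with the NEQ path dropped, always order against the
         * signaler's execution, then busy-wait on its HWSP with a GTE
         * compare; swapping the HWSP on wraparound keeps the comparison
         * safe across seqno wrap.
         */
        err = i915_timeline_read_lock(from->timeline, to);
        if (err)
                return err;

        err = i915_request_await_execution(to, from, gfp);
        if (err)
                return err;

        cs = intel_ring_begin(to, 4);
        if (IS_ERR(cs))
                return PTR_ERR(cs);

        *cs++ = MI_SEMAPHORE_WAIT |
                MI_SEMAPHORE_GLOBAL_GTT |
                MI_SEMAPHORE_POLL |
                MI_SEMAPHORE_SAD_GTE_SDD;
        *cs++ = from->fence.seqno;
        *cs++ = from->timeline->hwsp_offset;
        *cs++ = 0;

        intel_ring_advance(to, cs);
        return 0;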
-Chris