[Intel-gfx] [PATCH 10/11] drm/i915: Use HW semaphores for inter-engine synchronisation on gen8+

Chris Wilson chris at chris-wilson.co.uk
Thu Jan 31 17:21:39 UTC 2019


Quoting Tvrtko Ursulin (2019-01-31 13:19:31)
> 
> On 30/01/2019 02:19, Chris Wilson wrote:
> > Having introduced per-context seqno, we now have a means to identify
> > progress across the system without fear of rollback, as befell the
> > global_seqno. That is, we can program a MI_SEMAPHORE_WAIT operation in
> > advance of submission, safe in the knowledge that our target seqno and
> > address are stable.
> > 
> > However, since we are telling the GPU to busy-spin on the target address
> > until it matches the signaling seqno, we only want to do so when we are
> > sure that busy-spin will be completed quickly. To achieve this we only
> > submit the request to HW once the signaler is itself executing (modulo
> > preemption causing us to wait longer), and we only do so for default and
> > above priority requests (so that idle priority tasks never themselves
> > hog the GPU waiting for others).
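
(For reference, the wait we program in advance is just an inline
MI_SEMAPHORE_WAIT in poll mode against the signaler's seqno in its
HWSP. A trimmed sketch of the emission, with "to" the waiter and
"from" the signaler, eliding the HWSP pinning and error handling;
the MI_SEMAPHORE_* flags are the real ones:

	cs = intel_ring_begin(to, 4);

	/* Busy-spin until the signaler's HWSP slot reads >= seqno */
	*cs++ = MI_SEMAPHORE_WAIT |
		MI_SEMAPHORE_GLOBAL_GTT |
		MI_SEMAPHORE_POLL |
		MI_SEMAPHORE_SAD_GTE_SDD;
	*cs++ = from->fence.seqno;	/* value we are waiting for */
	*cs++ = hwsp_offset;		/* GGTT address of that HWSP slot */
	*cs++ = 0;			/* upper 32 bits of the address */

	intel_ring_advance(to, cs);

The greater-or-equal comparison (SAD_GTE_SDD) is what relies on the
seqno never appearing to roll back underneath us.)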
> 
> It could be milliseconds though. I think apart from media-bench saying 
> this is faster, we would need to look at performance per Watt as well.
> 
> RING_SEMA_WAIT_POLL is a potential tunable as well. Not that I have an 
> idea how to tune it.
> 
> Eventually, do we dare adding this without a runtime switch? (There, I 
> mentioned the taboo.)

Yes, we could make it a context setparam. I used priority here, as the
argument that idle workloads don't want the extra power draw makes sense.
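
If we went that route the plumbing is small; a rough sketch, with a
hypothetical I915_CONTEXT_PARAM_SEMAPHORES param and user_flags bit
(neither exists in this series):

	case I915_CONTEXT_PARAM_SEMAPHORES: /* hypothetical name */
		if (args->size)
			ret = -EINVAL;
		else if (args->value)
			set_bit(UCONTEXT_SEMAPHORES, &ctx->user_flags);
		else
			clear_bit(UCONTEXT_SEMAPHORES, &ctx->user_flags);
		break;

with emit_semaphore_wait() then keyed off the context bit instead of
(or in addition to) the priority check.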

Downside of making it opt-in: nobody benefits by default. Still, it's
pretty limited to media workloads at the moment (who else uses multiple
rings atm?), but even there reducing latency for desktop video is
justifiable imo.

(Now having said that, I should go out and find a video player to
benchmark... Maybe we can demonstrate reduced frame drop for Kodi. If I
say "Kodi, Kodi, Kodi" I summon a Kodi dev right?)

Downside of making it opt-out: everybody gets to experience our bugs,
and the onus is on us to make the right choice.

> > @@ -605,6 +606,17 @@ static bool can_merge_rq(const struct i915_request *prev,
> >   {
> >       GEM_BUG_ON(!assert_priority_queue(prev, next));
> >   
> > +     /*
> > +      * To avoid AB-BA deadlocks, we simply restrict ourselves to only
> > +      * submitting one semaphore (think HW spinlock) to HW at a time. This
> > +      * prevents the execution callback on a later semaphore from being
> > +      * queued on another engine, so no cycle can be formed. Preemption
> > +      * rules should mean that if this semaphore is preempted, its
> > +      * dependency chain is preserved and suitably promoted via PI.
> > +      */
> > +     if (prev->sched.semaphore && !i915_request_started(prev))
> > +             return false;
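
To spell out the cycle this guards against, with hypothetical request
names: if we merged a second request in behind an unstarted semaphore,
submitting it would fire its execute_cb and release a waiter on
another engine:

	engine A: ELSP[0] = reqA1 { SEMAPHORE_WAIT on reqB1 } + reqA0
	engine B: ELSP[0] = reqB1 { SEMAPHORE_WAIT on reqA0 }

reqB1 spins waiting for reqA0, which is stuck behind reqA1, which
spins waiting for reqB1; a cycle that only preemption could break.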

The other way I was thinking we could solve this is to move the
execute_cb from i915_request_submit to the point where we actually
insert the request into ELSP[0] (or promote it from ELSP[1]).

I don't much like either. I don't really want to walk the list of
requests for port0 checking for execute_cb, but I also don't like
arbitrarily splitting contexts (although there seem to be reasons to do
that anyway).

It all depends on how fast we can service CS interrupts, and that needs
to always be fast. :|
-Chris

