[Intel-gfx] [PATCH v3] drm/i915: Lock signaler timeline while navigating

Wed Sep 18 13:44:03 UTC 2019

Quoting Tvrtko Ursulin (2019-09-18 14:38:06)
> 
> On 17/09/2019 16:39, Chris Wilson wrote:
> > As we need to take a walk back along the signaler timeline to find the
> > fence before upon which we want to wait, we need to lock that timeline
> > to prevent it being modified as we walk. Similarly, we also need to
> > acquire a reference to the earlier fence while it still exists!
> > 
> > Though we lack the correct locking today, we are saved by the
> > overarching struct_mutex -- but that protection is being removed.
> > 
> > v2: Tvrtko made me realise I was being lax and using annotations to
> > ignore the AB-BA deadlock from the timeline overlap. As it would be
> > possible to construct a second request that was using a semaphore from the
> > same timeline as ourselves, we could quite easily end up in a situation
> > where we deadlocked in our mutex waits. Avoid that by using a trylock
> > and falling back to a normal dma-fence await if contended.
> 
> I did not figure out the exact AB-BA, but even on a more basic level 
> without the deadlock, using trylock would mean false positives ie. 
> falling back to software signaling with random mutex contention on the 
> same timeline. From a performance perspective this sounds not end of the 
> world, just unfortunate, but from the design perspective it has me 
> running away scared.
> 
> I guess the AB-BA would be interdependent requests from two timelines 
> where the direction of dependency switches over across two pairs of 
> submissions.

Exactly.

Oh, you haven't seen the worst of it yet. This is a wonderful mess that
just keeps on getting worse as you dig in.

> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_request.c | 56 +++++++++++++++++++----------
> >   1 file changed, 37 insertions(+), 19 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> > index f12358150097..4e861673fe5c 100644
> > --- a/drivers/gpu/drm/i915/i915_request.c
> > +++ b/drivers/gpu/drm/i915/i915_request.c
> > @@ -767,16 +767,35 @@ i915_request_create(struct intel_context *ce)
> >   static int
> >   i915_request_await_start(struct i915_request *rq, struct i915_request *signal)
> >   {
> > -     if (list_is_first(&signal->link, &signal->timeline->requests))
> > +     struct intel_timeline *tl = signal->timeline;
> > +     struct dma_fence *fence;
> > +     int err;
> > +
> > +     lockdep_assert_held(&rq->timeline->mutex);
> > +     GEM_BUG_ON(rq->timeline == signal->timeline);
> > +
> > +     if (list_is_first(&signal->link, &tl->requests))
> >               return 0;
> >   
> > -     signal = list_prev_entry(signal, link);
> > -     if (intel_timeline_sync_is_later(rq->timeline, &signal->fence))
> > +     if (!mutex_trylock(&tl->mutex))
> > +             return -EBUSY;
> > +
> > +     fence = NULL;
> > +     if (!list_is_first(&signal->link, &tl->requests))
> > +             fence = dma_fence_get(&list_prev_entry(signal, link)->fence);
> > +
> > +     mutex_unlock(&tl->mutex);
> > +     if (!fence)
> >               return 0;
> >   
> > -     return i915_sw_fence_await_dma_fence(&rq->submit,
> > -                                          &signal->fence, 0,
> > -                                          I915_FENCE_GFP);
> > +     err = 0;
> > +     if (!intel_timeline_sync_is_later(rq->timeline, fence))
> > +             err = i915_sw_fence_await_dma_fence(&rq->submit,
> > +                                                 fence, 0,
> > +                                                 I915_FENCE_GFP);
> > +     dma_fence_put(fence);
> > +
> > +     return err;
> >   }
> >   
> >   static intel_engine_mask_t
> > @@ -804,30 +823,24 @@ emit_semaphore_wait(struct i915_request *to,
> >   {
> >       u32 hwsp_offset;
> >       u32 *cs;
> > -     int err;
> >   
> >       GEM_BUG_ON(!from->timeline->has_initial_breadcrumb);
> >       GEM_BUG_ON(INTEL_GEN(to->i915) < 8);
> >   
> >       /* Just emit the first semaphore we see as request space is limited. */
> >       if (already_busywaiting(to) & from->engine->mask)
> > -             return i915_sw_fence_await_dma_fence(&to->submit,
> > -                                                  &from->fence, 0,
> > -                                                  I915_FENCE_GFP);
> > +             goto await_fence;
> >   
> > -     err = i915_request_await_start(to, from);
> > -     if (err < 0)
> > -             return err;
> > +     if (i915_request_await_start(to, from) < 0)
> > +             goto await_fence;
> 
> Does this need to be explicitly only on -EBUSY? Otherwise if 
> i915_sw_fence_await_dma_fence fails in i915_request_await_start code 
> jump to do the same i915_sw_fence_await_dma_fence.

The only one that concerned me is ignoring any potential EINTR. All the
other errors are transient and so trying again with the basic await is a
valid response (imo). Not bailing out due to a pending signal though is a
trade-off between our latency and their latency. To be honest, I like
the simpler code where we just pretend we never noticed the signal
unless we block again.
-Chris