[Intel-gfx] [PATCH v3 2/2] drm/i915: add syncobj timeline support

Chris Wilson chris at chris-wilson.co.uk
Thu Aug 1 10:18:12 UTC 2019


Quoting Lionel Landwerlin (2019-08-01 10:37:23)
> On 01/08/2019 12:22, Chris Wilson wrote:
> > Quoting Lionel Landwerlin (2019-08-01 10:01:44)
> >> On 01/08/2019 11:08, Chris Wilson wrote:
> >>> Quoting Lionel Landwerlin (2019-08-01 08:43:24)
> >>>> On 31/07/2019 23:03, Chris Wilson wrote:
> >>>>> Quoting Lionel Landwerlin (2019-07-31 15:07:33)
> >>>>> I think I have convinced myself that with the split between wait before,
> >>>>> signal after combined with the rule that seqno point along the syncobj
> >>>>> are monotonic, you should not be able to generate an AB-BA deadlock
> >>>>> between concurrent clients.
> >>>>>
> >>>>> Except that we need to fixup drm_syncobj_add_point() to actually apply
> >>>>> that rule. Throwing an error and leaving the client stuck is less than
> >>>>> ideal, we can't even recover with a GPU reset, and as they are native
> >>>>> fences we don't employ a failsafe timer.
> >>>> Unfortunately we're not the only user of add_point() and amdgpu doesn't
> >>>> want it to fail.
> >>> It's dangerously exposing the deadlock from incorrect seqno ordering to
> >>> userspace. We should be able to hit that DRM_ERROR() in testing, and be
> >>> able to detect the out-of-sequence timeline.
> >>>
> >>>> Best I can come up with is take the syncobj lock when signaling in
> >>>> get_timeline_fence_array() do the check there, unlock in __free_fence_array.
> >>> I think the intermediate step is a "safe" version of add_point that
> >>> doesn't leave the timeline broken. That still leaves a glitch visible to
> >>> userspace, but it should not deadlock.
> >>
> >> Right, sounds doable. What happens in execbuf after add_point() fails?
> >>
> >> We've already handed the request to the scheduler and potentially it
> >> might be running already.
> > Right, at present we've already submitted the request, so the batch will
> > be run and fence will be signaled. The failure to add to it the
> > timeline means that someone else already submitted a later fence and
> > some third party is using that syncpt instead of ours for their
> > dependency. So that third party will be delayed, but so long as they kept
> > their dependencies in order it should just be a delay.
> 
> 
> But should we return an error to userspace?

No, the submission itself hasn't failed and the result of the batch and
its fence will be true. It's a case where userspace lost a race with
itself -- if there was a non-error warning flag, maybe. If you keep the
language loose that a timeline semaphore wait will be after its syncpt has
been signaled, but not necessarily before the next syncpt is signaled,
unless explicitly ordered by the client.
-Chris


More information about the Intel-gfx mailing list