[Intel-gfx] [PATCH v3 2/2] drm/i915: add syncobj timeline support

Thu Aug 1 09:22:13 UTC 2019

Quoting Lionel Landwerlin (2019-08-01 10:01:44)
> On 01/08/2019 11:08, Chris Wilson wrote:
> > Quoting Lionel Landwerlin (2019-08-01 08:43:24)
> >> On 31/07/2019 23:03, Chris Wilson wrote:
> >>> Quoting Lionel Landwerlin (2019-07-31 15:07:33)
> >>> I think I have convinced myself that with the split between wait before,
> >>> signal after combined with the rule that seqno point along the syncobj
> >>> are monotonic, you should not be able to generate an AB-BA deadlock
> >>> between concurrent clients.
> >>>
> >>> Except that we need to fixup drm_syncobj_add_point() to actually apply
> >>> that rule. Throwing an error and leaving the client stuck is less than
> >>> ideal, we can't even recover with a GPU reset, and as they are native
> >>> fences we don't employ a failsafe timer.
> >>
> >> Unfortunately we're not the only user of add_point() and amdgpu doesn't
> >> want it to fail.
> > It's dangerously exposing the deadlock from incorrect seqno ordering to
> > userspace. We should be able to hit that DRM_ERROR() in testing, and be
> > able to detect the out-of-sequence timeline.
> >
> >> Best I can come up with is take the syncobj lock when signaling in
> >> get_timeline_fence_array() do the check there, unlock in __free_fence_array.
> > I think the intermediate step is a "safe" version of add_point that
> > doesn't leave the timeline broken. That still leaves a glitch visible to
> > userspace, but it should not deadlock.
> 
> 
> Right, sounds doable. What happens in execbuf after add_point() fails?
> 
> We've already handed the request to the scheduler and potentially it 
> might be running already.

Right, at present we've already submitted the request, so the batch will
be run and fence will be signaled. The failure to add to it the
timeline means that someone else already submitted a later fence and
some third party is using that syncpt instead of ours for their
dependency. So that third party will be delayed, but so long as they kept
their dependencies in order it should just be a delay.

The problem comes into play if we go ahead and insert our fence into the
dma_fence_chain out of order, breaking all the monotonic guarantees and
flipping the order of the fences. (The equivalent to taking mutexes out
of order.)

> > The way I was looking at it was to reserve a placeholder syncpt before
> > building the request and allow replacing the placeholder afterwards.
> 
> 
> That sounds fairly tricky to get right :(

Invasive, yeah, turns out one needs to start walking the fence chain
more carefully. The actual placeholder fence is pretty run of the mill
as far as dma-fence*.c goes, which is say...

> The fence stuff is already full of magic.

It's full of magic.
-Chris