[Intel-gfx] [PATCH v3 2/2] drm/i915: add syncobj timeline support

Thu Aug 1 09:01:44 UTC 2019

On 01/08/2019 11:08, Chris Wilson wrote:
> Quoting Lionel Landwerlin (2019-08-01 08:43:24)
>> On 31/07/2019 23:03, Chris Wilson wrote:
>>> Quoting Lionel Landwerlin (2019-07-31 15:07:33)
>>> I think I have convinced myself that with the split between wait before,
>>> signal after combined with the rule that seqno point along the syncobj
>>> are monotonic, you should not be able to generate an AB-BA deadlock
>>> between concurrent clients.
>>>
>>> Except that we need to fixup drm_syncobj_add_point() to actually apply
>>> that rule. Throwing an error and leaving the client stuck is less than
>>> ideal, we can't even recover with a GPU reset, and as they are native
>>> fences we don't employ a failsafe timer.
>>
>> Unfortunately we're not the only user of add_point() and amdgpu doesn't
>> want it to fail.
> It's dangerously exposing the deadlock from incorrect seqno ordering to
> userspace. We should be able to hit that DRM_ERROR() in testing, and be
> able to detect the out-of-sequence timeline.
>
>> Best I can come up with is take the syncobj lock when signaling in
>> get_timeline_fence_array() do the check there, unlock in __free_fence_array.
> I think the intermediate step is a "safe" version of add_point that
> doesn't leave the timeline broken. That still leaves a glitch visible to
> userspace, but it should not deadlock.

Right, sounds doable. What happens in execbuf after add_point() fails?

We've already handed the request to the scheduler and potentially it 
might be running already.

>
> The way I was looking at it was to reserve a placeholder syncpt before
> building the request and allow replacing the placeholder afterwards.

That sounds fairly tricky to get right :(

The fence stuff is already full of magic.

>
> If we abort the submission part way, we leave the placeholder in the
> timeline to be replaced later, or subsumed by a later syncpt.
>
>>> Doesn't WAIT_FOR_SUBMIT throw a spanner in the works and allow for
>>> deadlocks?
>>
>> Hm... I can't see how.
>>
>> Unless you mean clients could deadlock them themselves the same way it
>> would using 2 pthread_mutex and having 2 threads trying to lock both
>> mutexes in opposite order.
> Yes.
>   
>> Is it that the kernel's problem?
> It becomes a problem for us as it ties up kernel resources that we can
> not reap currently. Userspace is not meant to be able to break the
> kernel on a whim.  Even futexes are robust. ;)
> -Chris
>

Thanks,

-Lionel