[Intel-gfx] [RFC] drm/i915: Add sync framework support to execbuff IOCTL

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Mon Jul 6 08:21:28 PDT 2015


On 07/06/2015 04:12 PM, Daniel Vetter wrote:
> On Mon, Jul 06, 2015 at 03:46:49PM +0100, Tvrtko Ursulin wrote:
>> On 07/06/2015 03:26 PM, John Harrison wrote:
>>> On 06/07/2015 14:59, Daniel Vetter wrote:
>>>> On Mon, Jul 06, 2015 at 01:58:25PM +0100, John Harrison wrote:
>>>>> On 06/07/2015 10:29, Daniel Vetter wrote:
>>>>>> On Fri, Jul 03, 2015 at 12:17:33PM +0100, Tvrtko Ursulin wrote:
>>>>>>> On 07/02/2015 04:55 PM, Chris Wilson wrote:
>>>>>>>> It would be nice if we could reuse one seqno both for
>>>>>>>> internal/external
>>>>>>>> fences. If you need to expose a fence ordering within a timeline
>>>>>>>> that is
>>>>>>>> based on the creation stamp rather than execution stamp, it seems
>>>>>>>> like
>>>>>>>> we could just add such a stamp when creating the sync_pt and not
>>>>>>>> worry
>>>>>>>> about its relationship to the execution seqno.
>>>>>>>>
>>>>>>>> Doing so does expose that requests are reordered to userspace
>>>>>>>> since the
>>>>>>>> signalling timeline is not the same as userspace's ordered
>>>>>>>> timeline. Not
>>>>>>>> sure if that is a problem or not.
>>>>>>>>
>>>>>>>> Afaict the sync uapi is based on waiting for all of a set of
>>>>>>>> fences to
>>>>>>>> retire. It doesn't seem to rely on fence ordering (that is knowing
>>>>>>>> that
>>>>>>>> fence A will signal before fence B so it need only wait on fence B).
>>>>>>>>
>>>>>>>> Here's hoping that we can have both simplicity and efficiency...
>>>>>>> Jumping in with a less than perfect understanding of everything here - but
>>>>>>> the timeline business has always confused me. There is nothing in
>>>>>>> the
>>>>>>> uapi which needs it afaics and iirc there was some discussion at
>>>>>>> the time
>>>>>>> Jesse floated his patches that it can be removed. Based on that when I
>>>>>>> squashed his patches and ported them on top of John's request to fence
>>>>>>> conversion it ended up something like the below (manually edited a
>>>>>>> bit to
>>>>>>> be less noisy and some prep patches omitted):
>>>>>>>
>>>>>>> This implements the ioctl based uapi and indeed seqnos are not
>>>>>>> actually
>>>>>>> used in waits. So is this insufficient for some reason? (Other than that it
>>>>>>> does not implement the input fence side of things.)
>>>>>> Yeah, Android syncpt on top of struct fence embedded in an i915 request is
>>>>>> what I'd have expected.
>>>>> The thing I'm not happy with in that plan is that it leaves the kernel
>>>>> driver at the mercy of user land applications. If we return a fence
>>>>> object
>>>>> to user land via a file descriptor (or indeed any other mechanism)
>>>>> then that
>>>>> fence object must be locked until user land closes the file. If the
>>>>> fence
>>>>> object is the one embedded within our request structure then that
>>>>> means user
>>>>> land is effectively locking our request structure. Given that more
>>>>> and more
>>>>> stuff is being attached to the request, that could be a fair bit of
>>>>> memory
>>>>> tied up that we can do nothing about. E.g. if a rogue/buggy application
>>>>> requests a fence be returned for every batch buffer submitted but never
>>>>> closes them. Whereas, if we go the route of a separate fence object
>>>>> specifically for user land then they can leak them like a sieve and
>>>>> we won't
>>>>> really care so much.
>>>> Userspace can exhaust kernel allocations, that's nothing new. And if we
>>>> keep it separate, userspace simply needs to leak a few more fence fds than
>>>> if there's a bit more data attached to it.
>>>>
>>>> The solution to this problem is to have a mem cgroup limit set. No
>>>> need to
>>>> complicate our kernel code.
>>>
>>> There is still the extra complication that request unreferencing cannot
>>> require any kind of mutex lock if we are allowing it to happen from
>>> outside of the driver. That means the unreference callback must move the
>>> request to a 'please clean me later' list, schedule a worker thread to
>>> run, and thus do the clean up asynchronously.
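
(For illustration, a rough sketch of the kind of deferred clean up being
described - none of these names exist in the driver today, it is just to
make the idea concrete:)

/* Hypothetical fence release callback: may run from fput() without
 * struct_mutex held, so the request is only queued for later clean up.
 * The free_link/free_list/free_work members, the embedded 'fence' member
 * name and i915_gem_request_destroy() are illustrative, not existing code. */
static void i915_fence_release(struct fence *fence)
{
	struct drm_i915_gem_request *req =
		container_of(fence, struct drm_i915_gem_request, fence);
	struct drm_i915_private *dev_priv = req->ring->dev->dev_private;

	llist_add(&req->free_link, &dev_priv->mm.free_list);
	schedule_work(&dev_priv->mm.free_work);
}

static void i915_request_free_work(struct work_struct *work)
{
	struct drm_i915_private *dev_priv =
		container_of(work, struct drm_i915_private, mm.free_work);
	struct drm_device *dev = dev_priv->dev;
	struct llist_node *node;

	node = llist_del_all(&dev_priv->mm.free_list);
	while (node) {
		struct drm_i915_gem_request *req =
			container_of(node, struct drm_i915_gem_request, free_link);

		node = node->next;

		/* Clean up can now take struct_mutex safely. */
		mutex_lock(&dev->struct_mutex);
		i915_gem_request_destroy(req);	/* hypothetical helper */
		mutex_unlock(&dev->struct_mutex);
	}
}
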
>>
>> For this particular issue my solution was to extend the sync_fence
>> constructor to take a mutex and store it inside the object. Then at
>> destruction time, which happens at sync_fd->f_ops->release() time, it is
>> just a matter of calling kref_put_mutex instead of kref_put.
>>
>> Seemed to work under some quick testing but that is as much as I did back
>> then.
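
(To make that concrete, roughly what it looked like - a sketch only; the
'lock' member of sync_fence and the i915-side helpers below are additions
which do not exist upstream:)

/* The constructor is extended to remember which mutex protects the final
 * unreference of the underlying request. 'lock' is a new, hypothetical
 * member of struct sync_fence. */
static struct sync_fence *
i915_create_user_fence(struct drm_i915_gem_request *req, struct sync_pt *pt)
{
	struct sync_fence *fence;

	fence = sync_fence_create("i915", pt);
	if (fence) {
		i915_gem_request_reference(req);
		fence->lock = &req->ring->dev->struct_mutex;
	}

	return fence;
}

/* Called on the sync_fd->f_ops->release() path when userspace closes the
 * fd. kref_put_mutex() only takes the stored mutex if this really is the
 * last reference, so the request free path keeps its locking rules. */
static void i915_put_user_fence_request(struct sync_fence *fence,
					struct drm_i915_gem_request *req)
{
	kref_put_mutex(&req->ref, i915_gem_request_free, fence->lock);
}
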
>
> The problem is that it doesn't scale since everyone wants some other kind
> of mutex to serialize the final kref_put. If something is supposed to be
> cross-subsystem/driver (which is the case for fences) then we really can't
> do that kind of leaky locking design. Imo we should have a 'kref_put_mutex
> considered harmful' sign somewhere ...

I get the argument that it does not scale once everyone wants to add their 
own mutex, but I don't see how that ties in with the "leaky" comment? Also, 
a mutex is a pretty standard thing, especially since kref_put_mutex 
exists. :D If you look at it from that angle it kind of just exposes to the 
super class what the base class can already do.

> If you have weak references somewhere and need to prevent the object from
> disappearing untimely while chasing that weak reference then imo the
> better design pattern is to use kref_get_unless_zero. If you need the
> serialization the mutex provides for some other reason (someone is only
> holding the mutex instead of grabbing a proper reference when they really
> should grab one) then your refcounting scheme probably needs another kind
> of fixup patch.

I don't see how weak references can work, since once the request is gone 
the information is lost, unless it is stored somewhere else.
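
For reference, the pattern being suggested would look something like the 
below (purely illustrative - the request_lock spinlock is made up, the 
request_list is normally under struct_mutex), but it of course relies on 
the lookup structure (the "somewhere else" above) outliving the request:

/* Chasing a weak reference and promoting it to a real one with
 * kref_get_unless_zero(). If the refcount has already dropped to zero the
 * request is in the middle of being freed and must not be resurrected. */
static struct drm_i915_gem_request *
lookup_request(struct intel_engine_cs *ring, u32 seqno)
{
	struct drm_i915_gem_request *req, *found = NULL;

	spin_lock(&ring->request_lock);
	list_for_each_entry(req, &ring->request_list, list) {
		if (req->seqno != seqno)
			continue;

		if (kref_get_unless_zero(&req->ref))
			found = req;
		break;
	}
	spin_unlock(&ring->request_lock);

	return found;
}
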

Regards,

Tvrtko
