Question on UAPI for fences

Sat Sep 13 17:32:27 PDT 2014

BTW, we can recycle fences in userspace just like we recycle buffers.
That should make the create/close overhead non-existent.

Marek

On Sat, Sep 13, 2014 at 2:25 PM, Christian König
<christian.koenig at amd.com> wrote:
>> Doing such combining and cleaning up fds as soon as they have been passed
>> on should keep each application's fd usage fairly small.
>
> Yeah, but this is exactly what we wanted to avoid internally because of the
> IOCTL overhead.
>
> And thinking more about it for our driver internal use we will definitely
> hit some limitations with the number of FDs in use and the overhead for
> creating and closing them. With the execution model we target for the long
> term we will need something like 10k fences per second or more.
>
> How about this: We use an identifier per client for the fence internally and
> when we need to somehow expose it to somebody else export it as sync point
> fd. Very similar to how we currently have GEM handles internally and when we
> need to expose them export a DMA_buf fd.
>
> Regards,
> Christian.
>
> Am 12.09.2014 um 18:38 schrieb John Harrison:
>
>> On Fri, Sep 12, 2014 at 05:58:09PM +0200, Christian König wrote:
>> > pass in a list of fences to wait for before beginning a command
>> > submission.
>>
>> The Android implementation has a mechanism for combining multiple sync
>> points into a brand new single sync pt. Thus APIs only ever need to take in
>> a single fd not a list of them. If the user wants an operation to wait for
>> multiple events to occur then it is up to them to request the combined
>> version first. They can then happily close the individual fds that have been
>> combined and only keep the one big one around. Indeed, even that fd can be
>> closed once it has been passed on to some other API.
>>
>> Doing such combining and cleaning up fds as soon as they have been passed
>> on should keep each application's fd usage fairly small.
>>
>>
>> On 12/09/2014 17:08, Christian König wrote:
>>>>
>>>> As Daniel said using fd is most likely the way we want to do it but this
>>>> remains vague.
>>>
>>> Separating the discussion if it should be an fd or not. Using an fd
>>> sounds fine to me in general, but I have some concerns as well.
>>>
>>> For example what was the maximum number of opened FDs per process again?
>>> Could that become a problem? etc...
>>>
>>> Please comment,
>>> Christian.
>>>
>>> Am 12.09.2014 um 18:03 schrieb Jerome Glisse:
>>>>
>>>> On Fri, Sep 12, 2014 at 05:58:09PM +0200, Christian König wrote:
>>>>>
>>>>> Am 12.09.2014 um 17:48 schrieb Jerome Glisse:
>>>>>>
>>>>>> On Fri, Sep 12, 2014 at 05:42:57PM +0200, Christian König wrote:
>>>>>>>
>>>>>>> Am 12.09.2014 um 17:33 schrieb Jerome Glisse:
>>>>>>>>
>>>>>>>> On Fri, Sep 12, 2014 at 11:25:12AM -0400, Alex Deucher wrote:
>>>>>>>>>
>>>>>>>>> On Fri, Sep 12, 2014 at 10:50 AM, Jerome Glisse
>>>>>>>>> <j.glisse at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 12, 2014 at 04:43:44PM +0200, Daniel Vetter wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 12, 2014 at 4:09 PM, Daniel Vetter <daniel at ffwll.ch>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 12, 2014 at 03:23:22PM +0200, Christian König wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hello everyone,
>>>>>>>>>>>>>
>>>>>>>>>>>>> to allow concurrent buffer access by different engines beyond
>>>>>>>>>>>>> the multiple
>>>>>>>>>>>>> readers/single writer model that we currently use in radeon and
>>>>>>>>>>>>> other
>>>>>>>>>>>>> drivers we need some kind of synchonization object exposed to
>>>>>>>>>>>>> userspace.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My initial patch set for this used (or rather abused) zero
>>>>>>>>>>>>> sized GEM buffers
>>>>>>>>>>>>> as fence handles. This is obviously isn't the best way of doing
>>>>>>>>>>>>> this (to
>>>>>>>>>>>>> much overhead, rather ugly etc...), Jerome commented on this
>>>>>>>>>>>>> accordingly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So what should a driver expose instead? Android sync points?
>>>>>>>>>>>>> Something else?
>>>>>>>>>>>>
>>>>>>>>>>>> I think actually exposing the struct fence objects as a fd,
>>>>>>>>>>>> using android
>>>>>>>>>>>> syncpts (or at least something compatible to it) is the way to
>>>>>>>>>>>> go. Problem
>>>>>>>>>>>> is that it's super-hard to get the android guys out of hiding
>>>>>>>>>>>> for this :(
>>>>>>>>>>>>
>>>>>>>>>>>> Adding a bunch of people in the hopes that something sticks.
>>>>>>>>>>>
>>>>>>>>>>> More people.
>>>>>>>>>>
>>>>>>>>>> Just to re-iterate, exposing such thing while still using command
>>>>>>>>>> stream
>>>>>>>>>> ioctl that use implicit synchronization is a waste and you can
>>>>>>>>>> only get
>>>>>>>>>> the lowest common denominator which is implicit synchronization.
>>>>>>>>>> So i do
>>>>>>>>>> not see the point of such api if you are not also adding a new cs
>>>>>>>>>> ioctl
>>>>>>>>>> with explicit contract that it does not do any kind of
>>>>>>>>>> synchronization
>>>>>>>>>> (it could be almost the exact same code modulo the do not wait for
>>>>>>>>>> previous cmd to complete).
>>>>>>>>>
>>>>>>>>> Our thinking was to allow explicit sync from a single process, but
>>>>>>>>> implicitly sync between processes.
>>>>>>>>
>>>>>>>> This is a BIG NAK if you are using the same ioctl as it would mean
>>>>>>>> you are
>>>>>>>> changing userspace API, well at least userspace expectation. Adding
>>>>>>>> a new
>>>>>>>> cs flag might do the trick but it should not be about inter-process,
>>>>>>>> or any
>>>>>>>> thing special, it's just implicit sync or no synchronization.
>>>>>>>> Converting
>>>>>>>> userspace is not that much of a big deal either, it can be broken
>>>>>>>> into
>>>>>>>> several step. Like mesa use explicit synchronization all time but
>>>>>>>> ddx use
>>>>>>>> implicit.
>>>>>>>
>>>>>>> The thinking here is that we need to be backward compatible for
>>>>>>> DRI2/3 and
>>>>>>> support all kind of different use cases like old DDX and new Mesa, or
>>>>>>> old
>>>>>>> Mesa and new DDX etc...
>>>>>>>
>>>>>>> So for my prototype if the kernel sees any access of a BO from two
>>>>>>> different
>>>>>>> clients it falls back to the old behavior of implicit synchronization
>>>>>>> of
>>>>>>> access to the same buffer object. That might not be the fastest
>>>>>>> approach,
>>>>>>> but is as far as I can see conservative and so should work under all
>>>>>>> conditions.
>>>>>>>
>>>>>>> Apart from that the planning so far was that we just hide this
>>>>>>> feature
>>>>>>> behind a couple of command submission flags and new chunks.
>>>>>>
>>>>>> Just to reproduce IRC discussion, i think it's a lot simpler and not
>>>>>> that
>>>>>> complex. For explicit cs ioctl you do not wait for any previous fence
>>>>>> of
>>>>>> any of the buffer referenced in the cs ioctl, but you still associate
>>>>>> a
>>>>>> new fence with all the buffer object referenced in the cs ioctl. So if
>>>>>> the
>>>>>> next ioctl is an implicit sync ioctl it will wait properly and
>>>>>> synchronize
>>>>>> properly with previous explicit cs ioctl. Hence you can easily have a
>>>>>> mix
>>>>>> in userspace thing is you only get benefit once enough of your
>>>>>> userspace
>>>>>> is using explicit.
>>>>>
>>>>> Yes, that's exactly what my patches currently implement.
>>>>>
>>>>> The only difference is that by current planning I implemented it as a
>>>>> per BO
>>>>> flag for the command submission, but that was just for testing. Having
>>>>> a
>>>>> single flag to switch between implicit and explicit synchronization for
>>>>> whole CS IOCTL would do equally well.
>>>>
>>>> Doing it per BO sounds bogus to me. But otherwise yes we are in
>>>> agreement.
>>>> As Daniel said using fd is most likely the way we want to do it but this
>>>> remains vague.
>>>>
>>>>>> Note that you still need a way to have explicit cs ioctl to wait on a
>>>>>> previos "explicit" fence so you need some api to expose fence per cs
>>>>>> submission.
>>>>>
>>>>> Exactly, that's what this mail thread is all about.
>>>>>
>>>>> As Daniel correctly noted you need something like a functionality to
>>>>> get a
>>>>> fence as the result of a command submission as well as pass in a list
>>>>> of
>>>>> fences to wait for before beginning a command submission.
>>>>>
>>>>> At least it looks like we are all on the same general line here, its
>>>>> just
>>>>> nobody has a good idea how the details should look like.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Cheers,
>>>>>> Jérôme
>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Jérôme
>>>>>>>>
>>>>>>>>> Alex
>>>>>>>>>
>>>>>>>>>> Also one thing that the Android sync point does not have, AFAICT,
>>>>>>>>>> is a
>>>>>>>>>> way to schedule synchronization as part of a cs ioctl so cpu never
>>>>>>>>>> have
>>>>>>>>>> to be involve for cmd stream that deal only one gpu (assuming the
>>>>>>>>>> driver
>>>>>>>>>> and hw can do such trick).
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Jérôme
>>>>>>>>>>
>>>>>>>>>>> -Daniel
>>>>>>>>>>> --
>>>>>>>>>>> Daniel Vetter
>>>>>>>>>>> Software Engineer, Intel Corporation
>>>>>>>>>>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> dri-devel mailing list
>>>>>>>>>> dri-devel at lists.freedesktop.org
>>>>>>>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>
>>>
>
> _______________________________________________
> dri-devel mailing list
> dri-devel at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel