Fence, timeline and android sync points

Christian König deathsimple at vodafone.de
Thu Aug 14 04:53:47 PDT 2014

> But because of driver differences I can't implement it as a straight wait queue. Some drivers may not have a reliable interrupt, so they need a custom wait function. (qxl)
> Some may need to do extra flushing to get fences signaled (vmwgfx), others need some locking to protect against gpu lockup races (radeon, i915??).  And nouveau
> doesn't use wait queues, but rolls its own (nouveau).
But when all those drivers need a special wait function how can you 
still justify the common callback when a fence is signaled?

If I understood it right, the use case for this was waiting for any fence
of a list of fences from multiple drivers, but if each driver needs
special handling for its wait, how can that work reliably?


On 14.08.2014 at 11:15, Maarten Lankhorst wrote:
> On 13-08-14 at 19:07, Jerome Glisse wrote:
>> On Wed, Aug 13, 2014 at 05:54:20PM +0200, Daniel Vetter wrote:
>>> On Wed, Aug 13, 2014 at 09:36:04AM -0400, Jerome Glisse wrote:
>>>> On Wed, Aug 13, 2014 at 10:28:22AM +0200, Daniel Vetter wrote:
>>>>> On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
>>>>>> Hi,
>>>>>> So i went over the whole fence and sync point stuff as it's becoming a pressing
>>>>>> issue. I think we first need to agree on what is the problem we want to solve
>>>>>> and what would be the requirements to solve it.
>>>>>> Problem :
>>>>>>    Explicit synchronization between different hardware blocks over a buffer object.
>>>>>> Requirements :
>>>>>>    Share common infrastructure.
>>>>>>    Allow optimal hardware command stream scheduling across hardware blocks.
>>>>>>    Allow android sync point to be implemented on top of it.
>>>>>>    Handle/acknowledge exception (like good old gpu lockup).
>>>>>>    Minimize driver changes.
>>>>>> Glossary :
>>>>>>    hardware timeline: timeline bound to a specific hardware block.
>>>>>>    pipeline timeline: timeline bound to a userspace rendering pipeline, each
>>>>>>                       point on that timeline can be a composite of several
>>>>>>                       different hardware pipeline point.
>>>>>>    pipeline: abstract object representing userspace application graphic pipeline
>>>>>>              of each of the application graphic operations.
>>>>>>    fence: specific point in a timeline where synchronization needs to happen.
>>>>>> So now, the current include/linux/fence.h implementation is, i believe, missing the
>>>>>> objective by confusing hardware and pipeline timelines and by bolting fences to
>>>>>> buffer objects, while what is really needed is a true and proper timeline for both
>>>>>> hardware and pipeline. But before going further down that road let me look at
>>>>>> things and explain how i see them.
>>>>> fences can be used free-standing and no one forces you to integrate them
>>>>> with buffers. We actually plan to go this way with the intel svm stuff.
>>>>> Ofc for dma-buf the plan is to synchronize using such fences, but that's
>>>>> somewhat orthogonal I think. At least you only talk about fences and
>>>>> timeline and not dma-buf here.
>>>>>> Current ttm fences have one sole purpose: to allow synchronization for buffer
>>>>>> object moves, even though some drivers like radeon slightly abuse them and use
>>>>>> them for things like lockup detection.
>>>>>> The new fence wants to expose an api that would allow some implementation of a
>>>>>> timeline. For that it introduces callbacks and some hard requirements on what the
>>>>>> driver has to expose :
>>>>>>    enable_signaling
>>>>>>    [signaled]
>>>>>>    wait
>>>>>> Each of those has to do work inside the driver to which the fence belongs and
>>>>>> each of those can be called from more or less unexpected contexts (with
>>>>>> restrictions like outside irq). So we end up with things like :
>>>>>>   Process 1              Process 2                   Process 3
>>>>>>   I_A_schedule(fence0)
>>>>>>                          CI_A_F_B_signaled(fence0)
>>>>>>                                                      I_A_signal(fence0)
>>>>>>                                                      CI_B_F_A_callback(fence0)
>>>>>>                          CI_A_F_B_wait(fence0)
>>>>>> Legend:
>>>>>> I_x  in driver x (I_A == in driver A)
>>>>>> CI_x_F_y call in driver x from driver y (CI_A_F_B call in driver A from driver B)
>>>>>> So this is a happy mess: everyone calls everyone, and this is bound to get messy.
>>>>>> Yes i know there are all kinds of requirements on what happens once a fence is
>>>>>> signaled. But those requirements only look like they are trying to atone for any
>>>>>> mess that can happen from the whole callback dance.
>>>>>> While i too was seduced by the whole callback idea a long time ago, i think it is
>>>>>> a highly dangerous path to take, where the combinations of what could happen
>>>>>> are bound to explode with the increase in the number of players.
>>>>>> So now back to how to solve the problem we are trying to address. First i want
>>>>>> to make an observation: almost all GPUs that exist today have a command ring
>>>>>> on which userspace command buffers are executed, and inside the command ring
>>>>>> you can do something like :
>>>>>>    if (condition) execute_command_buffer else skip_command_buffer
>>>>>> where condition is a simple expression (memory_address cop value) with cop one
>>>>>> of the generic comparisons (==, <, >, <=, >=). I think it is a safe assumption
>>>>>> that any gpu that slightly matters can do that. Those who can not should fix
>>>>>> their command ring processor.
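
The conditional execution described above can be sketched in plain C. This is
only a userspace illustration, not code from any driver: the names `enum cop`,
`cond_holds` and `ring_cond_exec` are hypothetical, and a real command ring
would evaluate the condition in hardware rather than on the cpu.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical comparison operators ("cop" in the text above). */
enum cop { COP_EQ, COP_LT, COP_GT, COP_LE, COP_GE };

/* Evaluate (*memory_address cop value), as a ring processor might. */
bool cond_holds(const volatile uint64_t *memory_address,
                enum cop op, uint64_t value)
{
    uint64_t cur = *memory_address;

    switch (op) {
    case COP_EQ: return cur == value;
    case COP_LT: return cur <  value;
    case COP_GT: return cur >  value;
    case COP_LE: return cur <= value;
    case COP_GE: return cur >= value;
    }
    return false;
}

/* Returns true if the command buffer was executed, false if it was skipped. */
bool ring_cond_exec(const volatile uint64_t *memory_address,
                    enum cop op, uint64_t value,
                    void (*command_buffer)(void))
{
    if (!cond_holds(memory_address, op, value))
        return false;          /* skip_command_buffer */
    if (command_buffer)
        command_buffer();      /* execute_command_buffer */
    return true;
}
```

With a sequence number stored at `memory_address`, a check such as
`ring_cond_exec(&seq, COP_GE, wanted, cmds)` runs `cmds` only once the other
block has reached `wanted`.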
>>>>>> With that in mind, i think the proper solution is implementing timelines and
>>>>>> having fence be a timeline object with a way simpler api. For each hardware
>>>>>> timeline the driver provides a system memory address at which the latest signaled
>>>>>> fence sequence number can be read. Each fence object is uniquely associated with
>>>>>> both a hardware and a pipeline timeline. Each pipeline timeline has a wait
>>>>>> queue.
>>>>>> When scheduling something that requires synchronization on a hardware timeline
>>>>>> a fence is created and associated with the pipeline timeline and hardware
>>>>>> timeline. Other hardware blocks that need to wait on a fence can use their
>>>>>> command ring conditional execution to directly check the fence sequence from
>>>>>> the other hw block, so you do optimistic scheduling. If optimistic scheduling
>>>>>> fails (which would be reported by a hw block specific solution and hidden) then
>>>>>> things can fall back to a software cpu wait inside what could be considered the
>>>>>> kernel thread of the pipeline timeline.
>>>>>> From an api point of view there is no inter-driver call. All the driver needs to
>>>>>> do is wake up the pipeline timeline wait_queue when things are signaled or
>>>>>> when things go sideways (gpu lockup).
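
As a rough illustration of the proposal above, a waiter could decide whether a
fence has signaled just by reading the sequence number the hardware timeline
exposes, with no call into the owning driver. The layout of `struct hw_timeline`
below, the `lockup` flag, and the names `tl_fence` and `tl_fence_status` are
assumptions made for this sketch; the proposal does not define them, and
sequence number wraparound is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hardware timeline: the driver only exposes where the
 * latest signaled sequence number can be read. */
struct hw_timeline {
    volatile uint64_t *signaled_seq;  /* system memory written by the hw block */
    bool lockup;                      /* assumed flag, set on gpu lockup */
};

/* Simplified fence: a point on one hardware timeline. */
struct tl_fence {
    struct hw_timeline *hw_timeline;
    uint64_t seq_num;
};

enum tl_status { TL_PENDING, TL_SIGNALED, TL_ERROR };

/* No driver callback is involved: the waiter reads the sequence number. */
enum tl_status tl_fence_status(const struct tl_fence *f)
{
    if (*f->hw_timeline->signaled_seq >= f->seq_num)
        return TL_SIGNALED;
    if (f->hw_timeline->lockup)
        return TL_ERROR;
    return TL_PENDING;
}
```

A pipeline timeline thread woken through its wait queue would re-run a check
like this for each fence it is waiting on.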
>>>>>> So how to implement that with current drivers ? Well, easy. Currently we assume
>>>>>> implicit synchronization so all we need is an implicit pipeline timeline per
>>>>>> userspace process (note this does not prevent inter process synchronization).
>>>>>> Every time a command buffer is submitted it is added to the implicit timeline
>>>>>> with the simple fence object :
>>>>>> struct fence {
>>>>>>    struct list_head   list_hwtimeline;
>>>>>>    struct list_head   list_pipetimeline;
>>>>>>    struct hw_timeline *hw_timeline;
>>>>>>    uint64_t           seq_num;
>>>>>>    work_t             timedout_work;
>>>>>>    void               *csdata;
>>>>>> };
>>>>>> So with a set of helper functions called by each of the driver command execution
>>>>>> ioctls you have the implicit timeline that is properly populated and each
>>>>>> driver command execution gets the dependency from the implicit timeline.
>>>>>> Of course to take full advantage of all the flexibility this could offer we
>>>>>> would need to allow userspace to create pipeline timelines and to schedule
>>>>>> against the pipeline timeline of their choice. We could create a file for
>>>>>> each of the pipeline timelines and have file operations to wait/query
>>>>>> progress.
>>>>>> Note that gpu lockups are considered exceptional events, the implicit
>>>>>> timeline will probably want to continue with other jobs on other hardware
>>>>>> blocks but the explicit one probably will want to decide whether to continue
>>>>>> or abort or retry without the faulty hw block.
>>>>>> I realize i am late to the party and that i should have taken a serious
>>>>>> look at all this a long time ago. I apologize for that and if you consider
>>>>>> this is too late then just ignore me, modulo the big warning about the craziness
>>>>>> that callbacks will introduce and how bad things are bound to happen. I am not
>>>>>> saying that bad things can not happen with what i propose, just that
>>>>>> because everything happens inside the process context that is the one
>>>>>> asking/requiring synchronization there will be no interprocess kernel
>>>>>> callback (a callback that was registered by one process and that is called
>>>>>> inside another process time slice because fence signaling is happening
>>>>>> inside this other process time slice).
>>>>> So I read through it all and presuming I understand it correctly your
>>>>> proposal and what we currently have is about the same. The big difference
>>>>> is that you make a timeline a first-class object and move the callback
>>>>> queue from the fence to the timeline, which requires callers to check the
>>>>> fence/seqno/whatever themselves instead of pushing that responsibility to
>>>>> callers.
>>>> No, the big difference is that there is no callback, thus when waiting for a
>>>> fence you are either inside the process context that needs to wait for it
>>>> or you're inside a kernel thread process context. Which means in both cases
>>>> you can do whatever you want. What i hate about the fence code as it is,
>>>> is the callback stuff, because you never know in which context fences
>>>> are signaled, so you never know in which context callbacks are executed.
>>> Look at waitqueues a bit closer. They're implemented with callbacks ;-)
>>> The only difference is that you're allowed to have spurious wakeups and
>>> need to handle that somehow, so need a separate check function.
>> No, this is not how wait queues are implemented, ie wait queues do not call back a
>> random function from a random driver, they call back a limited set of functions
>> from the core linux kernel scheduler so that the process thread that was waiting
>> and off the scheduler list is added back and marked as something that
>> should be scheduled. Unless this part of the kernel drastically changed for the
>> worse recently.
>> So this is fundamentally different, fences as they are now allow random driver
>> callbacks and this is bound to get ugly, this is bound to lead to one driver
>> doing something that seems innocuous but turns out to wreak havoc when called
>> from some other driver function.
> No, really, look closer.
> fence_default_wait adds a callback to fence_default_wait_cb, which wakes up the waiting thread if the fence gets signaled.
> The callback calls wake_up_state, which calls try_to_wake_up.
> default_wake_function, which is used by wait queues does something similar, it calls try_to_wake_up.
> Fence now has some additional checks, but originally it was implemented as a wait queue.
> But because of driver differences I can't implement it as a straight wait queue. Some drivers may not have a reliable interrupt, so they need a custom wait function. (qxl)
> Some may need to do extra flushing to get fences signaled (vmwgfx), others need some locking to protect against gpu lockup races (radeon, i915??).  And nouveau
> doesn't use wait queues, but rolls its own (nouveau).
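
The point about the default wait being a callback that merely wakes the waiter
can be modelled with a small single-threaded toy. Everything named `toy_*`
below is hypothetical and is not the kernel fence API. Setting a `woken` flag
stands in for waking the sleeping task, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_fence;
struct toy_cb;
typedef void (*toy_cb_func)(struct toy_fence *f, struct toy_cb *cb);

struct toy_cb {
    toy_cb_func func;
    struct toy_cb *next;
};

struct toy_fence {
    bool signaled;
    struct toy_cb *callbacks;   /* singly linked list of pending callbacks */
};

/* Waiter state; embeds the callback so the callback can find the waiter. */
struct toy_waiter {
    struct toy_cb base;         /* must stay first: cast back in the callback */
    bool woken;                 /* stands in for waking the sleeping task */
};

/* Returns false (and queues nothing) if the fence has already signaled. */
bool toy_fence_add_callback(struct toy_fence *f, struct toy_cb *cb,
                            toy_cb_func func)
{
    if (f->signaled)
        return false;
    cb->func = func;
    cb->next = f->callbacks;
    f->callbacks = cb;
    return true;
}

/* Mark the fence signaled and run every queued callback exactly once. */
void toy_fence_signal(struct toy_fence *f)
{
    struct toy_cb *cb = f->callbacks;

    f->signaled = true;
    f->callbacks = NULL;
    while (cb) {
        struct toy_cb *next = cb->next;

        cb->func(f, cb);
        cb = next;
    }
}

/* The default-wait callback only wakes the waiter, like a wait queue entry. */
void toy_default_wait_cb(struct toy_fence *f, struct toy_cb *cb)
{
    (void)f;
    ((struct toy_waiter *)cb)->woken = true;
}
```

The disagreement in the thread is about what else a driver may register through
the same callback slot, since nothing in this shape restricts `func` to a simple
wakeup.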
> Fences also don't imply implicit sync, you can use explicit sync if you want.
> I posted a patch for this, but if you want to create an android userspace fence, call
> struct sync_fence *sync_fence_create_dma(const char *name, struct fence *pt)
> I'll try to get the patch for this in 3.18 through the dma-buf tree, i915 wants to use it.
> ~Maarten
