Fence, timeline and android sync points

Jerome Glisse j.glisse at gmail.com
Tue Aug 12 18:23:54 PDT 2014


On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
> Hi,
> 
> So I went over the whole fence and sync point stuff as it's becoming a pressing
> issue. I think we first need to agree on what problem we want to solve and
> what the requirements are to solve it.
> 
> Problem :
>   Explicit synchronization between different hardware blocks over a buffer object.
> 
> Requirements :
>   Share common infrastructure.
>   Allow optimal hardware command stream scheduling across hardware blocks.
>   Allow android sync points to be implemented on top of it.
>   Handle/acknowledge exceptions (like the good old gpu lockup).
>   Minimize driver changes.
> 
> Glossary :
>   hardware timeline: timeline bound to a specific hardware block.
>   pipeline timeline: timeline bound to a userspace rendering pipeline; each
>                      point on that timeline can be a composite of several
>                      different hardware timeline points.
>   pipeline: abstract object representing a userspace application's graphics
>             pipeline, i.e. each of the application's graphics operations.
>   fence: specific point in a timeline where synchronization needs to happen.
> 
> 
> So now, the current include/linux/fence.h implementation is, I believe, missing
> the objective by confusing hardware and pipeline timelines and by bolting fences
> to buffer objects, while what is really needed is a true and proper timeline for
> both hardware and pipeline. But before going further down that road let me look
> at things and explain how I see them.
> 
> The current ttm fence has one and only one purpose: allow synchronization for
> buffer object moves, even though some drivers like radeon slightly abuse it and
> use it for things like lockup detection.
> 
> The new fence wants to expose an API that would allow some implementation of a
> timeline. For that it introduces callbacks and some hard requirements on what
> the driver has to expose:
>   enable_signaling
>   [signaled]
>   wait
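> 
> (For reference, roughly paraphrasing those three callbacks from struct
> fence_ops in the current include/linux/fence.h; this is from memory, so check
> the header for the exact signatures:)
> 
>   /* called when someone starts waiting on the fence; the driver is expected
>    * to hook up whatever irq/delivery mechanism will signal it */
>   bool (*enable_signaling)(struct fence *fence);
>   /* optional: peek whether the fence has already signaled */
>   bool (*signaled)(struct fence *fence);
>   /* blocking wait, returns remaining timeout or an error */
>   signed long (*wait)(struct fence *fence, bool intr, signed long timeout);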
> 
> Each of those has to do work inside the driver to which the fence belongs, and
> each of those can be called from more or less unexpected contexts (with
> restrictions, like outside irq context). So we end up with things like:
> 
>  Process 1              Process 2                   Process 3
>  I_A_schedule(fence0)
>                         CI_A_F_B_signaled(fence0)
>                                                     I_A_signal(fence0)
>                                                     CI_B_F_A_callback(fence0)
>                         CI_A_F_B_wait(fence0)
> Legend:
> I_x      in driver x (I_A == in driver A)
> CI_x_F_y call in driver x from driver y (CI_A_F_B == call in driver A from driver B)
> 
> So this is a happy mess: everyone calls everyone, and this is bound to get
> messy. Yes, I know there are all kinds of requirements on what happens once a
> fence is signaled. But those requirements only look like they are trying to
> paper over the mess that can happen from the whole callback dance.
> 
> While I too was seduced by the whole callback idea a long time ago, I think it
> is a highly dangerous path to take where the combinatorics of what could happen
> are bound to explode with the increase in the number of players.
> 
> 
> So now back to how to solve the problem we are trying to address. First I want
> to make an observation: almost all GPUs that exist today have a command ring
> onto which userspace command buffers are executed, and inside the command ring
> you can do something like:
> 
>   if (condition) execute_command_buffer else skip_command_buffer
> 
> where condition is a simple expression (memory_address cop value) with cop one
> of the generic comparisons (==, <, >, <=, >=). I think it is a safe assumption
> that any gpu that slightly matters can do that. Those which can not should fix
> their command ring processor.
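> 
> To make that concrete, here is a rough sketch of what emitting such a check
> could look like from the driver side. The struct hw_ring type, the ring_emit()
> helper, the OP_COND_EXEC/COND_GE names and the packet layout are all made up
> (every piece of hardware encodes this differently); it only illustrates the
> (memory_address cop value) test:
> 
> static void hw_emit_wait_seq(struct hw_ring *ring, uint64_t seq_gpu_addr,
>                              uint64_t seq, uint32_t cs_ndwords)
> {
>     ring_emit(ring, OP_COND_EXEC);                /* made-up opcode */
>     ring_emit(ring, lower_32_bits(seq_gpu_addr)); /* address of the seq value */
>     ring_emit(ring, upper_32_bits(seq_gpu_addr));
>     ring_emit(ring, lower_32_bits(seq));          /* value to compare against */
>     ring_emit(ring, upper_32_bits(seq));
>     ring_emit(ring, COND_GE);                     /* cop: >= */
>     ring_emit(ring, cs_ndwords);                  /* dwords to skip on failure */
> }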
> 
> 
> With that in mind, I think the proper solution is implementing timelines and
> having the fence be a timeline object with a way simpler API. For each hardware
> timeline the driver provides a system memory address at which the latest
> signaled fence sequence number can be read. Each fence object is uniquely
> associated with both a hardware and a pipeline timeline. Each pipeline timeline
> has a wait queue.
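> 
> In terms of data structures, here is a rough sketch of the two timeline objects
> I have in mind (the names and fields are placeholders, not a final proposal):
> 
> struct hw_timeline {
>     volatile uint64_t *signaled_seq;    /* cpu address where the hw block
>                                            writes the latest signaled seq */
>     uint64_t          signaled_seq_gpu; /* same location as seen by the gpu,
>                                            for the conditional execution */
>     struct list_head  fences;           /* pending fences, ordered by seq */
>     spinlock_t        lock;
> };
> 
> struct pipeline_timeline {
>     struct list_head  fences;           /* fences of this pipeline, in order */
>     wait_queue_head_t wait_queue;       /* woken on signal or lockup */
>     spinlock_t        lock;
> };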
> 
> When scheduling something that requires synchronization on a hardware timeline,
> a fence is created and associated with both the pipeline timeline and the
> hardware timeline. Other hardware blocks that need to wait on a fence can use
> their command ring conditional execution to directly check the fence sequence
> number from the other hw block, so you get optimistic scheduling. If optimistic
> scheduling fails (which would be reported by a hw block specific solution and
> hidden) then things can fall back to a software cpu wait inside what could be
> considered the kernel thread of the pipeline timeline.
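> 
> The cpu fallback is then just a plain sleep on the pipeline timeline wait queue
> until the hw block has written a big enough sequence number. A minimal sketch,
> reusing the placeholder fields from the sketch above and the fence object
> described further down:
> 
> static long pipeline_fence_cpu_wait(struct pipeline_timeline *pt,
>                                     struct fence *fence,
>                                     unsigned long timeout)
> {
>     /* woken up by timeline_signal() or by the lockup handling */
>     return wait_event_interruptible_timeout(pt->wait_queue,
>                 *fence->hw_timeline->signaled_seq >= fence->seq_num,
>                 timeout);
> }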
> 
> 
> From an API point of view there is no inter-driver call. All a driver needs to
> do is wake up the pipeline timeline wait_queue when things are signaled or
> when things go sideways (gpu lockup).
> 
> 
> So how to implement that with current drivers? Well, easy. Currently we assume
> implicit synchronization, so all we need is an implicit pipeline timeline per
> userspace process (note this does not prevent inter-process synchronization).
> Every time a command buffer is submitted it is added to the implicit timeline
> with the simple fence object:
> 
> struct fence {
>   struct list_head    list_hwtimeline;    /* link into hw_timeline fence list */
>   struct list_head    list_pipetimeline;  /* link into pipeline timeline list */
>   struct hw_timeline  *hw_timeline;       /* hw block this fence signals on */
>   uint64_t            seq_num;            /* sequence number on hw_timeline */
>   struct delayed_work timedout_work;      /* missed-schedule fallback work */
>   void                *csdata;            /* cs copy to run once dependencies
>                                              are signaled */
> };
> 
> So with a set of helper functions called by each driver's command execution
> ioctl, you have the implicit timeline that is properly populated and each
> driver's command execution gets its dependencies from the implicit timeline.
> 
> 
> Of course, to take full advantage of all the flexibility this could offer, we
> would need to allow userspace to create pipeline timelines and to schedule
> against the pipeline timeline of their choice. We could create a file for each
> pipeline timeline and have file operations to wait on it and query its
> progress.
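> 
> For instance, something along these lines (all the names here are hypothetical
> and only the poll side is sketched; the ioctl for explicit wait/query is left
> out):
> 
> static unsigned int pipeline_timeline_poll(struct file *filp, poll_table *wait)
> {
>     struct pipeline_timeline *pt = filp->private_data;
> 
>     poll_wait(filp, &pt->wait_queue, wait);
>     /* readable once every fence scheduled so far has signaled */
>     return pipeline_timeline_all_signaled(pt) ? POLLIN : 0;
> }
> 
> static const struct file_operations pipeline_timeline_fops = {
>     .owner          = THIS_MODULE,
>     .poll           = pipeline_timeline_poll,
>     .unlocked_ioctl = pipeline_timeline_ioctl,   /* wait / query progress */
>     .release        = pipeline_timeline_release,
> };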
> 
> Note that gpu lockups are considered exceptional events; the implicit timeline
> will probably want to continue with other jobs on other hardware blocks, but
> the explicit one will probably want to decide whether to continue, abort or
> retry without the faulty hw block.
> 
> 
> I realize I am late to the party and that I should have taken a serious look
> at all this a long time ago. I apologize for that, and if you consider this is
> too late then just ignore me, modulo the big warning about the craziness that
> callbacks will introduce and how bad things are bound to happen. I am not
> saying that bad things can not happen with what I propose, just that because
> everything happens inside the context of the process that is asking/requiring
> the synchronization, there will be no inter-process kernel callback (a callback
> that was registered by one process and that is called inside another process's
> time slice because the fence signaling happens inside that other process's
> time slice).
> 
> 
> Pseudo code for explicitness:
> 
> int drm_cs_ioctl_wrapper(struct drm_device *dev, void *data, struct file *filp)
> {
>    struct fence *dependency[16], *fence;
>    int m;
> 
>    m = timeline_schedule(filp->implicit_pipeline, dev->hw_timeline,
>                          default_timeout, dependency, 16, &fence);
>    if (m < 0)
>      return m;
>    if (m >= 16) {
>        // dependency array too small: allocate room for m and call again
>    }
>    return dev->cs_ioctl(dev, data, filp, filp->implicit_pipeline,
>                         dependency, fence);
> }
> 
> int timeline_schedule(ptimeline, hwtimeline, timeout,
>                        dependency, mdep, **fence)
> {
>    // allocate the fence, set its hw_timeline and init its timedout work
>    // build up the dependency array by looking at the list of pending fences
>    // in the pipeline timeline (up to mdep of them)
>    // return the number of dependencies, or a negative error code
> }
> 
> 
> 
> // If the device driver schedules a job hoping for all dependencies to be
> // signaled in time (optimistic scheduling), then it must also call this
> // function with csdata being a copy of what needs to be executed once all
> // dependencies are signaled.
> void timeline_missed_schedule(timeline, fence, void *csdata)
> {
>    INIT_DELAYED_WORK(&fence->timedout_work, timeline_missed_schedule_worker);
>    fence->csdata = csdata;
>    schedule_delayed_work(&fence->timedout_work, default_timeout);
> }
> 
> void timeline_missed_schedule_worker(work)
> {
>    fence = container_of(work, struct fence, timedout_work);
>    driver = driver_from_fence_hwtimeline(fence);
> 
>    // Make sure that each of the hwtimeline dependencies will fire an irq, by
>    // calling a driver function, and wait for them to be signaled.
>    timeline_wait_for_fence_dependency(fence);
>    driver->execute_cs(driver, fence);
> }
> 
> // This function is called by the driver code that signals fences (it could be
> // called from interrupt context). It is the responsibility of the device
> // driver to call this function.
> void timeline_signal(hwtimeline)
> {
>   // wake every pipeline timeline that has a fence on this hw timeline
>   for_each_fence(fence, hwtimeline->fences, list_hwtimeline) {
>     wakeup(fence->pipetimeline->wait_queue);
>   }
> }

Btw, as an extra note: because of the implicit timeline, any shared object
scheduled on a hw timeline must add a fence to all the implicit timelines where
this object exists.
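
Roughly, at submission time it would look something like this (the object and
helper names are made up, just to illustrate the point):

static void shared_object_add_fence(struct shared_object *obj,
                                    struct fence *fence)
{
    struct pipeline_timeline *pt;

    /* the object keeps track of the implicit timeline of every process that
     * has it mapped; queue the fence on each so later submissions from any
     * of those processes pick it up as a dependency */
    list_for_each_entry(pt, &obj->implicit_timelines, node)
        pipeline_timeline_add_fence(pt, fence);
}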

Also there is no need to have a fence pointer per object.

> 
> 
> Cheers,
> Jérôme

