Fence, timeline and android sync points

Wed Aug 13 08:54:20 PDT 2014

On Wed, Aug 13, 2014 at 09:36:04AM -0400, Jerome Glisse wrote:
> On Wed, Aug 13, 2014 at 10:28:22AM +0200, Daniel Vetter wrote:
> > On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
> > > Hi,
> > > 
> > > So i want over the whole fence and sync point stuff as it's becoming a pressing
> > > issue. I think we first need to agree on what is the problem we want to solve
> > > and what would be the requirements to solve it.
> > > 
> > > Problem :
> > >   Explicit synchronization btw different hardware block over a buffer object.
> > > 
> > > Requirements :
> > >   Share common infrastructure.
> > >   Allow optimal hardware command stream scheduling accross hardware block.
> > >   Allow android sync point to be implemented on top of it.
> > >   Handle/acknowledge exception (like good old gpu lockup).
> > >   Minimize driver changes.
> > > 
> > > Glossary :
> > >   hardware timeline: timeline bound to a specific hardware block.
> > >   pipeline timeline: timeline bound to a userspace rendering pipeline, each
> > >                      point on that timeline can be a composite of several
> > >                      different hardware pipeline point.
> > >   pipeline: abstract object representing userspace application graphic pipeline
> > >             of each of the application graphic operations.
> > >   fence: specific point in a timeline where synchronization needs to happen.
> > > 
> > > 
> > > So now, current include/linux/fence.h implementation is i believe missing the
> > > objective by confusing hardware and pipeline timeline and by bolting fence to
> > > buffer object while what is really needed is true and proper timeline for both
> > > hardware and pipeline. But before going further down that road let me look at
> > > things and explain how i see them.
> > 
> > fences can be used free-standing and no one forces you to integrate them
> > with buffers. We actually plan to go this way with the intel svm stuff.
> > Ofc for dma-buf the plan is to synchronize using such fences, but that's
> > somewhat orthogonal I think. At least you only talk about fences and
> > timeline and not dma-buf here.
> >  
> > > Current ttm fence have one and a sole purpose, allow synchronization for buffer
> > > object move even thought some driver like radeon slightly abuse it and use them
> > > for things like lockup detection.
> > > 
> > > The new fence want to expose an api that would allow some implementation of a
> > > timeline. For that it introduces callback and some hard requirement on what the
> > > driver have to expose :
> > >   enable_signaling
> > >   [signaled]
> > >   wait
> > > 
> > > Each of those have to do work inside the driver to which the fence belongs and
> > > each of those can be call more or less from unexpected (with restriction like
> > > outside irq) context. So we end up with thing like :
> > > 
> > >  Process 1              Process 2                   Process 3
> > >  I_A_schedule(fence0)
> > >                         CI_A_F_B_signaled(fence0)
> > >                                                     I_A_signal(fence0)
> > >                                                     CI_B_F_A_callback(fence0)
> > >                         CI_A_F_B_wait(fence0)
> > > Lexique:
> > > I_x  in driver x (I_A == in driver A)
> > > CI_x_F_y call in driver X from driver Y (CI_A_F_B call in driver A from driver B)
> > > 
> > > So this is an happy mess everyone call everyone and this bound to get messy.
> > > Yes i know there is all kind of requirement on what happen once a fence is
> > > signaled. But those requirement only looks like they are trying to atone any
> > > mess that can happen from the whole callback dance.
> > > 
> > > While i was too seduced by the whole callback idea long time ago, i think it is
> > > a highly dangerous path to take where the combinatorial of what could happen
> > > are bound to explode with the increase in the number of players.
> > > 
> > > 
> > > So now back to how to solve the problem we are trying to address. First i want
> > > to make an observation, almost all GPU that exist today have a command ring
> > > on to which userspace command buffer are executed and inside the command ring
> > > you can do something like :
> > > 
> > >   if (condition) execute_command_buffer else skip_command_buffer
> > > 
> > > where condition is a simple expression (memory_address cop value)) with cop one
> > > of the generic comparison (==, <, >, <=, >=). I think it is a safe assumption
> > > that any gpu that slightly matter can do that. Those who can not should fix
> > > there command ring processor.
> > > 
> > > 
> > > With that in mind, i think proper solution is implementing timeline and having
> > > fence be a timeline object with a way simpler api. For each hardware timeline
> > > driver provide a system memory address at which the lastest signaled fence
> > > sequence number can be read. Each fence object is uniquely associated with
> > > both a hardware and a pipeline timeline. Each pipeline timeline have a wait
> > > queue.
> > > 
> > > When scheduling something that require synchronization on a hardware timeline
> > > a fence is created and associated with the pipeline timeline and hardware
> > > timeline. Other hardware block that need to wait on a fence can use there
> > > command ring conditional execution to directly check the fence sequence from
> > > the other hw block so you do optimistic scheduling. If optimistic scheduling
> > > fails (which would be reported by hw block specific solution and hidden) then
> > > things can fallback to software cpu wait inside what could be considered the
> > > kernel thread of the pipeline timeline.
> > > 
> > > 
> > > From api point of view there is no inter-driver call. All the driver needs to
> > > do is wakeup the pipeline timeline wait_queue when things are signaled or
> > > when things go sideway (gpu lockup).
> > > 
> > > 
> > > So how to implement that with current driver ? Well easy. Currently we assume
> > > implicit synchronization so all we need is an implicit pipeline timeline per
> > > userspace process (note this do not prevent inter process synchronization).
> > > Everytime a command buffer is submitted it is added to the implicit timeline
> > > with the simple fence object :
> > > 
> > > struct fence {
> > >   struct list_head   list_hwtimeline;
> > >   struct list_head   list_pipetimeline;
> > >   struct hw_timeline *hw_timeline;
> > >   uint64_t           seq_num;
> > >   work_t             timedout_work;
> > >   void               *csdata;
> > > };
> > > 
> > > So with set of helper function call by each of the driver command execution
> > > ioctl you have the implicit timeline that is properly populated and each
> > > dirver command execution get the dependency from the implicit timeline.
> > > 
> > > 
> > > Of course to take full advantages of all flexibilities this could offer we
> > > would need to allow userspace to create pipeline timeline and to schedule
> > > against the pipeline timeline of there choice. We could create file for
> > > each of the pipeline timeline and have file operation to wait/query
> > > progress.
> > > 
> > > Note that the gpu lockup are considered exceptional event, the implicit
> > > timeline will probably want to continue on other job on other hardware
> > > block but the explicit one probably will want to decide wether to continue
> > > or abort or retry without the fault hw block.
> > > 
> > > 
> > > I realize i am late to the party and that i should have taken a serious
> > > look at all this long time ago. I apologize for that and if you consider
> > > this is to late then just ignore me modulo the big warning the crazyness
> > > that callback will introduce an how bad things bound to happen. I am not
> > > saying that bad things can not happen with what i propose just that
> > > because everything happen inside the process context that is the one
> > > asking/requiring synchronization there will be not interprocess kernel
> > > callback (a callback that was registered by one process and that is call
> > > inside another process time slice because fence signaling is happening
> > > inside this other process time slice).
> > 
> > So I read through it all and presuming I understand it correctly your
> > proposal and what we currently have is about the same. The big difference
> > is that you make a timeline a first-class object and move the callback
> > queue from the fence to the timeline, which requires callers to check the
> > fence/seqno/whatever themselves instead of pushing that responsibility to
> > callers.
> 
> No, big difference is that there is no callback thus when waiting for a
> fence you are either inside the process context that need to wait for it
> or your inside a kernel thread process context. Which means in both case
> you can do whatever you want. What i hate about the fence code as it is,
> is the callback stuff, because you never know into which context fence
> are signaled then you never know into which context callback are executed.

Look at waitqueues a bit closer. They're implemented with callbacks ;-)
The only difference is that you're allowed to have spurious wakeups and
need to handle that somehow, so need a separate check function.

> > If you actually mandate that the fence is just a seqno or similar which
> > can be read lockless then I could register my own special callback into
> > that waitqueue (other stuff than waking up threads is allowed) and from
> > hard-irq context check the seqno and readd my own callback if that's not
> > yet happened (that needs to go through some other context for hilarity).
> 
> Yes mandating a simple number that can be read from anywhere without lock,
> i am pretty sure all hw can write to system page and can write a value
> alongside their command buffer. So either you hw support reading and testing
> value either you can do it in atomic context right before scheduling.

Imo that's a step in the wrong direction since reading a bit of system
memory or checking a bit of irq-safe spinlock protected data or a register
shouldn't matter. You just arbitrarily disallow that. And allowing random
other kernel subsystems to read mmio or page mappings not under their
control is an idea that freaks /me/ out.

> > So from that pov (presuming I didn't miss anything) your proposal is
> > identical to what we have, minor some different color choices (like where
> > to place the callback queue).
> 
> No callback is the mantra here, and instead of bolting free living fence
> to buffer object, they are associated with timeline which means you do not
> need to go over all buffer object to know what you need to wait for.

Ok, then I guess I didn't understand that part of your the proposal. Can
you please elaborate a bit more how you want to synchronize mulitple
drivers accessing a dma-buf object and what piece of state we need to
associate to the dma-buf to make this happen?

Thanks, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch