Fence, timeline and android sync points
Jerome Glisse
j.glisse at gmail.com
Tue Aug 12 18:23:54 PDT 2014
On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
> Hi,
>
> So I went over the whole fence and sync point stuff, as it's becoming a pressing
> issue. I think we first need to agree on what problem we want to solve
> and what the requirements to solve it would be.
>
> Problem :
> Explicit synchronization between different hardware blocks over a buffer object.
>
> Requirements :
> Share common infrastructure.
> Allow optimal hardware command stream scheduling across hardware blocks.
> Allow Android sync points to be implemented on top of it.
> Handle/acknowledge exceptions (like a good old GPU lockup).
> Minimize driver changes.
>
> Glossary :
> hardware timeline: timeline bound to a specific hardware block.
> pipeline timeline: timeline bound to a userspace rendering pipeline; each
>                    point on that timeline can be a composite of several
>                    different hardware pipeline points.
> pipeline: abstract object representing the userspace application's graphics
>           pipeline, covering each of the application's graphics operations.
> fence: specific point in a timeline where synchronization needs to happen.
>
>
> So now, the current include/linux/fence.h implementation is, I believe, missing
> the objective by confusing hardware and pipeline timelines and by bolting fences
> to buffer objects, while what is really needed is a true and proper timeline for
> both hardware and pipeline. But before going further down that road let me look
> at things and explain how I see them.
>
> Current TTM fences have one sole purpose: allowing synchronization for buffer
> object moves, even though some drivers like radeon slightly abuse them and use
> them for things like lockup detection.
>
> The new fence wants to expose an API that would allow some implementation of a
> timeline. For that it introduces callbacks and some hard requirements on what the
> driver has to expose :
> enable_signaling
> [signaled]
> wait
>
> Each of those has to do work inside the driver to which the fence belongs, and
> each of those can be called from more or less unexpected contexts (with
> restrictions like outside irq). So we end up with things like :
>
> Process 1 Process 2 Process 3
> I_A_schedule(fence0)
> CI_A_F_B_signaled(fence0)
> I_A_signal(fence0)
> CI_B_F_A_callback(fence0)
> CI_A_F_B_wait(fence0)
> Legend:
> I_x: in driver x (I_A == in driver A)
> CI_x_F_y: call in driver x from driver y (CI_A_F_B == call in driver A from driver B)
>
> So this is a happy mess: everyone calls everyone, and this is bound to get messy.
> Yes, I know there are all kinds of requirements on what happens once a fence is
> signaled. But those requirements only look like they are trying to atone for any
> mess that can happen from the whole callback dance.
>
> While I too was seduced by the whole callback idea a long time ago, I think it is
> a highly dangerous path to take, where the combinations of what could happen
> are bound to explode as the number of players increases.
>
>
> So now back to how to solve the problem we are trying to address. First I want
> to make an observation: almost all GPUs that exist today have a command ring
> on which userspace command buffers are executed, and inside the command ring
> you can do something like :
>
> if (condition) execute_command_buffer else skip_command_buffer
>
> where condition is a simple expression (memory_address cop value) with cop one
> of the generic comparisons (==, <, >, <=, >=). I think it is a safe assumption
> that any GPU that slightly matters can do that. Those that cannot should fix
> their command ring processor.
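
To make the conditional execution above concrete, here is a minimal userspace C sketch of what the ring processor is assumed to evaluate. The names ring_cop, ring_cond, ring_exec_if and count_exec are made up for illustration; no real hardware packet format or driver API is implied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Comparison operators the command ring is assumed to support (hypothetical). */
enum ring_cop { COP_EQ, COP_LT, COP_GT, COP_LE, COP_GE };

/* One conditional packet: "if (*addr cop value) execute else skip". */
struct ring_cond {
    const volatile uint64_t *addr; /* memory location the ring reads */
    enum ring_cop cop;
    uint64_t value;
};

/* Evaluate the condition the way the ring processor is assumed to. */
bool ring_cond_true(const struct ring_cond *c)
{
    uint64_t cur = *c->addr;

    switch (c->cop) {
    case COP_EQ: return cur == c->value;
    case COP_LT: return cur <  c->value;
    case COP_GT: return cur >  c->value;
    case COP_LE: return cur <= c->value;
    case COP_GE: return cur >= c->value;
    }
    return false;
}

/* Run cmd_buf only if the condition holds; returns true if it was executed. */
bool ring_exec_if(const struct ring_cond *c, void (*cmd_buf)(void *), void *arg)
{
    assert(c && c->addr && cmd_buf);
    if (!ring_cond_true(c))
        return false; /* skip_command_buffer */
    cmd_buf(arg);     /* execute_command_buffer */
    return true;
}

/* Stand-in for a command buffer: counts how many times it was executed. */
void count_exec(void *arg)
{
    ++*(int *)arg;
}
```

With addr pointing at another block's last signaled sequence number, cop set to COP_GE and value set to the fence sequence number, the buffer is skipped until that fence has signaled.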
>
>
> With that in mind, I think the proper solution is implementing timelines and
> having a fence be a timeline object with a way simpler API. For each hardware
> timeline the driver provides a system memory address at which the latest signaled
> fence sequence number can be read. Each fence object is uniquely associated with
> both a hardware and a pipeline timeline. Each pipeline timeline has a wait
> queue.
>
> When scheduling something that requires synchronization on a hardware timeline,
> a fence is created and associated with the pipeline timeline and hardware
> timeline. Other hardware blocks that need to wait on a fence can use their
> command ring conditional execution to directly check the fence sequence from
> the other hw block, so you do optimistic scheduling. If optimistic scheduling
> fails (which would be reported by a hw block specific solution and hidden) then
> things can fall back to a software CPU wait inside what could be considered the
> kernel thread of the pipeline timeline.
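
A rough userspace C sketch of the optimistic path and its fallback, under stated assumptions: struct job, job_submit, job_retry and mark_ran are hypothetical names, the structs are cut down to the fields needed here, and the "CPU wait" is modelled as a retry made after the pipeline timeline has been woken up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Cut-down hardware timeline: only the last signaled sequence number. */
struct hw_timeline {
    volatile uint64_t signaled_seq; /* written by the hw block */
};

/* Cut-down fence: a point on a hardware timeline. */
struct fence {
    struct hw_timeline *hw_timeline;
    uint64_t seq_num;
};

enum job_state { JOB_EXECUTED, JOB_DEFERRED };

/* Hypothetical job: a command buffer guarded by one dependency. */
struct job {
    const struct fence *dep; /* fence from another hw block */
    void (*execute)(void *);
    void *arg;
    enum job_state state;
};

bool fence_signaled(const struct fence *f)
{
    return f->hw_timeline->signaled_seq >= f->seq_num;
}

/* Optimistic path: check the other block's sequence number directly. */
enum job_state job_submit(struct job *j)
{
    assert(j && j->dep && j->execute);
    if (fence_signaled(j->dep)) {
        j->execute(j->arg);
        j->state = JOB_EXECUTED;
    } else {
        j->state = JOB_DEFERRED; /* optimistic scheduling failed */
    }
    return j->state;
}

/* Fallback path: meant to run in the pipeline timeline's context after a
 * wakeup; retries a deferred job and leaves an executed one alone. */
enum job_state job_retry(struct job *j)
{
    if (j->state == JOB_DEFERRED)
        return job_submit(j);
    return j->state;
}

/* Stand-in for the command buffer: counts executions. */
void mark_ran(void *arg)
{
    ++*(int *)arg;
}
```

In a real driver the deferred case would sleep on the pipeline timeline wait queue rather than rely on the caller to retry; the sketch only shows which check happens where.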
>
>
> From the API point of view there is no inter-driver call. All the driver needs to
> do is wake up the pipeline timeline wait_queue when things are signaled or
> when things go sideways (GPU lockup).
>
>
> So how to implement that with current drivers? Well, easy. Currently we assume
> implicit synchronization, so all we need is an implicit pipeline timeline per
> userspace process (note this does not prevent inter-process synchronization).
> Every time a command buffer is submitted it is added to the implicit timeline
> with the simple fence object :
>
> struct fence {
>     struct list_head list_hwtimeline;
>     struct list_head list_pipetimeline;
>     struct hw_timeline *hw_timeline;
>     uint64_t seq_num;
>     work_t timedout_work;
>     void *csdata;
> };
>
> So with a set of helper functions called by each driver's command execution
> ioctl, you have the implicit timeline properly populated, and each
> driver's command execution gets its dependencies from the implicit timeline.
>
>
> Of course, to take full advantage of all the flexibility this could offer, we
> would need to allow userspace to create pipeline timelines and to schedule
> against the pipeline timeline of their choice. We could create a file for
> each pipeline timeline and have file operations to wait/query
> progress.
>
> Note that GPU lockups are considered exceptional events; the implicit
> timeline will probably want to continue with other jobs on other hardware
> blocks, but the explicit one will probably want to decide whether to continue,
> abort, or retry without the faulty hw block.
>
>
> I realize I am late to the party and that I should have taken a serious
> look at all this a long time ago. I apologize for that, and if you consider
> this is too late then just ignore me, modulo the big warning about the craziness
> that callbacks will introduce and how bad things are bound to happen. I am not
> saying that bad things cannot happen with what I propose, just that
> because everything happens inside the process context that is the one
> asking/requiring synchronization, there will be no inter-process kernel
> callback (a callback that was registered by one process and that is called
> inside another process's time slice because fence signaling is happening
> inside this other process's time slice).
>
>
> Pseudo code for explicitness :
>
> int drm_cs_ioctl_wrapper(struct drm_device *dev, void *data, struct file *filp)
> {
>     struct fence *dependency[16], *fence;
>     int m;
>
>     // The implicit pipeline timeline is per process, so it lives in filp.
>     m = timeline_schedule(filp->implicit_pipeline, dev->hw_pipeline,
>                           dependency, 16, &fence);
>     if (m < 0)
>         return m;
>     if (m >= 16) {
>         // alloc m and recall;
>     }
>     return dev->cs_ioctl(dev, data, filp, filp->implicit_pipeline,
>                          dependency, fence);
> }
>
> int timeline_schedule(ptimeline, hwtimeline, timeout,
>                       dependency, mdep, **fence)
> {
>     // allocate fence, set hw_timeline and init work
>     // build up the list of dependencies by looking at the list of pending
>     // fences in the timeline
> }
>
>
>
> // If the device driver schedules a job hoping for all dependencies to be
> // signaled, then it must also call this function, with csdata being a copy of
> // what needs to be executed once all dependencies are signaled.
> void timeline_missed_schedule(timeline, fence, void *csdata)
> {
>     INIT_DELAYED_WORK(&fence->timedout_work, timeline_missed_schedule_worker);
>     fence->csdata = csdata;
>     schedule_delayed_work(&fence->timedout_work, default_timeout);
> }
>
> void timeline_missed_schedule_worker(work)
> {
>     fence = fence_from_work(work); // e.g. container_of() on timedout_work
>     driver = driver_from_fence_hwtimeline(fence);
>
>     // Make sure that each of the hwtimeline dependencies will fire an irq by
>     // calling a driver function.
>     timeline_wait_for_fence_dependency(fence);
>     driver->execute_cs(driver, fence);
> }
>
> // This function is called by driver code that signals a fence (it could be
> // called from interrupt context). It is the responsibility of the device driver
> // to call this function.
> void timeline_signal(hwtimeline)
> {
>     for_each_fence(fence, hwtimeline->fences, list_hwtimeline) {
>         wakeup(fence->pipetimeline->wait_queue);
>     }
> }
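
For illustration, the timeline_signal() pseudo code can be sketched in plain userspace C. This is a sketch under assumptions, not the proposed kernel code: a singly linked list stands in for list_hwtimeline, a counter stands in for waking the wait queue, the new_seq parameter is added so the sketch can be driven by hand, and unlike the pseudo code it only retires fences whose sequence number has actually been reached.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a pipeline timeline: wakeups counts wait queue wakeups. */
struct pipe_timeline {
    unsigned wakeups;
};

struct fence {
    struct fence *next_hw;              /* stand-in for list_hwtimeline */
    struct pipe_timeline *pipetimeline; /* timeline to wake on signal */
    uint64_t seq_num;
};

struct hw_timeline {
    uint64_t signaled_seq; /* last signaled sequence number */
    struct fence *fences;  /* pending fences, increasing seq_num order */
};

/* Called by the driver when the hw block reaches new_seq. Retires every
 * pending fence up to new_seq, wakes its pipeline timeline, and returns
 * the number of fences retired. No call into any other driver is made. */
size_t timeline_signal(struct hw_timeline *hw, uint64_t new_seq)
{
    size_t retired = 0;

    assert(hw && new_seq >= hw->signaled_seq);
    hw->signaled_seq = new_seq;

    while (hw->fences && hw->fences->seq_num <= new_seq) {
        struct fence *f = hw->fences;

        hw->fences = f->next_hw;    /* unlink from the hw timeline */
        f->next_hw = NULL;
        f->pipetimeline->wakeups++; /* wake the pipeline timeline */
        retired++;
    }
    return retired;
}
```

The point of the sketch is that signaling only touches the hardware timeline's own list and the wait queues of the pipeline timelines involved.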
Btw, as an extra note: because of the implicit timelines, any shared object scheduled on a
hw timeline must add a fence to all the implicit timelines where this object exists.
Also, there is no need to have a fence pointer per object.
>
>
> Cheers,
> Jérôme