Fence, timeline and android sync points

Daniel Vetter daniel at ffwll.ch
Thu Aug 14 08:55:51 PDT 2014

On Thu, Aug 14, 2014 at 10:23:30AM -0400, Jerome Glisse wrote:
> On Thu, Aug 14, 2014 at 11:08:34AM +0200, Daniel Vetter wrote:
> > On Wed, Aug 13, 2014 at 01:07:20PM -0400, Jerome Glisse wrote:
> > > Let me make this crystal clear: this must be a valid kernel page that has a
> > > valid kernel mapping for the lifetime of the device. Hence there is no access
> > > to mmio space or anything, just a regular kernel page. If we cannot rely on
> > > that, this is a sad world.
> > > 
> > > That being said, yes, I am aware that some devices are incapable of writing
> > > to such a page. For that dumb hardware, what you need to do is have the irq
> > > handler write to this page on behalf of the hardware. But I would like to
> > > know of any hardware that cannot write a single dword from its ring buffer.
> > > 
> > > The only tricky part in this is when the device is unloaded and the driver is
> > > removing itself: it obviously needs to synchronize itself with anyone possibly
> > > waiting on it and possibly reading. But truly this is not that hard to solve.
> > > 
> > > So tell me, once the above is clear, what kind of scary thing can happen when
> > > the cpu or a device _reads_ a kernel page?
> > 
> > It's not reading it, it's making sense of what you read. In i915 we had
> > exactly the (timeline, seqno) value pair design for fences for a long
> > time, and we're switching away from it since it stops working when you
> > have preemption and a scheduler. Or at least it gets really interesting
> > to interpret the data written into the page.
> > 
> > So I don't want to expose that to other drivers if we decided that
> > exposing this internally is a stupid idea.
> I am well aware of that, but context scheduling really is what the timeline I
> am talking about is. The user timeline should be considered like a single cpu
> thread onto which operations involving different hw are scheduled. You can
> have the hw switch from one timeline to another, and a seqno is only
> meaningful per timeline. The whole preemption and scheduling thing is bound
> to happen on the gpu, and we will want to integrate with the core scheduler
> to manage the time slice allocated to each process. But then we need the
> concept of a thread in which operations on the same hw block are processed in
> a linear fashion while still allowing concurrency with other hw blocks.

Well yeah, it's just that a gpu thread doesn't have a lot to do with a cpu
process, at least in the current drm world. It would be nice to make that a
bit more cross-driver for management, but we didn't even manage to pull
that off for gpu memory. So I don't think it'll happen anytime soon.
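The (timeline, seqno) scheme both sides describe can be sketched in a few
lines. This is an illustrative model with invented names, not the i915 or
radeon implementation: each timeline owns a counter that the device (or the
irq handler on its behalf) writes into a kernel page, and a fence is just a
target value on one timeline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct hw_timeline {
	volatile uint32_t *seqno_page;	/* kernel page the device writes to */
	uint32_t next_seqno;		/* next value handed out at submit */
};

struct hw_fence {
	struct hw_timeline *timeline;
	uint32_t seqno;			/* value the page must reach */
};

/* Wraparound-safe "current has passed target" comparison; valid as long
 * as the two values are less than 2^31 apart. */
static bool seqno_passed(uint32_t current, uint32_t target)
{
	return (int32_t)(current - target) >= 0;
}

static struct hw_fence timeline_emit_fence(struct hw_timeline *tl)
{
	struct hw_fence f = { .timeline = tl, .seqno = ++tl->next_seqno };
	return f;
}

static bool fence_signaled(const struct hw_fence *f)
{
	return seqno_passed(*f->timeline->seqno_page, f->seqno);
}
```

The interesting part, as noted above, is that once a scheduler can preempt
and reorder work, the single per-timeline counter no longer tells you which
individual submissions have completed, which is exactly why interpreting the
page "gets really interesting".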

> > > > > > So from that pov (presuming I didn't miss anything) your proposal is
> > > > > > identical to what we have, minor some different color choices (like where
> > > > > > to place the callback queue).
> > > > > 
> > > > > No callback is the mantra here, and instead of bolting free-living
> > > > > fences to buffer objects, they are associated with a timeline, which
> > > > > means you do not need to go over all buffer objects to know what you
> > > > > need to wait for.
> > > > 
> > > > Ok, then I guess I didn't understand that part of your proposal. Can
> > > > you please elaborate a bit more on how you want to synchronize multiple
> > > > drivers accessing a dma-buf object, and what piece of state we need to
> > > > associate with the dma-buf to make this happen?
> > > 
> > > The beauty of it is that you associate zilch with the buffer. So for the
> > > existing cs ioctl, where implicit synchronization is the rule, it enforces
> > > mandatory synchronization across all hw timelines on which a buffer shows up:
> > >   for_each_buffer_in_cmdbuffer(buffer, cmdbuf) {
> > >     if (!cmdbuf_write_to_buffer(buffer, cmdbuf))
> > >       continue;
> > >     for_each_process_sharing_buffer(buffer, process) {
> > >       schedule_fence(process->implicit_timeline, cmdbuf->fence)
> > >     }
> > >   }
> > > 
> > > The next time another process uses the current ioctl with implicit sync, it
> > > will sync with the last fence for any shared resource. This might sound bad,
> > > but truly, as it is right now, this is already how it happens (at least for
> > > radeon).
> > 
> > Well i915 is a lot better than that. And I'm not going to implement some
> > special-case for dma-buf shared buffers just because radeon sucks and
> > wants to enforce that suckage on everyone else.
> I guess I am having a hard time expressing myself. What I am saying here is
> that implicit synchronization sucks because it has to assume the worst case,
> and this is what the current code does; I am sure intel is doing something
> similar with today's code.
> Explicit synchronization allows more flexibility, but the fence code as it is
> designed does not allow going fully down that line, because it associates
> fences with buffer objects, which is the biggest shortcoming of implicit sync.

I don't see how your statement that implicit sync sucks maps to the
reality of i915, where it doesn't. The only thing you can't do is share
buffer objects that you suballocate. And userspace is always allowed to
lie about the access mode (if it lies too badly it will simply read the
zero page), so it can forgo implicit sync.
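To make the worst case being debated concrete, here is a minimal model of
what the quoted pseudocode implies (invented names, not a real kernel API):
with implicit sync the kernel tracks the last writer per shared buffer and
serializes every later user behind that fence, whether or not the accesses
actually conflict.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct fence {
	uint32_t id;
	bool signaled;
};

struct buffer {
	struct fence *last_write;	/* fence of the most recent writer */
};

/* Returns the fence the new submission must wait on for this buffer,
 * and installs the new fence as the buffer's last writer if this
 * submission writes to it.  NULL means nothing to wait for. */
static struct fence *implicit_sync_buffer(struct buffer *buf,
					  struct fence *new_fence,
					  bool is_write)
{
	struct fence *wait_on = buf->last_write;

	if (is_write)
		buf->last_write = new_fence;
	return wait_on;
}
```

This is the pessimism both sides are pointing at: the dependency is derived
from the buffer, so two processes that never touch the same bytes still get
ordered against each other.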

> > So let's cut this short: If you absolutely insist I guess we could ditch
> > the callback stuff from fences, but I really don't see the problem with
> > radeon just not using that and then being happy. We can easily implement a
> > bit of insulation code _just_ for radeon so that the only thing radeon
> > does is wake up a process (which then calls the callback if it's something
> > special).
> Like I said, feel free to ignore me. I just genuinely want to have the best
> solution inside the linux kernel, and I do think that fence-and-callback and
> buffer association is not that solution. I tried to explain why, but I might
> be failing or missing something.
> > Otoh I don't care about what ttm and radeon do, for i915 the important
> > stuff is integration with android syncpts and being able to do explicit
> > fencing for e.g. svm stuff. We can do that with what's merged in 3.17 and
> > I expect that those patches will land in 3.18, at least the internal
> > integration.
> > 
> > It would be cool if we could get tear-free optimus working on desktop
> > linux, but that flat out doesn't pay my bills here. So I think I'll let
> > you guys figure this out yourself.
> Sad to learn that Intel no longer has any interest in the linux desktop.

Well, the current Linux DRI stack works fairly well on Linux with Intel, I
think, so I don't think anyone can claim we don't care. It only starts
to suck if you mix in prime.

And as a matter of fact the current Linux DRI model uses implicit sync, so
that's what dma-buf currently implements. Of course we can implement an
explicit sync gpu model (and it will come for at least opencl and similar
stuff), and probably also for interactions with v4l, but I don't see any
patches to rewrite the current X stack we have.
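For contrast, a rough sketch of the explicit-sync model referred to here
(Android syncpts, explicit fencing for svm-style use), with hypothetical
names: userspace threads the producer's out-fence into the consumer's
submission itself, and buffers carry no synchronization state at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fence {
	bool signaled;
};

struct submit {
	struct fence *in_fence;		/* explicit dependency, may be NULL */
	struct fence out_fence;		/* signaled when the work retires */
};

/* Userspace wires the dependency itself: the producer's out fence
 * becomes the consumer's in fence; the kernel never walks a buffer
 * list to infer ordering. */
static void submit_init(struct submit *s, struct fence *in)
{
	s->in_fence = in;
	s->out_fence.signaled = false;
}

static bool submit_ready(const struct submit *s)
{
	return s->in_fence == NULL || s->in_fence->signaled;
}

static void submit_retire(struct submit *s)
{
	s->out_fence.signaled = true;
}
```

The trade-off matches the thread above: more flexibility, but the burden of
describing dependencies moves from the kernel to userspace, which is exactly
why the existing implicit-sync X stack can't just switch over.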

And yeah we kinda optimized that one a bit on Intel since we actually care
about X ;-)

What I'll stop caring about is how exactly radeon integrates into this,
not the linux desktop in general.
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

More information about the dri-devel mailing list