Fence, timeline and android sync points

Jerome Glisse j.glisse at gmail.com
Thu Aug 14 07:23:30 PDT 2014


On Thu, Aug 14, 2014 at 11:08:34AM +0200, Daniel Vetter wrote:
> On Wed, Aug 13, 2014 at 01:07:20PM -0400, Jerome Glisse wrote:
> > Let me make this crystal clear: this must be a valid kernel page that has a
> > valid kernel mapping for the lifetime of the device. Hence there is no access
> > to mmio space or anything, just a regular kernel page. If we cannot rely on
> > that, this is a sad world.
> > 
> > That being said, yes, I am aware that some devices are incapable of writing to
> > such a page. For that dumb hardware, what you need to do is have the irq handler
> > write to this page on behalf of the hardware. But I would like to know of any
> > hardware that cannot write a single dword from its ring buffer.
> > 
> > The only tricky part in this is when the device is unloaded and the driver is
> > removing itself: it obviously needs to synchronize with anyone possibly still
> > waiting on it and possibly reading. But truly this is not that hard to solve.
> > 
> > So tell me, once the above is clear, what kind of scary thing can happen when
> > the cpu or a device _reads_ a kernel page?
> 
> It's not reading it, it's making sense of what you read. In i915 we had
> exactly the (timeline, seqno) value pair design for fences for a long
> time, and we're switching away from it since it stops working when you
> have preemption and a scheduler. Or at least it gets really interesting to
> interpret the data written into the page.
> 
> So I don't want to expose that to other drivers if we decided that
> exposing this internally is a stupid idea.

I am well aware of that, but context scheduling really is what the timeline I
am talking about is. The user timeline should be considered like a single cpu
thread onto which operations involving different hw are scheduled. You can have
the hw switch from one timeline to another, and a seqno is only meaningful per
timeline.

The whole preemption and scheduling thing is bound to happen on the gpu, and we
will want to integrate with the core scheduler to manage the time slices
allocated to processes, but then we need the concept of a thread in which
operations on the same hw block are processed in a linear fashion while still
allowing concurrency with other hw blocks.
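
To make this concrete, here is a minimal sketch of what such a per-timeline
seqno could look like. All names (hw_timeline, hw_fence, hw_fence_signaled) are
hypothetical and this is not actual radeon or i915 code; it only illustrates
the idea that a fence is a (timeline, seqno) pair checked by reading a regular
kernel page that the hw, or the irq handler on behalf of dumb hw, keeps up to
date:

  #include <linux/types.h>
  #include <linux/compiler.h>

  /* Hypothetical sketch, not real driver code: a fence is just a
   * (timeline, seqno) pair and signalling is a plain read of a kernel
   * page kept up to date by the hw (or by the irq handler for hw that
   * cannot write it itself). */
  struct hw_timeline {
          u32 *completed_seqno;    /* kernel page mapped for the device lifetime */
          u32 last_emitted_seqno;  /* last seqno handed out on this timeline */
  };

  struct hw_fence {
          struct hw_timeline *timeline;
          u32 seqno;               /* only meaningful on that timeline */
  };

  static bool hw_fence_signaled(const struct hw_fence *fence)
  {
          /* plain read, no callback machinery; seqno wrap-around handling
           * omitted for brevity */
          return READ_ONCE(*fence->timeline->completed_seqno) >= fence->seqno;
  }

Preemption and scheduling only change who writes completed_seqno and when; the
interpretation of a seqno stays strictly per timeline.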

> 
> > > 
> > > > > So from that pov (presuming I didn't miss anything) your proposal is
> > > > > identical to what we have, minus some different color choices (like where
> > > > > to place the callback queue).
> > > > 
> > > > No callback is the mantra here, and instead of bolting free-living fences
> > > > to buffer objects, they are associated with a timeline, which means you do
> > > > not need to go over all buffer objects to know what you need to wait for.
> > > 
> > > Ok, then I guess I didn't understand that part of your proposal. Can
> > > you please elaborate a bit more on how you want to synchronize multiple
> > > drivers accessing a dma-buf object and what piece of state we need to
> > > associate with the dma-buf to make this happen?
> > 
> > The beauty of it is that you associate zilch with the buffer. So for the
> > existing cs ioctl, where implicit synchronization is the rule, it enforces
> > mandatory synchronization across all hw timelines on which a buffer shows up:
> >   for_each_buffer_in_cmdbuffer(buffer, cmdbuf) {
> >     if (!cmdbuf_write_to_buffer(buffer, cmdbuf))
> >       continue;
> >     for_each_process_sharing_buffer(buffer, process) {
> >       schedule_fence(process->implicit_timeline, cmdbuf->fence)
> >     }
> >   }
> > 
> > The next time another process uses the current ioctl with implicit sync, it
> > will sync with the last fence for any shared resource. This might sound bad,
> > but truly, as it is right now, this is already how it happens (at least for
> > radeon).
> 
> Well i915 is a lot better than that. And I'm not going to implement some
> special-case for dma-buf shared buffers just because radeon sucks and
> wants to enforce that suckage on everyone else.

I guess I am having a hard time expressing myself. What I am saying here is that
implicit synchronization sucks because it has to assume the worst case, and this
is what the current code does; I am sure intel is doing something similar with
today's code.

Explicit synchronization allows more flexibility, but the fence code as it is
designed does not let us go fully down that line, because it associates fences
with buffer objects, which is the biggest shortcoming of implicit sync.
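
As a purely illustrative sketch (all names here are made up, building on the
hypothetical hw_fence/hw_timeline above, and schedule_wait is an invented
helper), explicit synchronization would let userspace hand in exactly the
fences it cares about, with nothing attached to the buffer objects themselves:

  /* Hypothetical sketch of an explicit-sync submission path: the waits are
   * scheduled on the submitting timeline, not stored on buffer objects. */
  struct explicit_cs_args {
          struct hw_fence *wait_fences;   /* fences userspace asks to wait on */
          unsigned int num_wait_fences;
          struct hw_fence out_fence;      /* fence emitted for this submission */
  };

  static void explicit_cs_schedule(struct hw_timeline *timeline,
                                   struct explicit_cs_args *args)
  {
          unsigned int i;

          /* queue the waits on the timeline; no buffer object is touched
           * (schedule_wait is a made-up helper for this sketch) */
          for (i = 0; i < args->num_wait_fences; i++)
                  schedule_wait(timeline, &args->wait_fences[i]);

          args->out_fence.timeline = timeline;
          args->out_fence.seqno = ++timeline->last_emitted_seqno;
  }

Compare this with the implicit-sync loop quoted above: there the kernel has to
walk every shared buffer and assume the worst, while here the waits are exactly
what the application asked for.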

> 
> So let's cut this short: If you absolutely insist, I guess we could ditch
> the callback stuff from fences, but I really don't see the problem with
> radeon just not using that and then being happy. We can easily implement a
> bit of insulation code _just_ for radeon so that the only thing radeon
> does is wake up a process (which then calls the callback if it's something
> special).

Like I said, feel free to ignore me. I just genuinely want to have the best
solution inside the linux kernel, and I do think that the fence/callback/buffer
association is not that solution. I tried to explain why, but I might be
failing or missing something.

> Otoh I don't care about what ttm and radeon do, for i915 the important
> stuff is integration with android syncpts and being able to do explicit
> fencing for e.g. svm stuff. We can do that with what's merged in 3.17 and
> I expect that those patches will land in 3.18, at least the internal
> integration.
> 
> It would be cool if we could get tear-free optimus working on desktop
> linux, but that flat out doesn't pay my bills here. So I think I'll let
> you guys figure this out yourselves.

Sad to learn that Intel no longer has any interest in the linux desktop.

Cheers,
Jérôme

