[RFC PATCH v2] dmabuf-sync: Introduce buffer synchronization framework

Inki Dae daeinki at gmail.com
Fri Jun 21 04:01:48 PDT 2013


2013/6/21 Lucas Stach <l.stach at pengutronix.de>

> Am Donnerstag, den 20.06.2013, 20:15 +0900 schrieb Inki Dae:
> [...]
> > > > > You already need some kind of IPC between the two tasks, as I
> > > > > suspect even in your example it wouldn't make much sense to queue
> > > > > the buffer over and over again in task B without task A writing
> > > > > anything to it. So task A has to signal task B there is new data
> > > > > in the buffer to be processed.
> > > > >
> > > > > There is no need to share the buffer over and over again just to
> > > > > get the two processes to work together on the same thing. Just
> > > > > share the fd between both and then do out-of-band completion
> > > > > signaling, as you need this anyway. Without this you'll end up
> > > > > with unpredictable behavior. Just because sync allows you to
> > > > > access the buffer doesn't mean it's valid for your use-case.
> > > > > Without completion signaling you could easily end up overwriting
> > > > > your data from task A multiple times before task B even tries to
> > > > > lock the buffer for processing.
> > > > > So the valid flow is (and this already works with the current APIs):
> > > > > Task A                                    Task B
> > > > > ------                                    ------
> > > > > CPU access buffer
> > > > >          ----------completion signal--------->
> > > > >                                           qbuf (dragging buffer into
> > > > >                                           device domain, flush caches,
> > > > >                                           reserve buffer etc.)
> > > > >                                                     |
> > > > >                                           wait for device operation
> > > > >                                           to complete
> > > > >                                                     |
> > > > >                                           dqbuf (dragging buffer back
> > > > >                                           into CPU domain, invalidate
> > > > >                                           caches, unreserve)
> > > > >         <---------completion signal------------
> > > > > CPU access buffer
> > > >
> > > > Correct. In case that data flow goes from A to B, it needs some kind
> > > > of IPC between the two tasks every time as you said. Then, without
> > > > dmabuf-sync, how do think about the case that two tasks share the
> > > > same buffer but these tasks access the buffer(buf1) as write, and
> > > > data of the buffer(buf1) isn't needed to be shared?
> > > >
> > > Sorry, I don't see the point you are trying to solve here. If you share
> > > a buffer and want its content to be clearly defined at every point in
> > > time you have to synchronize the tasks working with the buffer, not
> > > just the buffer accesses itself.
> > >
> > > Easiest way to do so is doing sync through userspace with out-of-band
> > > IPC, like in the example above.
> >
> > In my opinion, that's not definitely easiest way. What I try to do is
> > to avoid using *the out-of-band IPC*. As I mentioned in document file,
> > the conventional mechanism not only makes user application
> > complicated-user process needs to understand how the device driver is
> > worked-but also may incur performance overhead by using the
> > out-of-band IPC. The above my example may not be enough to you but
> > there would be other cases able to use my approach efficiently.
> >
>
> Yeah, you'll need some knowledge and understanding about the API you are
> working with to get things right. But I think it's not an unreasonable
> thing to expect the programmer working directly with kernel interfaces
> to read up on how things work.
>
> Second thing: I'll rather have *one* consistent API for every subsystem,
> even if they differ from each other than having to implement this
> syncpoint thing in every subsystem. Remember: a single execbuf in DRM
> might reference both GEM objects backed by dma-buf as well native SHM or
> CMA backed objects. The dma-buf-mgr proposal already allows you to
> handle dma-bufs much the same way during validation than native GEM
> objects.
>

Actually, at first I implemented a fence helper framework based on
reservation and dma-fence to provide an easy-to-use interface for device
drivers. However, that was the wrong implementation: I had not only
customized dma-fence but had also not considered the deadlock issue. So I
reimplemented it as dmabuf-sync to solve the deadlock issue, and at that
point I realized that we first need to concentrate on the most basic
things: the fact that CPU and CPU, CPU and DMA, or DMA and DMA can access
the same buffer; the fact that simple is best; and the fact that we need
not only kernel-side but also user-side interfaces. After that, I
collected what is common to all the subsystems and devised this
dmabuf-sync framework from it.

I'm not really a specialist in the desktop world, so a question: isn't
execbuf used only for the GPU? The GPU has dedicated video memory (VRAM),
so it needs a migration mechanism between system memory and that dedicated
video memory, and also has to consider ordering issues while buffers are
migrated.


>
> And to get back to my original point: if you have more than one task
> operating together on a buffer you absolutely need some kind of real IPC
> to sync them up and do something useful. Both you syncpoints and the
> proposed dma-fences only protect the buffer accesses to make sure
> different task don't stomp on each other. There is nothing in there to
> make sure that the output of your pipeline is valid. You have to take
> care of that yourself in userspace. I'll reuse your example to make it
> clear what I mean:
>
> Task A                                         Task B
> ------                                         -------
> dma_buf_sync_lock(buf1)
> CPU write buf1
> dma_buf_sync_unlock(buf1)
>           ---------schedule Task A again-------
> dma_buf_sync_lock(buf1)
> CPU write buf1
> dma_buf_sync_unlock(buf1)
>             ---------schedule Task B---------
>                                                qbuf(buf1)
>                                                   dma_buf_sync_lock(buf1)
>                                                ....
>
> This is what can happen if you don't take care of proper syncing. Task A
> writes something to the buffer in expectation that Task B will take care
> of it, but before Task B even gets scheduled Task A overwrites the
> buffer again. Not what you wanted, isn't it?
>

That is exactly the wrong example. I already mentioned that case: "In
case that data flow goes from A to B, it needs some kind of IPC between
the two tasks every time." So again, your example has no problem in the
case that *two tasks share the same buffer, both tasks access the
buffer(buf1) as write, and the data of the buffer(buf1) doesn't need to
be shared*. They just need to use the buffer as *storage*, so all they
want in this case is to avoid stomping on the buffer.
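
To illustrate that storage-only case, here is a minimal sketch of the two
tasks. The dma_buf_sync_lock()/dma_buf_sync_unlock() calls reuse the
placeholder names from the example above and stand for whatever
bracketed-access primitive is finally exposed to userspace (they are not
an existing mainline API), and no out-of-band IPC appears anywhere
because no data is ever handed from one task to the other.

/*
 * Sketch only: dma_buf_sync_lock()/dma_buf_sync_unlock() are the
 * placeholder names from the example above, standing in for a
 * bracketed-access primitive; they are not an existing mainline API.
 * Here the dma-buf is identified by its fd and mmap'ed into both tasks.
 */
#include <stddef.h>
#include <string.h>

void dma_buf_sync_lock(int dmabuf_fd);    /* hypothetical */
void dma_buf_sync_unlock(int dmabuf_fd);  /* hypothetical */

/* Task A: uses buf1 purely as scratch storage. */
void task_a_use_buffer(int buf1_fd, void *map, size_t size)
{
	dma_buf_sync_lock(buf1_fd);   /* keeps task B (or a device) from stomping on us */
	memset(map, 0xa5, size);      /* scratch data that nobody else consumes */
	dma_buf_sync_unlock(buf1_fd);
	/* no completion signal: nothing is handed over to task B */
}

/* Task B: does the same, independently; the order of A and B does not matter. */
void task_b_use_buffer(int buf1_fd, void *map, size_t size)
{
	dma_buf_sync_lock(buf1_fd);   /* simply blocks while task A holds the buffer */
	memset(map, 0x5a, size);
	dma_buf_sync_unlock(buf1_fd);
}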


>
> So to make sure the output of a pipeline of some kind is what you expect
> you have to do syncing with IPC


That's just not true.


> . And once you do CPU access it is a
> synchronous thing in the stream of events. I see that you might want to
> have some kind of bracketed CPU access even for the fallback mmap case
> for things like V4L2 that don't provide explicit sync by their own, but
> in no way I can see why we would need a user/kernel shared syncpoint for
> this.
>
> > > A more advanced way to achieve this
> > > would be using cross-device fences to avoid going through userspace for
> > > every syncpoint.
> > >
> >
> > Ok, maybe there is something I missed. So question. What is the
> > cross-device fences? dma fence?. And how we can achieve the
> > synchronization mechanism without going through user space for every
> > syncpoint; CPU and DMA share a same buffer?. And could you explain it
> > in detail as long as possible like I did?
> >
> Yeah I'm talking about the proposed dma-fences. They would allow you to
> just queue things into the kernel without waiting for a device operation
> to finish. But you still have to make sure that your commands have the
> right order and don't go wild. So for example you could do something
> like this:
>
> Userspace                                   Kernel
> ---------                                   ------
> 1. build DRM command stream
> rendering into buf1
> 2. queue command stream with execbuf
>                                             1. validate command stream
>                                              1.1 reference buf1 for writing
>                                                  through dma-buf-mgr
>                                             2. kick off GPU processing
> 3. qbuf buf1 into V4L2
>                                             3. reference buf1 for reading
>                                              3.1 wait for fence from GPU to
>                                                  signal
>                                             4. kick off V4L2 processing
>
>
That seems very specific to desktop GPUs, doesn't it?
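
As I understand it, the user-side ordering described above would look
roughly like the sketch below. The DRM execbuf ioctl is driver-specific,
so it appears here only as an opaque request number, while the V4L2 side
uses the existing DMABUF memory type. Stream setup and error handling are
omitted, and the fence behaviour in the comments is the proposed
dma-fence behaviour, not something userspace can rely on today.

#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

int queue_rendered_frame(int drm_fd, unsigned long execbuf_ioctl,
			 void *execbuf_args, int v4l2_fd, int buf1_dmabuf_fd)
{
	struct v4l2_buffer buf;

	/* Steps 1-2 above: queue the GPU job that renders into buf1; the
	 * kernel validates the stream, references buf1 for writing and
	 * kicks off the GPU. The ioctl number is driver-specific, so it
	 * is passed in rather than hard-coded. */
	if (ioctl(drm_fd, execbuf_ioctl, execbuf_args))
		return -1;

	/* Step 3 above: queue buf1 to V4L2 right away; with cross-device
	 * fences the kernel would wait for the GPU's fence before starting
	 * the V4L2 DMA, so userspace never has to block in between. */
	memset(&buf, 0, sizeof(buf));
	buf.type = V4L2_BUF_TYPE_VIDEO_OUTPUT;
	buf.memory = V4L2_MEMORY_DMABUF;
	buf.index = 0;
	buf.m.fd = buf1_dmabuf_fd;
	return ioctl(v4l2_fd, VIDIOC_QBUF, &buf);
}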


> So you don't need to wait in userspace and potentially avoid some
> context switches,


Also not true.


> but you still have to make sure that GPU commands are
> queued before you queue the V4L2 operation to make sure things get
> operated on in the right order.
>
>
>


I'd like to say that my approach is not perfect, so it may definitely
have many problems and need additional work - actually, I have found
some problems and am solving them, and in addition the implementation of
a generic user-side interfacing mechanism is in progress toward that
goal - and thank you for your comments. However, I think we can keep
trying to make this something better. Lastly, I look forward to your
continued good advice.

Thanks,
Inki Dae



> Regards,
> Lucas
>
> --
> Pengutronix e.K.                           | Lucas Stach                 |
> Industrial Linux Solutions                 | http://www.pengutronix.de/  |
> Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
> Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |
>
>