[RFC] Explicit synchronization for Nouveau

Daniel Vetter daniel at ffwll.ch
Mon Sep 29 10:20:07 PDT 2014


On Mon, Sep 29, 2014 at 11:42:19AM -0400, Jerome Glisse wrote:
> On Mon, Sep 29, 2014 at 09:43:02AM +0200, Daniel Vetter wrote:
> > On Fri, Sep 26, 2014 at 01:00:05PM +0300, Lauri Peltonen wrote:
> > > 
> > > Hi guys,
> > > 
> > > 
> > > I'd like to start a new thread about explicit fence synchronization.  This time
> > > with a Nouveau twist. :-)
> > > 
> > > First, let me define what I understand by implicit/explicit sync:
> > > 
> > > Implicit synchronization
> > > * Fences are attached to buffers
> > > * Kernel manages fences automatically based on buffer read/write access
> > > 
> > > Explicit synchronization
> > > * Fences are passed around independently
> > > * Kernel takes and emits fences to/from user space when submitting work
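
To make the explicit model concrete, here is a purely hypothetical submit
ioctl sketch (none of these names exist in any driver): the fence travels as
a file descriptor alongside the command stream instead of being attached to a
buffer object.

#include <linux/types.h>

/* Hypothetical uapi struct, for illustration only. */
struct hypothetical_submit {
        __u64 commands;      /* userspace pointer to the command stream */
        __u32 num_commands;  /* number of entries in the command stream */
        __s32 fence_fd_in;   /* wait for this fence before executing, -1 = none */
        __s32 fence_fd_out;  /* out: fence fd that signals when this submission completes */
};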
> > > 
> > > Implicit synchronization is already implemented in open source drivers, and
> > > works well for most use cases.  I don't seek to change any of that.  My
> > > proposal aims at allowing some drm drivers to operate in explicit sync mode to
> > > get maximal performance, while still remaining fully compatible with the
> > > implicit paradigm.
> > 
> > Yeah, pretty much what we have in mind on the i915 side too. I didn't look
> > too closely at your patches, so just a few high-level comments on your RFC
> > here.
> > 
> > > I will try to explain why I think we should support the explicit model as well.
> > > 
> > > 
> > > 1. Bindless graphics
> > > 
> > > Bindless graphics is a central concept when trying to reduce the OpenGL driver
> > > overhead.  The idea is that the application can bind a large set of buffers to
> > > the working set up front using extensions such as GL_ARB_bindless_texture, and
> > > they remain resident until the application releases them (note that compute
> > > APIs typically have similar semantics).  These working sets can be huge,
> > > hundreds or even thousands of buffers, so we would like to opt out from the
> > > per-submit overhead of acquiring locks, waiting for fences, and storing fences.
> > > Automatically synchronizing these working sets in the kernel also prevents
> > > parallelism between channels that share the working set (in fact, sharing
> > > just one buffer from the working set causes the jobs of the two channels to
> > > be serialized).
> > > 
> > > 2. Evolution of graphics APIs
> > > 
> > > The graphics API evolution seems to be going in a direction where game engine
> > > and middleware vendors demand more control over work submission and
> > > synchronization.  We expect that this trend will continue, and more and more
> > > synchronization decisions will be pushed to the API level.  OpenGL and EGL
> > > already provide good explicit command stream level synchronization primitives:
> > > glFenceSync and EGL_KHR_wait_sync.  Their use is also encouraged - for example
> > > the EGL_KHR_image_base spec clearly states that the application is responsible for
> > > synchronizing accesses to EGLImages.  If the API that is exposed to developers
> > > gives control over synchronization to the developer, then implicit waits
> > > that are inserted by the kernel are unnecessary and unexpected, and can
> > > severely hurt performance.  It also makes it easy for the developer to write
> > > code that happens to work on Linux because of implicit sync, but will fail on
> > > other platforms.
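
As a rough illustration of the pattern those extensions enable (assuming
producer and consumer contexts run against the same EGLDisplay and the
application passes the EGLSyncKHR handle between them itself):

#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>

/* Resolve the extension entry points once at init time. */
static PFNEGLCREATESYNCKHRPROC pfn_eglCreateSyncKHR;
static PFNEGLWAITSYNCKHRPROC pfn_eglWaitSyncKHR;

static void init_sync_funcs(void)
{
        pfn_eglCreateSyncKHR =
                (PFNEGLCREATESYNCKHRPROC)eglGetProcAddress("eglCreateSyncKHR");
        pfn_eglWaitSyncKHR =
                (PFNEGLWAITSYNCKHRPROC)eglGetProcAddress("eglWaitSyncKHR");
}

/* Producer: after queuing the GL commands that write the shared EGLImage. */
static EGLSyncKHR producer_fence(EGLDisplay dpy)
{
        EGLSyncKHR sync = pfn_eglCreateSyncKHR(dpy, EGL_SYNC_FENCE_KHR, NULL);
        glFlush();  /* make sure the fence command is actually submitted */
        return sync;
}

/* Consumer: before queuing the GL commands that read the shared EGLImage.
 * This is a server-side wait, so the CPU does not stall. */
static void consumer_wait(EGLDisplay dpy, EGLSyncKHR sync)
{
        pfn_eglWaitSyncKHR(dpy, sync, 0);
}

Sharing such a sync object across process boundaries is exactly where a
kernel-backed fence fd, as discussed in this RFC, becomes necessary.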
> > > 
> > > 3. Suballocation
> > > 
> > > Using user space suballocation can help reduce the overhead when a large number
> > > of small textures are used.  Synchronizing suballocated surfaces implicitly in
> > > the kernel doesn't make sense - many channels should be able to access the same
> > > kernel-level buffer object simultaneously.
> > > 
> > > 4. Buffer sharing complications
> > > 
> > > This is not really an argument for explicit sync as such, but I'd like to point
> > > out that sharing buffers across SoC engines is often much more complex than
> > > just exporting and importing a dma-buf and waiting for the dma-buf fences.
> > > Sometimes we need to do color format or tiling layout conversion.  Sometimes,
> > > at least on Tegra, we need to decompress buffers when we pass them from the GPU
> > > to an engine that doesn't support framebuffer compression.  These things are
> > > not uncommon, particularly when we have SoCs that combine licensed IP blocks
> > > from different vendors.  My point is that user space is already heavily
> > > involved when sharing buffers between drivers, and giving it some more control
> > > over synchronization is not adding that much complexity.
> > > 
> > > 
> > > Because of the above arguments, I think it makes sense to let some user space
> > > drm drivers opt out from implicit synchronization, while allowing them to still
> > > remain fully compatible with the rest of the drm world that uses implicit
> > > synchronization.  In practice, this would require three things:
> > > 
> > > (1) Support passing fences (that are not tied to buffer objects) between kernel
> > >     and user space.
> > > 
> > > (2) Stop automatically storing fences to the buffers that user space wants to
> > >     synchronize explicitly.
> > 
> > The problem with this approach is that you then need hw faulting to make
> > sure the memory is there. Implicit fences aren't just used for syncing,
> > but also to make sure that the gpu still has access to the buffer as long
> > as it needs it. So you need at least a non-exclusive fence attached for
> > each command submission.
> > 
> > Of course on Android you don't have swap (it would kill the puny MMC within
> > seconds) and you don't mind letting userspace pin most of the memory for
> > gfx. So you'll get away with no fences at all. But for upstream I don't
> > see a good solution unfortunately. Ideas very much welcome.
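
For reference, the per-submission bookkeeping itself is small; a rough sketch
with the 3.17-era reservation API (the helper name is made up, locking and
error handling trimmed):

#include <linux/fence.h>
#include <linux/reservation.h>

/*
 * Attach a shared (non-exclusive) fence to a buffer's reservation object
 * after a command submission, so the memory manager knows when the GPU is
 * done with the buffer even if userspace never waits on the fence itself.
 */
static int attach_read_fence(struct reservation_object *resv,
                             struct fence *fence)
{
        int ret;

        ww_mutex_lock(&resv->lock, NULL);
        ret = reservation_object_reserve_shared(resv);
        if (ret == 0)
                reservation_object_add_shared_fence(resv, fence);
        ww_mutex_unlock(&resv->lock);

        return ret;
}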
> 
> Well, I am gonna repeat myself, but yes, you can do explicit sync without
> associating a fence (at least no struct allocation): just associate a unique
> per-command-stream number and have it be global, i.e. irrespective of the
> different execution pipelines your hw has.
> 
> For non-scheduling GPUs, roughly today's generation, you keep buffers on an
> LRU and you know you cannot evict to swap those buffers that have an active
> id (the hw has not yet written the sequence number back). You update the LRU
> with each command stream ioctl.
> 
> For binding to the GPU GART you can do that as a preamble to the command
> stream, which most hardware (AMD, Intel, NVIDIA) should be able to do.
> 
> For VRAM you have several choices that depend on how you want to manage it.
> For instance, you might want to use it more like a cache and have each command
> stream preamble with a bunch of copies to VRAM and possibly a bunch of copies
> back to RAM afterwards. Or you can keep the current scheme, but buffer moves
> now become a preamble to the command stream (i.e. buffer moves are scheduled
> as a preamble to the command stream).
> 
> 
> So I do not see what you would consider rocket science about this?

You just implement fences with a seqno value. Indeed not rocket science
;-)
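
For what it's worth, the core of such a seqno check is tiny; a simplified
sketch (all names made up, wrap-around handled the usual way):

#include <stdbool.h>
#include <stdint.h>

/*
 * Every submission tags the buffers it uses with the stream's sequence
 * number; a buffer may only be evicted to swap once the hardware has
 * written back a seqno that is at least as new.
 */
struct buffer {
        uint32_t last_used_seqno;  /* updated on every submit ioctl */
        /* ... LRU linkage, backing pages, ... */
};

static bool buffer_is_idle(const struct buffer *bo,
                           uint32_t hw_completed_seqno)
{
        /* signed difference so the comparison survives 32-bit wrap-around */
        return (int32_t)(hw_completed_seqno - bo->last_used_seqno) >= 0;
}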

Cheers, Daniel

> 
> Cheers,
> Jérôme
> 
> > 
> > > (3) Allow user space to attach an explicit fence to dma-buf when exporting to
> > >     another driver that uses implicit sync.
> > > 
> > > There are still some open issues beyond these.  For example, can we skip
> > > acquiring the ww mutex for explicitly synchronized buffers?  I think we could
> > > eventually, at least on unified memory systems where we don't need to migrate
> > > between heaps (our downstream Tegra GPU driver does not lock any buffers at
> > > submit, it just grabs refcounts for the hw).  Another quirk is that Nouveau
> > > currently waits on the buffer fences when closing the gem object to ensure it
> > > doesn't unmap too early.  We need to rework that for explicit sync, but that
> > > shouldn't be difficult.
> > 
> > See above, but you can't avoid attaching fences as long as we still use a
> > buffer-object based gfx memory management model. At least afaics. Which
> > means you need the ordering guarantees imposed by ww mutexes to ensure
> > that the oddball implicitly ordered client can't deadlock the kernel's
> > memory management code.
> > 
> > > I have written a prototype that demonstrates (1) by adding explicit sync fd
> > > support to Nouveau.  It's not a lot of code, because I only use a relatively
> > > small subset of the android sync driver functionality.  Thanks to Maarten's
> > > rewrite, all I need to do is to allow creating a sync_fence from a drm fence in
> > > order to pass it to user space.  I don't need to use sync_pt or sync_timeline,
> > > or fill in sync_timeline_ops.
> > > 
> > > I can see why the upstream has been reluctant to de-stage the android sync
> > > driver in its current form, since (even though it now builds on struct fence)
> > > it still duplicates some of the drm fence concepts.  I'd like to think that my
> > > patches only use the parts of the android sync driver that genuinely are
> > > missing from the drm fence model: allowing user space to operate on fence
> > > objects that are independent of buffer objects.
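
As an aside, waiting on such a fence fd from userspace stays trivial, assuming
the Android sync semantics where the fd becomes readable once the fence has
signalled:

#include <errno.h>
#include <poll.h>

/* Block until the fence fd signals; returns 0 on signal, -1 on timeout or
 * error (timeout in ms, -1 = wait forever). */
static int wait_fence_fd(int fence_fd, int timeout_ms)
{
        struct pollfd pfd = { .fd = fence_fd, .events = POLLIN };
        int ret;

        do {
                ret = poll(&pfd, 1, timeout_ms);
        } while (ret == -1 && errno == EINTR);

        return ret == 1 ? 0 : -1;
}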
> > 
> > Imo de-staging the android syncpt stuff needs to happen first, before
> > drivers can use it, since non-staging stuff really shouldn't depend upon
> > code from staging.
> > 
> > > The last two patches are mocks that show how (2) and (3) might work out.  I
> > > haven't done any testing with them yet.  Before going any further, I'd like to
> > > get your feedback.  Can you see the benefits of explicit sync as an alternative
> > > synchronization model?  Do you think we could use the android sync_fence for
> > > passing fences to and from user space?  Or did you have something else in mind for
> > > explicit sync in the drm world?
> > 
> > I'm all for adding explicit syncing. Our plans are roughly:
> > - Add both an in and an out fence to execbuf to sync with other rendering
> >   and give userspace a fence back. Needs two different flags probably.
> > 
> > - Maybe add an ioctl to dma-bufs to get at the current implicit fences
> >   attached to them (both an exclusive and non-exclusive version). This
> >   should help with making explicit and implicit sync work together nicely
> >   (a rough sketch follows after this list).
> > 
> > - Add fence support to KMS. Probably only worth it together with the new
> >   atomic stuff. Again we need an in fence to wait for (one for each
> >   buffer) and an out fence. The latter can easily be implemented by
> >   extending struct drm_event, which means not a single line of driver code
> >   needs to be changed for this.
> > 
> > - For de-staging android syncpts we need to de-clutter the internal
> >   interfaces and also review all the ioctls exposed. Like you say, it
> >   should be just the userspace interface for struct fence. Also, it
> >   needs testcases and preferably manpages.
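
On the dma-buf ioctl above, a purely hypothetical sketch (none of these names
exist today): it would hand the current implicit fences back to userspace as
fence fds, so an explicit-sync client can deal with them itself.

#include <linux/types.h>

#define DMA_BUF_FENCE_EXCL   0x1  /* request the exclusive (write) fence */
#define DMA_BUF_FENCE_SHARED 0x2  /* request a merge of the shared fences */

/* Hypothetical uapi struct, for illustration only. */
struct dma_buf_get_fences {
        __u32 flags;     /* in: which fences to return */
        __s32 fence_fd;  /* out: fd for the requested fence(s), -1 if none */
};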
> > 
> > Unfortunately it looks like Intel won't do all of this for you due to a bunch
> > of hilarious internal reasons :( At least not anytime soon.
> > 
> > Cheers, Daniel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

