[RFC] Explicit synchronization for Nouveau

Rom Lemarchand romlem at google.com
Fri Oct 3 14:56:00 PDT 2014


Riley (CCed) and I will be at Plumbers in a couple weeks.

There is a session on sync planned in the Android track, and of course
we'll be available to chat.

On Thu, Oct 2, 2014 at 1:44 PM, Daniel Vetter <daniel at ffwll.ch> wrote:

> On Thu, Oct 02, 2014 at 05:59:51PM +0300, Lauri Peltonen wrote:
> > +Rom who seems to be presenting about mainlining android sync at
> > linux plumbers
>
> Also add Greg KH as fyi that we're working on de-staging one of the
> android subsystems.
>
> > On Wed, Oct 01, 2014 at 05:58:52PM +0200, Maarten Lankhorst wrote:
> > > You could neuter implicit fences by always attaching the fences as
> > > shared when explicit syncing is used. This would work correctly with
> > > eviction, and wouldn't cause any unneeded syncing. :)
> >
> > Yes, that will probably work!  So, just to reiterate that I
> > understood you and Daniel correctly:
> >
> > - de-stage sync_fence and its user space API (the tedious part)
> > - add dma-buf ioctls for extracting and attaching explicit fences
> > - Nouveau specific: allow flagging gem buffers for explicit sync
> >   - do not check pre-fences from explicitly synced buffers at submit
> >   - continue to attach a shared fence after submit so that pinning and
> >     unmapping continue to work
>
>     - Having explicit in/out fences for the pushbuf ioctl is missing in
>       this step, I guess?
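>
> A rough sketch of how those dma-buf fence ioctls could look (names and
> ioctl numbers completely made up, just to illustrate the idea):
>
>     #include <linux/types.h>
>
>     /* Hypothetical uapi sketch, not actual kernel code. */
>     struct dma_buf_fence {
>             __s32 fd;     /* sync_fence fd, in or out */
>             __u32 flags;  /* DMA_BUF_FENCE_EXCL or DMA_BUF_FENCE_SHARED */
>     };
>
>     /* Attach an explicit fence to the buffer's reservation object. */
>     #define DMA_BUF_IOC_ATTACH_FENCE  _IOW('b', 0x20, struct dma_buf_fence)
>
>     /* Extract the current implicit fence(s) as a new sync_fence fd. */
>     #define DMA_BUF_IOC_EXTRACT_FENCE _IOWR('b', 0x21, struct dma_buf_fence)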
>
> I also think we need some kind of demonstration vehicle using nouveau to
> satisfy Dave Airlie's open-source userspace requirements for new
> interfaces. Might be good to chat with him to make sure we have that
> covered (and how much needs to be there really).
>
> > Then working sets and getting rid of locking all buffers individually
> > can be dealt with later as an optimization.
>
> Yeah, sounds like a good plan.
>
> > On Wed, Oct 01, 2014 at 07:27:21PM +0200, Daniel Vetter wrote:
> > > On Wed, Oct 01, 2014 at 06:14:16PM +0300, Lauri Peltonen wrote:
> > > > Implicit fences attached to individual buffers are one way for
> > > > residency management.  Do you think a working set based model
> > > > could work in the DRM framework?  For example, something like
> > > > this:
> > > >
> > > > - Allow user space to create "working set objects" and associate
> > > >   buffers with them.  If the user space doesn't want to manage
> > > >   working sets explicitly, it could also use an implicit default
> > > >   working set that contains all buffers that are mapped to the
> > > >   channel vm (on Android we could always use the default working
> > > >   set since we don't need to manage residency).  The working sets
> > > >   are initially marked as dirty.
> > > > - User space tells which working sets are referenced by each work
> > > >   submission.  Kernel locks these working sets, pins all buffers
> > > >   in dirty working sets, and resets the dirty bits.  After
> > > >   kicking off work, kernel stores the fence to the _working
> > > >   sets_, and then releases the locks (if an implicit default
> > > >   working set is used, then this would be roughly equivalent to
> > > >   storing a fence to channel vm that tells "this is the last hw
> > > >   operation that might have touched buffers in this address
> > > >   space").
> > > > - If swapping doesn't happen, then we just need to check the
> > > >   working set dirty bits at each submit.
> > > > - When a buffer is swapped out, all working sets that refer to it
> > > >   need to be marked as dirty.
> > > > - When a buffer is swapped out or unmapped, we need to wait for
> > > >   the fences from all working sets that refer to the buffer.
> > > >
> > > > Initially one might think of working sets as a mere optimization -
> > > > we now need to process a few working sets at every submit instead
> > > > of many individual buffers.  However, it makes a huge difference
> > > > because of fences: fences that are attached to buffers are used
> > > > for implicitly synchronizing work across different channels and
> > > > engines.  They are in the performance critical path, and we want
> > > > to carefully manage them (that's the idea of explicit
> > > > synchronization).  The working set fences, on the other hand,
> > > > would only be used to guarantee that we don't swap out or unmap
> > > > something that the GPU might be accessing.  We never need to wait
> > > > for those fences (except when swapping or unmapping), so we can
> > > > be conservative without hurting performance.
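> > > >
> > > > In code, the per-set bookkeeping could be as small as something
> > > > like this (hypothetical sketch, just to show the dirty bit and
> > > > the per-working-set fence; all names made up):
> > > >
> > > >     struct working_set {
> > > >             struct mutex lock;
> > > >             struct list_head buffers;  /* bos in this set */
> > > >             bool dirty;                /* a bo was added or
> > > >                                         * swapped out */
> > > >             struct fence *last_fence;  /* last submit touching
> > > >                                         * this set */
> > > >     };
> > > >
> > > > At submit we'd pin the buffers only when dirty is set, then just
> > > > replace last_fence; swap-out and unmap wait on last_fence and set
> > > > dirty again.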
> > >
> > > Yeah, within the driver (i.e. for private objects which are never
> > > exported to dma_buf) we can do stuff like this now. And your above
> > > idea is roughly one of the things we're tossing around for i915.
> > >
> > > But the cool stuff with drm is that cmd submission is
> > > driver-specific, so you can just go wild with nouveau. Of course
> > > you have to convince the nouveau guys (and also have open-source
> > > users for the new interface).
> > >
> > > For shared buffers I think we should stick with the implicit fences
> > > for a while, simply because I'm not sure whether it's really worth
> > > the fuss. And reworking all the drivers and dma-buf for some
> > > working sets is a lot of fuss ;-) Like Maarten said, you can mostly
> > > short-circuit the implicit fencing by only attaching shared fences.
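> > >
> > > I.e. at submit time roughly this (pseudo-code using the
> > > reservation_object helpers, error handling omitted):
> > >
> > >     if (explicit_sync) {
> > >             /* No exclusive fence, so implicit sync never stalls
> > >              * on us; eviction still waits via the shared fence. */
> > >             reservation_object_reserve_shared(bo->resv);
> > >             reservation_object_add_shared_fence(bo->resv, fence);
> > >     } else {
> > >             reservation_object_add_excl_fence(bo->resv, fence);
> > >     }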
> >
> > Yes, I'll try to do that.
> >
> >
> > > In case you're curious: The idea is to have a 1:1 association
> > > between ppgtt address spaces and what you call the working set
> > > above, to implement the buffer svm model in ocl2. Mostly because we
> > > expect that applications won't get the more fine-grained buffer
> > > list right anyway. And this kind of gang-scheduling of working set
> > > sizes should be more efficient for the usual case where everything
> > > fits.
> >
> > If I understood correctly, this would be exactly the same as what I
> > called the "default working set" above.  On Android we don't care
> > much about finer grained working sets either, because of UMA and no
> > swapping.
>
> Yeah, that's pretty much the idea. Well with ocl the SVM buffer working
> set sizes are attached to an ocl context, but I don't think there'll be
> more than one of those around really (except maybe when different
> libraries all use ocl, but don't work together on the same data).
>
> > > > > Imo de-staging the android syncpt stuff needs to happen first,
> > > > > before drivers can use it. Since non-staging stuff really
> > > > > shouldn't depend upon code from staging.
> > > >
> > > > Fully agree.  I thought the best way towards that would be to
> > > > show some driver code that _would_ use it. :)
> > >
> > > Oh, there's the usual chicken&egg where we need a full-blown
> > > prototype before we can start merging. Interface work on upstream
> > > is super-hard, but totally justified given the ridiculous backwards
> > > compat guarantees Linus expects us to keep up. Mistakes are really
> > > expensive. So I'm happy to see you charge ahead here.
> >
> > Given that I'm not used to working with upstream, don't expect too
> > much from my "charging ahead". :)  I'm still secretly hoping that the
> > Android guys at Google would jump in to help, now that we seem to
> > agree that we could de-stage sync_fence.
>
> Forget that, imo the android guys are 100% absorbed with their own stuff,
> at least on the gfx side. Occasionally they pipe up with "btw this is what
> we're doing now".
>
> Also, the problem is that to actually push android stuff out of staging
> you need a use-case in upstream, which means an open-source gpu driver.
> There's not a lot of companies who have both that and ship android, and
> definitely not the nexus/android lead platforms.
>
> Display side would be easier since there's a bunch of kms drivers now
> upstream. But given that google decided to go ahead with their own adf
> instead of drm-kms that's also a non-starter.
>
> Heck, I have a hard time dragging our own android folks here at intel into
> upstream work ...
>
> > > > > I'm all for adding explicit syncing. Our plans are roughly:
> > > > >
> > > > > - Add both an in and an out fence to execbuf to sync with other
> > > > >   rendering and give userspace a fence back. Needs two
> > > > >   different flags probably.
> > > > >
> > > > > - Maybe add an ioctl to dma-bufs to get at the current implicit
> > > > >   fences attached to them (both an exclusive and non-exclusive
> > > > >   version). This should help with making explicit and implicit
> > > > >   sync work together nicely.
> > > > >
> > > > > - Add fence support to kms. Probably only worth it together
> > > > >   with the new atomic stuff. Again we need an in fence to wait
> > > > >   for (one for each buffer) and an out fence. The latter can
> > > > >   easily be implemented by extending struct drm_event (see the
> > > > >   sketch below), which means not a single driver code line
> > > > >   needs to be changed for this.
> > > > >
> > > > > - For de-staging android syncpts we need to de-clutter the
> > > > >   internal interfaces and also review all the ioctls exposed.
> > > > >   Like you say it should be just the userspace interface for
> > > > >   struct fence. Also, it needs testcases and preferably
> > > > >   manpages.
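> > > > >
> > > > > The drm_event extension mentioned above could e.g. look like
> > > > > this (made-up event type, just to illustrate):
> > > > >
> > > > >     struct drm_event_fence {
> > > > >             struct drm_event base;  /* .type = DRM_EVENT_FENCE */
> > > > >             __s32 fence_fd;         /* signals when the flip
> > > > >                                      * completes */
> > > > >     };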
> > > >
> > > > This all sounds very similar to what we'd like to do!  Maybe we
> > > > can move forward with these parts, and continue to attach fences
> > > > at submit until we have a satisfactory solution for the pinning
> > > > problem?
> > >
> > > Yeah, that's our plan for i915 too. First add explicit fences, then
> > > figure out whether we need to be better at neutering the implicit
> > > fences, in case and only where it really gets in the way.
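> > >
> > > Strawman for the execbuf side (flag names made up, the fence fd
> > > would go into one of the reserved execbuffer fields):
> > >
> > >     #define EXEC_FENCE_IN  (1 << 0)  /* wait for the passed-in
> > >                                       * sync_fence fd first */
> > >     #define EXEC_FENCE_OUT (1 << 1)  /* hand back a fence fd for
> > >                                       * this batch */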
> > >
> > > > I'd like to understand what the concrete steps to de-stage struct
> > > > sync_fence are, since that's the first thing that needs to be
> > > > done.  For example, what do you mean by "de-cluttering the
> > > > internal interfaces"?  Just that we'd move the sync_fence parts
> > > > from drivers/staging/android/sync.c to, say,
> > > > drivers/dma-buf/sync-fence.c ?  Would we still leave a copy of
> > > > the existing full driver in staging/android?
> > >
> > > Yeah I guess that would be an approach. Personally I think we
> > > should also have basic ioctl testcases for all the ioctls exposed
> > > by syncpt fds. And reviewing the kerneldoc for the driver-internal
> > > interfaces (which includes removing everything that's now made
> > > obsolete by struct fence). Bonus points for documenting the ioctls.
> > > We could throw the test binary into libdrm maybe, there's a bunch
> > > of others like it already there.
> > >
> > > I'm not sure whether/how much google has already for this.
> >
> > The simplest way to add tests is if we allow user space to create
> > and trigger fences.  We don't want to enable that in the kernel by
> > default, because that opens possibilities for deadlocks (e.g. a
> > process could deadlock a compositor by passing a fence that it never
> > triggers).  The Android sync driver solves this by having a separate
> > CONFIG_SW_SYNC_USER that can be used for testing.
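> >
> > With that, a test boils down to something like this (sketch using
> > the sw_sync helpers from Android's libsync, linked below; error
> > handling omitted):
> >
> >     int timeline = sw_sync_timeline_create();
> >     int fence = sw_sync_fence_create(timeline, "test", 1);
> >
> >     /* Not signaled yet: a zero-timeout wait must fail. */
> >     assert(sync_wait(fence, 0) < 0);
> >
> >     /* Trigger the fence from user space ... */
> >     sw_sync_timeline_inc(timeline, 1);
> >
> >     /* ... and now the wait succeeds. */
> >     assert(sync_wait(fence, 0) == 0);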
> >
> > Here's a simple test by Google:
> >
> > https://android.googlesource.com/platform/system/core/+/master/libsync/sync_test.c
>
Hm, usually we expose such test interfaces through debugfs - that way
production systems won't ever ship with it (since there are too many
exploits in there, especially with secure boot). But since you need it
for validation tests (at least for the i915 suite) it should always be
there when you need it.

Exposing this as a configurable driver in /dev is imo a no-go. But we
should be able to easily convert this into a few debugfs files, so not
too much fuss hopefully.
>
> > And this is their userspace wrapper lib:
> >
> > https://android.googlesource.com/platform/system/core/+/master/libsync/sync.c
> >
> > https://android.googlesource.com/platform/system/core/+/master/libsync/include/sync/sync.h
> >
> > Would that go to libdrm now...?
>
> Imo it would make sense. At least we kinda want to use fences outside of
> Android, and a separate library project feels like overkill.
>
> > > Aside: Will you be at XDC or linux plumbers? Either would be a perfect
> > > place to discuss plans and ideas - I'll attend both.
> >
> > I wasn't going to, but let's see.  The former is pretty soon and the
> > latter is sold out.  At least Andy Ritger from Nvidia is coming to
> > XDC for sure, and he's been involved in our internal discussions
> > around these topics.  So I suggest you have a chat with him at
> > least! :)
>
> I'll definitely have a chat (and some beers) with Andy, been a while
> since I've last seen him ;-)
>
> Cheers, Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>