Future sync points (Re: Wayland and scheduling)

Mon Aug 11 04:52:26 PDT 2014

On Thu, 7 Aug 2014 16:24:23 -0700
"Brian C. Anderson" <brianderson at google.com> wrote:

> Responses inline. Feel free to cc wayland-devel, or I can start a new
> thread there.

Ok, I'm CC'ing wayland-devel, let's see if more interested people turn
up. :-)

Before anyone else replies here: please do read the Trustable Future
Sync Points document below, so you see what this is all about.

> On Thu, Aug 7, 2014 at 4:11 AM, Pekka Paalanen <ppaalanen at gmail.com> wrote:
> 
> > Hi,
> >
> > this is very interesting. Since you didn't write to the wayland-devel
> > mailing list, I suppose you are not yet ready to ask or propose this in
> > public?
> 
> 
> > I'm not really sure who else would have time to comment on this, so if
> > you feel comfortable, I'd suggest posting to wayland-devel.
> >
> 
> The doc is already public, but I'm trying to get feedback/support before
> trying to push it more widely.  The doc has already gone through a few
> revisions, so I'd be happy to share it with wayland-devel.  If Wayland
> finds aspects interesting, I think it'll help convince committees and
> vendors that it's important, since there will be two parties wanting the
> features, rather than just one.
> 
> 
> >
> > Since we do not yet use any explicit synchronization in Weston other
> > than the basic Wayland core protocol, I am not too clear on what would
> > be solved by "normal" sync points and what needs future sync points.
> >
> > Unfortunately I am not familiar with the sync frameworks in Android or
> > coming to Linux DRM yet.
> >
> > I read the whole "Trustable Future Sync Points" document. Is your
> > implementation design assuming that everything will run within one
> > process?
> 
> 
> I tried not to assume much and want to support clients and services being
> different threads, processes, or even different hardware units.
> 
> 
> > Or at least that you have control over all the code that is
> > running?
> 
> 
> I want it to be able to work even for software that isn't trusted by the
> system.
> 
> 
> > Is each service a user space daemon (process) or thread?
> 
> 
> It could be, as Chrome's GPU service and Browser processes are, but ideally
> it would support kernel services and hardware services too eventually.

Very good. These questions were prompted by how the document was
written: the wording seems to imply some of these, especially as
there is not a single word about the kernel, IIRC.

> > More below.
> >
> > On Wed, 6 Aug 2014 14:25:46 -0700
> > "Brian C. Anderson" <brianderson at google.com> wrote:
> >
> > > Another way to look at it: Microsoft's DWM has fine-grained context
> > > preemption
> > > <
> > http://msdn.microsoft.com/en-us/library/windows/hardware/jj553428(v=vs.85).aspx
> > >
> > > and I'm trying to propose something more generic that can be used by the
> > > desktop compositor as well as applications.
> > >
> > >
> > > On Wed, Aug 6, 2014 at 2:14 PM, Brian C. Anderson <
> > brianderson at google.com>
> > > wrote:
> > >
> > > > Hi Pekka,
> > > >
> > > > For some more background on what I'm referring to, please see:
> > > >
> > > > Trustable Future Sync Points
> > > > <
> > https://docs.google.com/document/d/1qqu8c5Gp1faY-AY4CgER-GKs0w7GXlR5YJ-BaIZ4auo/edit?usp=sharing
> > >
> >
> > Yes... I can see several cases where we could use this, compared to what
> > we currently have, although the sync framework Maarten Lankhorst has been
> > developing might solve some of our issues:
> >
> > - Sending a buffer from a client to the compositor before the
> >   client's GPU rendering to the buffer is complete is good for the
> >   latency, but:
> >
> >         a) on platforms with implicit syncing (Linux DRM with single
> >         gfx device) the client's rendering may delay the final page
> >         flip by arbitrary amounts, presumably causing the whole desktop
> >         to stutter.
> 
> 
> A client lowering its priority, via DecreasePriorityUntil, could be
> implemented as a hint to defer a client's commands until the server wants
> to consume those commands, which would avoid a client producing two frames
> server-side before the compositor consumes the first server-side. See my
> discussion regarding a "correctness" vs "scheduling" sync point below.

A note to new readers: I think "server-side" here means whoever is
actually doing the work to signal a (future) sync point. So a Wayland
compositor is a client to the GPU (the kernel driver), the kernel
driver or even the GPU hardware itself being the server. A Wayland
client that is rendering, is a client to the GPU, too.

I hope I got that right.

http://en.wikipedia.org/wiki/Futures_and_promises linked from the
document was really helpful in getting the terminology.

> What does implicit syncing mean exactly? That a flush implies ordering
> across channels? That ordering is guaranteed without CPU intervention?

I think there are two things. One is that GPUs have traditionally been
executing tasks sequentially, which is the "flush implies ordering" case.
But I think that there also are in-kernel fences attached to command
batches or buffers, that are used to guarantee the ordering, rather
than relying on flushing alone. And these fences don't work
cross-device automatically. But this is just hearsay for me, I haven't
actually worked with them.

> >
> >         b) on platforms without implicit syncing (RPI, cross-GPU,
> >         V4L->GPU?), the compositor could be presenting incomplete
> >         content.
> >
> >   These problems could be relieved, if the compositor could inspect
> >   whether the buffer has completed or not, and use it in compositing
> >   only after it is complete. (Assuming the compositor can get its
> >   compositing job running regardless of clients' jobs.)
> >
> 
> SyncPointClientWaitUntilComplete should avoid (b) by allowing the
> compositor's *client side* to inspect whether the buffer has completed or
> not. If we have sync point shader arguments
> <https://docs.google.com/a/chromium.org/document/d/1GlnjZI0jDNPXZIlhdcU135BGnsP7T3WY0UR4IwjEILU/edit#heading=h.ux6vs3cl776r>
> and
> bindless textures
> <https://www.opengl.org/registry/specs/ARB/bindless_texture.txt>, then we
> could also allow the compositor to inspect whether the buffer is complete
> *server/hardware side* eventually.

Yes, the client-side inspection is what a compositor would need. I
think there is quite a lot of window system state that would be managed
outside of the GPU, and therefore could not be dealt with server-side at
all. State, which we should be keeping in sync with what the user is
actually seeing.

> >
> > - We currently have buffer release (i.e. the compositor is done with
> >   the buffer, the client gets permission to re-use) baked into the
> >   Wayland core protocol, which is inconvenient for hardware accelerated
> >   stacks. If we used your (future) sync points, we could send a sync
> >   point to the client already when we know, that the next e.g. KMS
> >   atomic flip will release the buffer, instead of waiting after the
> >   next vblank.
> >
> >   IOW, we could send the buffer release (the sync point/object
> >   actually) ahead of time, and we would probably avoid the problems
> >   we currently have with the Wayland buffer release event. And the
> >   client might schedule more rendering against that buffer even before
> >   the vblank doing the release actually happens. Hmm, this seems *very*
> >   attractive for the Wayland protocol, if we just get the GBM and EGL
> >   APIs to cooperate with it in the compositor.
> >
> 
> This is exactly one of the motivating use cases for Chrome. Future sync
> points would allow the Browser to send a buffer back to a Renderer before
> the Browser is actually done with it. Chrome can implement future sync
> points in its GPU Service, but it's probably not as efficient as a kernel
> implementation. I'm not sure if Weston can do something similar, if there's
> no driver support.

I would assume that driver support is required there.

I suppose a compositor could hand the application, say, a file
descriptor that the application can poll() ahead of time, but that
would still be all "manual" synchronization, not too different from the
wl_buffer.release event we have now (though even a file descriptor
would solve the major problem of wl_buffer.release by providing an
out-of-band immediate notification). Or maybe a futex in shared memory,
didn't X11 use something like that...
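
As a rough illustration of the file descriptor idea above, here is a
minimal Python sketch. It assumes a Unix-like system and uses a plain
pipe to stand in for whatever object a compositor would really hand
out. The names make_release_fd, signal_release and wait_for_release are
made up for this example; they are not part of Wayland or Weston.

```python
import os
import select


def make_release_fd():
    """Compositor side: create a (read_fd, write_fd) pair.

    The read end is what would be passed to the client (hypothetical).
    """
    return os.pipe()


def signal_release(write_fd):
    """Compositor side: mark the buffer as released by writing one byte."""
    os.write(write_fd, b"\x01")


def wait_for_release(read_fd, timeout_ms):
    """Client side: poll() the fd; return True if the release was signaled."""
    poller = select.poll()
    poller.register(read_fd, select.POLLIN)
    events = poller.poll(timeout_ms)
    return any(fd == read_fd and (ev & select.POLLIN) for fd, ev in events)
```

The client can call wait_for_release() with a zero timeout to check
without blocking, or add the descriptor to its main loop. This is still
"manual" synchronization: nothing here lets the GPU itself wait on the
release.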

> > The nested embedders use case may apply on Wayland as well, as I have
> > heard that e.g. KWin might be running on top of Weston or some other
> > compositor that does the hardware/DRM specific bits, so that KWin can
> > concentrate on working with just Wayland APIs. Another case is application
> > embedding, which on Wayland is thought to happen by the embedder being
> > a mini-Wayland-compositor that the embeddee connects to.
> >
> > Chaining... the Presentation extension we have designed (not merged yet)
> > for Wayland allows clients to submit buffers with target presentation
> > timestamps. Currently the compositor will have to run its repaint loop
> > to process the queues just in time for the next vblank. If we could
> > have working chaining as you planned, I could see use cases for pushing
> > the whole queue into gfx hardware. This would be useful e.g. for video
> > playback, when little else is happening at the same time.
> >
> > Could we be prioritizing clients' render jobs by having the compositor
> > do the actual scheduling and kick to the GPU by poking the sync point
> > objects? So that the compositor could actually decide which tasks in
> > which order get kicked to the GPUs, while GPU task pre-emption feature
> > is still lacking.
> >
> 
> Yes, future sync points would allow for exactly this. The original idea in
> an older doc
> <https://docs.google.com/a/chromium.org/document/d/1hjVckIpb9WBE7A9HUxAmutRJEgSkZO_JAFfgwOE-8NE/edit>
> was
> to have a "scheduling" sync point and "correctness" sync point that a
> client waits on. The "scheduling" sync point must be a future sync point to
> allow the compositor to defer its decision about when to schedule each
> child.
> 
> In the newer doc, I generalized the "scheduling" sync point to be a sync
> point the client can decrease its priority on. The implementation can
> choose to implement that priority change as deferral of execution (as
> proposed in the old doc) or via pre-emption, if supported by the service.

Right, so a GPU would have a run queue of tasks, and the order of the
tasks would depend on the priority of the sync point. And the
compositor could decrease the priority of any sync points it knows
about, to achieve certain GPU scheduling order?

Creating the sync point is still voluntary though, isn't it? So the GPU
driver would need a policy for tasks without explicit sync points.

I mean voluntary in the sense, that what an application is rendering,
might not even be intended for display at all.
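
To make the question concrete, here is a toy Python model of such a run
queue. It is only a sketch of my reading, not anything from the
document: the RunQueue class, the numeric priorities and the
DEFAULT_PRIORITY policy for tasks without a sync point are all
assumptions made for illustration.

```python
import heapq
import itertools

DEFAULT_PRIORITY = 0  # assumed policy for tasks submitted without a sync point


class RunQueue:
    """Toy run queue: higher priority value runs first, FIFO among equals."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()
        self._priority = {}  # task name -> current priority

    def submit(self, task, priority=DEFAULT_PRIORITY):
        self._priority[task] = priority
        heapq.heappush(self._heap, (-priority, next(self._order), task))

    def decrease_priority(self, task, new_priority):
        """Lower a queued task's priority; raising is not allowed here."""
        if new_priority > self._priority[task]:
            raise ValueError("this call may only decrease priority")
        self._priority[task] = new_priority
        heapq.heappush(self._heap, (-new_priority, next(self._order), task))

    def pop_next(self):
        """Return the next task to run, skipping stale heap entries."""
        while self._heap:
            neg_prio, _, task = heapq.heappop(self._heap)
            if self._priority.get(task) == -neg_prio:
                del self._priority[task]
                return task
        return None
```

If a client and the compositor are both submitted at priority 5 and the
client's task is then lowered to 1, pop_next() returns the compositor's
task first, then the client's, and only then anything left at the
default priority.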

> >
> > I didn't really understand enough to comment on your API design, but I
> > can say what I think Wayland would require:
> >
> > - The sync point manager would likely have to live in the kernel, no?
> >   We need to be able to pass (sync?) objects between processes...
> >
> 
> Yes, it would have to live somewhere that every client and service can
> trust, so the kernel would be a good place to centralize the knowledge of
> all sync points. Ideally hardware could support the API when there are just
> I/O devices talking to each other so we can avoid involving the CPU, but
> that wouldn't be a short-term goal.
> 
> 
> >
> > - ...and for doing that, we can use basic data types and file
> >   descriptors. Using file descriptors would probably solve a lot of
> >   refcounting problems, and guarantee security (no guessing of valid
> >   handles etc.).
> >
> 
> I'm not sure what's best here. I was also hoping a future sync point could
> be created without going to the SyncPointManager. In Chrome, a round trip
> to our GPU Service takes 0.33 ms on a fast machine, but we may be able to
> work around this some other way.

I'm not sure you can create new sync points without a roundtrip to the
kernel. The sync point manager needs to know about them, no? Or does it?

> I know Chrome runs out of file descriptors sometimes, so I'd be wary of an
> implementation that needs lots of file descriptors. A file descriptor per
> channel/context/timeline might be ok. Also, sync points / fences would
> ideally be something I/O devices can wait on if a platform (such as HSA's
> signals/doorbells) supports that.
> 
> Regarding refcounting problems: We'd only need to manage the lifetime of a
> channel/context/timeline. A sync point would just be a particular value
> corresponding to a location on the timeline, rather than an object whose
> lifetime must be tracked.

Yes, that makes sense to me. Track a context with a file descriptor,
but per-context sync points are just integers from a monotonic sequence.
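
A small Python sketch of that model, under my own assumptions: the
Timeline class and its method names are invented here, and in a real
design the timeline itself would be the object tracked through a file
descriptor.

```python
class Timeline:
    """One context: sync points are integers from a monotonic sequence."""

    def __init__(self):
        self._next_point = 1  # next value to hand out
        self._signaled = 0    # highest point that has completed

    def create_sync_point(self):
        """Hand out the next integer; no per-point object is allocated."""
        point = self._next_point
        self._next_point += 1
        return point

    def signal(self, point):
        """Mark everything up to and including `point` as complete."""
        if point >= self._next_point:
            raise ValueError("cannot signal a point that was never created")
        self._signaled = max(self._signaled, point)

    def is_complete(self, point):
        return point <= self._signaled
```

Only the timeline has a lifetime to manage. A sync point is just a
number that can be copied around freely and compared against the
timeline's signaled value.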

> Regarding security: SyncPointClientWaitUntilSchedulable or
> SyncPointClientWaitUntilComplete should be used whenever a client doesn't
> trust the sync point it gets from another client, which should cover any
> deadlocking issues. Are there other security issues to consider?

I can't think of any right now, but it's a complex thing.

> >
> > - Clients are expected to submit their render jobs directly to the
> >   kernel, not go through a user space daemon. Only the buffer handles
> >   for presentation go through the daemon (the compositor), and we could
> >   start passing some sync-related objects, too.
> >
> 
> The API in the doc should support this.
> 
> 
> >
> > - Integration with not only GPUs, but also with other hardware blocks,
> >   like video decoders. These do not usually have (secure) user space
> >   daemons.
> >
> 
> The API tries to take this into account by having the concept of untrusted
> services. See AddTrustedService in the doc. If the service is not trusted,
> then its sync points are not considered schedulable by
> SyncPointClientWaitUntilSchedulable until they are complete. There would be
> fewer pipeline bubbles if services can be trusted though.

I'm still quite hazy on what defines trustability and how would you
know... likely I just haven't spent enough time thinking about it.
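
For what it is worth, here is how I read the rule quoted above, as a
Python sketch. The SyncPoint fields and the is_schedulable helper are
hypothetical, and treating "scheduled" as "the work that will signal the
point has been queued" is my assumption, not something the document
states.

```python
from dataclasses import dataclass


@dataclass
class SyncPoint:
    service: str             # which service is expected to signal it
    scheduled: bool = False  # work that will signal it has been queued
    complete: bool = False   # the work has actually finished


def is_schedulable(point, trusted_services):
    """Return True if a waiter may treat `point` as schedulable."""
    if point.complete:
        return True
    # Incomplete points count only when a trusted service has queued the work.
    return point.scheduled and point.service in trusted_services
```

With an untrusted service the waiter effectively falls back to waiting
for completion, which is where the extra pipeline bubbles would come
from.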

> >
> > - The compositor, or the kernel, will likely not know if some
> >   apps/clients are low or high latency producers, unless we do some
> >   rough categorization: compositor's compositing low, hw video decoder
> >   low(?), a random app using GL high... and even still, the compositor
> >   may want to apply its own policy. E.g. in an IVI system, the client
> >   drawing the speedometer should be considered low-latency, while a
> >   video player would likely count as high-latency or untrusted. In the
> >   end, I think priorities should be controllable by the compositor or
> >   just by privileged processes, everything else getting the default
> >   low-priority/high-latency/untrusted treatment.
> >
> 
> I was hoping to stay away from static categories and rely on
> dynamic priorities instead. A client may be low or high latency depending
> on how many frames it has queued. If a video decoder has 0 frames queued
> because of a previous hiccup, it shouldn't stay low priority.

Certainly. Static categorization is domain specific knowledge, and
probably applied by a particular compositor. I think I was just a
little confused on where trust emerges from.

> >
> > - We cannot trust a Wayland client to cooperate, e.g. decrease its own
> >   priority or cancel something because it is going to block on something
> >   else, etc.
> >
> 
> I was imagining the dynamic priority *decreases* would occur behind the
> client's back in many cases, inside the client-side implementation of swap
> buffers for example. A client that doesn't decrease its priority isn't the
> end of the world, security wise.

Okay, though I probably would not trust the "client-side implementation
of swap buffers"... wait, client-side in which sense?

I just mean that there could be a different, say, libEGL.so in use than
we expect.

> 
> Dynamic priority *increases* use a different function, and only a
> higher-priority channel (the compositor's), can increase the priority of
> lesser channels. Dynamic priority increases do not require a cooperating or
> trusted client.

Very good.
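
The permission check could be as simple as the following Python sketch.
The function name and the numeric priorities are made up, and capping
the new priority at the caller's own priority is my assumption; the
document only says that a higher-priority channel may raise lesser
ones.

```python
def may_increase_priority(caller_priority, target_priority, new_priority):
    """Check whether a caller may raise a target channel's priority.

    Only a strictly higher-priority caller may do so, and (assumed here)
    it cannot lift the target above its own priority.
    """
    return (caller_priority > target_priority
            and target_priority < new_priority <= caller_priority)
```

Nothing in this check depends on the target channel cooperating, which
is the point: the compositor can boost a client without trusting it.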

> >
> > Btw. AFAIU user space signaled fences that GPUs could wait for were
> > proposed, and got shot down on dri-devel@, IIRC. Maybe they lacked a
> > proper plan to avoid indefinite waits and deadlocks, I don't know.
> >
> 
> I guess there are a few aspects to determine support for:
> 1) CPU->CPU
> 2) CPU->IO
> 3) IO->IO
> 4) IO->CPU

I have to say this is going over my head now. :-)

> 
> CPU->IO is the controversial one. In most cases, I don't expect IO to have
> *hard* dependencies on a CPU sync point unless it's from a trusted kernel
> service. I do imagine *soft* dependencies could be valuable if sync point
> shader arguments
> <https://docs.google.com/a/chromium.org/document/d/1GlnjZI0jDNPXZIlhdcU135BGnsP7T3WY0UR4IwjEILU/edit#heading=h.ux6vs3cl776r>
>  and bindless textures
> <https://www.opengl.org/registry/specs/ARB/bindless_texture.txt> are
> supported.
> 
> 
> > Anyway, this is very interesting stuff, but I get the feeling that your
> > design might not suit Wayland in general if there is no kernel
> > managing the sync objects. What do you think?
> >
> 
> I agree that there needs to be kernel support, especially if the channel
> prioritization concepts ever want to integrate with the kernel's
> thread scheduling.

Not only CPU threads, but I was thinking about all the GPU etc.
drivers, which do not have a user space daemon as a front-end.

Thanks,
pq

> > > >
> > > > Basic Ordering
> > > > <
> > https://docs.google.com/a/chromium.org/document/d/1qqu8c5Gp1faY-AY4CgER-GKs0w7GXlR5YJ-BaIZ4auo/edit?pli=1#bookmark=id.jlpx882gp8op
> > >
> > > > Deadlock Avoidance and Trust
> > > > <
> > https://docs.google.com/a/chromium.org/document/d/1qqu8c5Gp1faY-AY4CgER-GKs0w7GXlR5YJ-BaIZ4auo/edit?pli=1#bookmark=id.7dnpwpwfez0s
> > >
> > > > Scheduling and Priorities
> > > > <
> > https://docs.google.com/a/chromium.org/document/d/1qqu8c5Gp1faY-AY4CgER-GKs0w7GXlR5YJ-BaIZ4auo/edit?pli=1#bookmark=id.lu446lhx5av>
> > <---
> > > > this section in particular
> > > >
> > > > Sync Point Shader Arguments
> > > > <
> > https://docs.google.com/document/d/1GlnjZI0jDNPXZIlhdcU135BGnsP7T3WY0UR4IwjEILU/edit?usp=sharing
> > >
> > > >
> > > > Thanks!
> > > > Brian