[RFC v2] Wayland presentation extension (video protocol)

Pekka Paalanen ppaalanen at gmail.com
Thu Feb 20 03:07:25 PST 2014

Hi Mario,

I have replies to your comments below, but while reading what you said,
I started wondering whether Wayland would be good for you after all.

It seems that for your timing-sensitive experiment programs, you and
your developers put a great deal of effort into
- detecting the hardware and drivers,
- determining how the display server works, so that you can
- try to make it do exactly what you want, and
- detect if it still does not do what you want and bail out, while also
- trying to make sure you get the right timing feedback from the kernel.

It sounds like the display server is a huge source of problems for you,
and I am not quite sure how running on top of a display server benefits
you.
Your experiment programs want to be in precise control, get accurate
timings, and they are always fullscreen. Your users / test subjects
never switch away from the program while it's running, you don't need
windowing or multi-tasking, AFAIU, nor any of the application
interoperability features that are the primary features of a display

Why not take the display server completely out of the equation?

I understand that some years ago, it would probably not have been
feasible and X11 was the de facto interface to do any graphics.

However, it seems you are already married to DRM/KMS so that you get
accurate timing feedback, so why not port your experiment programs
(the framework) directly on top of DRM/KMS instead of Wayland?

With Mesa EGL and GBM, you can still use hardware-accelerated OpenGL if
you want to, but you will also be in explicit control of when you
push that rendered buffer into KMS for display. Software rendering by
direct pixel poking is also possible and at the end you just push that
buffer to KMS as usual too. You do not need any graphics card specific
code, it is all abstracted in the public APIs offered by Mesa and
libdrm, e.g. GBM. The new libinput should make hooking into input
devices much less painful, etc. All this is thanks to Wayland, because
on Wayland there is no single "the server" the way the X.org X server
is. There will be lots of servers, and each one needs the same
infrastructure you would need to run without a display server.

No display server obfuscating your view of the hardware, no
compositing manager fiddling with your presentation, and most likely no
random programs hogging the GPU at random times. Would the trade-off
not be worth it?

Of course your GUI tools and apps could continue using a display server,
and would probably want to be ported to Wayland eventually; I'm just
suggesting this for the timing-sensitive experiment programs. Would this
be possible for your infrastructure?

On Thu, 20 Feb 2014 04:56:02 +0100
Mario Kleiner <mario.kleiner.de at gmail.com> wrote:

> On 17/02/14 14:12, Pekka Paalanen wrote:
> > On Mon, 17 Feb 2014 01:25:07 +0100
> > Mario Kleiner <mario.kleiner.de at gmail.com> wrote:
> >
> >> Hello Pekka,
> >>
> >> i'm not yet subscribed to wayland-devel, and a bit short on time atm.,
> >> so i'll take a shortcut via direct e-mail for some quick feedback for
> >> your Wayland presentation extension v2.
> >
> > Hi Mario,
> >
> > I'm very happy to hear from you! I have seen your work fly by on
> > dri-devel@ (IIRC) mailing list, and when I was writing the RFCv2 email,
> > I was thinking whether I should personally CC you. Sorry I didn't. I
> > will definitely include you on v3.
> >
> > I hope you don't mind me adding wayland-devel@ to CC, your feedback is
> > much appreciated and backs up my design nicely. ;-)
> >
> Hi again,
> still not subscribed, but maybe the server accepts the e-mail to 
> wayland-devel anyway as i'm subscribed to other xorg lists? I got an 
> "invalid captcha" error when trying to subscribe, even though no 
> captcha was ever presented to me? You may need to cc wayland-devel again for me.

I guess the web interface is still down. Emailing
wayland-devel-subscribe at lists.freedesktop.org should do, if I recall the
address correctly. And someone might process the moderation queue, too.

No worries anyway, I won't cut any of your text out, so it's all
copied below.

> >> 1. Wrt. an additional "preroll_feedback" request
> >> <http://lists.freedesktop.org/archives/wayland-devel/2014-January/013014.html>,
> >> essentially the equivalent of glXGetSyncValuesOML(), that would be very
> >> valuable to us.
> >>
> ...
> >
> > Indeed, the "preroll_feedback" request was modeled to match
> > glXGetSyncValuesOML.
> >
> > Do you need to be able to call GetSyncValues at any time and have it
> > return ASAP? Do you call it continuously, and even between frames?
> >
> Yes, at any time, even between frames, with a return asap. This is 
> driven by the specific needs of user code, e.g., to poll for a vblank, 
> or to establish a baseline of current (msc, ust) for timing stuff 
> relative. Psychtoolbox is an extension to a scripting language, so 
> usercode often decides how this is used.
> Internally to the toolkit these calls are used on X11/GLX to translate 
> target timestamps into target vblank counts for glXSwapBuffersMscOML(), 
> because OML_sync_control is based on vblank counts, not absolute system 
> time, as you know, but ptb exposes an api where usercode specifies 
> target timestamps, like in your protocol proposal. The query is also 
> needed to work around some problems with the blocking nature of the X11 
> protocol when one tries to swap multiple independent windows with 
> different rates and uses glXWaitForSbcOML to wait for swap completion. 
> E.g., what doesn't work on X11 is using different x-display connections 
> - one per window - and creating GLX contexts which share resources across 
> those connections, so if you need multi-window operation you have to 
> create all GLX contexts on the same x-display connection and run all glx 
> calls over that connection. If you run multiple independent animations 
> in different windows you have to avoid blocking that x-connection, so i 
> use glXGetSyncValuesOML to poll current msc and ust to find out when it 
> is safe to do a blocking call.
> I hope that the way Wayland's protocol works will make some of these 
> hacks unnecessary, but user script code can call glXGetSyncValues at 
> any time.

Wayland rules out using multiple connections even more thoroughly,
because there simply are no sharable references to protocol objects. But
OTOH, Wayland only provides you a low-level direct async interface to
the protocol as a library, so you will be doing all the blocking in your
app or GUI-toolkit.

Sounds like we will need the "subscribe to streaming vblank events
interface" then. The recommended usage pattern would be to subscribe
only when needed, and unsubscribe ASAP.

There is one detail, though. The presentation timestamp is defined as
"turns to light", not vblank. If the compositor knows the monitor
latency, it will add that time to the presentation timestamp. To keep
things consistent, we'd need to define the subscription as a stream of
turned-to-light events rather than raw vblank events.
> > Or would it be enough for you to present a dummy frame, and just wait
> > for the presentation feedback as usual? Since you are asking, I guess
> > this is not enough.
> Wouldn't be enough in all use cases. Especially if i want to synchronize 
> other modalities like sound or digital i/o, it is nice to not add 
> overhead or complexity on the graphics side for things not related to 
> rendering.
> >
> > We could take it even further if you need to monitor the values
> > continuously. We could add a protocol interface, where you could
> > subscribe to an event, per-surface or maybe per-output, whenever the
> > vblank counter increases. If the compositor is not in a continuous
> > repaint loop, it could use a timer to approximately sample the values
> > (ask DRM), or whatever. The benefit would be that we would avoid a
> > roundtrip for each query, as the compositor is streaming the events to
> > your app. Do you need the values so often that this would be worth it?
> >
> For special applications this would be ok, as in those cases power 
> consumption is not an issue.
> I could think of cases where this could be useful to have per-output if 
> it is somewhat accurate. E.g., driving shutter glasses or similar 
> mechanisms in sync with refresh, or emitting digital triggers or network 
> packets at certain target vblank counts. Neuroscientists often have to 
> synchronize stimulation or recording hardware to visual stimulus 
> presentation or simply log certain events, e.g., start of refresh 
> cycles, so it is possible to synchronize the timing of different devices. I 
> hoped to use glXWaitForMscOML() for this in the past, but it didn't work 
> out because of the constraint that everything would have to go over one 
> x-display connection and that connection wasn't allowed to block for 
> more than fractions of a millisecond.

I really do hope something like driving shutter glasses would be done
by the kernel drivers, but I guess we're not there yet. Otherwise it'd
be the compositor, but we don't have Wayland protocol for it yet,
either. I have seen some discussion fly by about TV 3D modes.

> On drm/kms, queuing a vblank event would do the trick without the need 
> for Wayland to poll for vblanks.
> Of course i don't know how much use would there be for such 
> functionality outside of my slightly exotic application domain, except 
> maybe for "things to do with stereo shutter-glasses", like NVidia's 
> NVision stuff.
> >> 2. As far as the decision boundary for your presentation target
> >> timestamps. Fwiw, the only case i know of NV_present_video extension,
> >> and the way i expose it to user code in my toolkit software, is that a
> >> target timestamp means "present at that time or later, but never
> >> earlier" ie., at the closest vblank with tVblank >= tTargetTime.
> >>
> >> E.g., NV_present_video extension:
> >> https://www.opengl.org/registry/specs/NV/present_video.txt
> >>
> >> Iow. presentation will be delayed somewhere between 0 and 1 refresh
> >> duration wrt. the requested target time stamp tTargetTime, or on average
> >> half a refresh duration late if one assumes totally uniform distribution
> >> of target times. If i understand your protocol correctly, i would need
> >> to add half a refresh duration to achieve the same semantic, ie. never
> >> present earlier than requested. Not a big deal.
> >>
> >> In the end it doesn't matter to me, as long as the behavior is well
> >> specified in the protocol and consistent across all Wayland compositor
> >> implementations.
> >
> > Yes, you are correct, and we could define it either way. The
> > differences would arise when the refresh duration is not a constant.
> >
> > Wayland presentation extension (RFC) requires the compositor to be able
> > to predict the time of presentation when it is picking updates from the
> > queue. The implicit assumption is that the compositor either makes its
> > predicted presentation time or misses it, but never manages to present
> > earlier than the prediction.
> >
> Makes sense.
> > For dynamic refresh displays, I guess that means the compositor's
> > prediction must be optimistic: the earliest possible time, even if the
> > hardware usually takes longer.
> >
> > Does it sound reasonable to you?
> >
> Yes, makes sense to me. For my use case, presenting a frame too late is 
> ok'ish, as that is the expected behaviour - on time or delayed, but 
> never earlier than wanted. So an optimistic guess for dynamic refresh 
> would be the right thing to do when picking a frame for presentation 
> from the queue.
> But i don't have any experience with dynamic refresh displays, so i 
> don't know how they would behave in practice or where you'd find them? I 
> read about NVidia's G-Sync stuff lately and about AMD's Freesync 
> variant, but are there any other solutions available? I know that some 
> (Laptop?) panels have the option for a lower freq. self-refresh? How do 
> they behave when there's suddenly activity from the compositor?

You know roughly as much as I do. You even recalled the name
Freesync. :-)

I heard Chrome(OS?) is doing some sort of dynamic scanout rate
switching based on update rates or something, so that would count as
dynamic refresh too, IMO.

> > I wrote it this way, because I think the compositor would be in a
> > better position to guess the point of time for this screen update,
> > rather than clients working on old feedback.
> >
> > Ah, but the "not before" vs. "rounded to" is orthogonal to
> > predicting or not the time of the next screen update.
> >
> > I assume in your use cases the refresh duration will be required to be
> > constant for predictability, so it does not really matter whether we
> > pick "not before" or "rounding" semantics for the frame target time.
> >
> Yes. Then i can always tweak my target timestamps to get the behavior i 
> want. For your current proposal i'd add half a refresh duration to what 
> usercode provides me, because the advice for my users is to provide 
> target timestamps which are half a refresh before the target vblank for 
> maximum robustness. This way your "rounding" semantics would turn into 
> the "not before" semantics i need to expose to my users.

Exactly, and because you aim for constant refresh rate outputs, there
should be no problems. Your users depend on constant refresh rate
already, so that they actually can compute the timestamp according to
your advice.

> I can also always feed just single frames into your presentation queue 
> and wait for present_feedback for those single frames, so i can be sure 
> the "proper" frame was presented.

Making that work so that you can actually hit every scanout cycle with
a new image is something the compositor should implement, yes, but the
protocol does not guarantee it. I suspect it would work in practice,
though.

But, this might be a reason to have the "frame callback" be queueable.
We have just had lots of discussion about what to do with the frame
callback, what it means, and how it should work. It's not clear yet.

> Ideally i could select when queuing a frame, if the compositor will skip 
> the frame if it is late, that is, when frames with a later target 
> presentation time are already queued, which are closer to the predicted 
> presentation time. Your current design is optimized for video playback, 
> where you want to drop late frames to maintain audio-video sync. I have 
> both cases. For video playback and periodic/smooth animations i want to 
> drop frames which are late, but for some other purposes i need to be 
> sure that although no frame is ever presented too early, all frames are 
> presented in sequence without dropping any, even if that increases the 
> mean error of target time vs. real presentation time.
> Things like this stuff: 
> http://link.springer.com/article/10.3758/s13414-013-0605-z
> Such studies need to present sequences of potentially dozens of images 
> in rapid succession, without dropping a single frame. Duplicating a 
> frame is not great, but not as bad as dropping one, because certain 
> frames in such sequences may have a special meaning.
> This can also be achieved by my client just queuing one frame for 
> presentation and waiting for present feedback before queuing the next 
> one. I would lose the potential robustness advantage of queuing in the 
> server instead of one round trip away in my client. But that's how it's 
> done atm. with X11/GLX/DRI2 and on other operating systems because they 
> don't have presentation queues.

A flag for "do not skip" sounds simpler than I expected. I've heard
that QtQuick would want no-skip behaviour too, for whatever reason, and
so far I thought posting frames one by one rather than queueing in
advance would cover it, since I didn't understand the use case. You
gave another one.

> > But for dynamic refresh, can you think of reasons to prefer one or the
> > other?
> >
> > My reason for picking the "rounding" semantics is that an app can queue
> > many frames in advance, but it will only know the one current or
> > previous refresh duration. Therefore subtracting half of that from the
> > actually intended presentation times may become incorrect if the
> > duration changes in the future. When the compositor does the equivalent
> > extrapolation for each framebuffer flip it schedules, it has better
> > accuracy as it can use the most current refresh duration estimate for
> > each pick from the queue.
> >
> The "rounding" semantics has the problem that if the client doesn't want 
> it, as in my case, it needs to know the refresh duration to counteract 
> your rounding by shifting the timestamps. On a dynamic refresh display i 
> wouldn't know how to do that. But this is basically your thinking 
> backwards, for a different definition of robust.

Right. When you have to specify the time as an integer frame count, it
is hard to reason about when that actually is. The only possibility is
the "not before" definition, because you cannot subtract half a frame
from an integer counter, and the counter ticks when the buffer flips,
basically.

But with nanosecond timestamps we actually have a choice.

Since the definition is turns-to-light and not vblank, I still think
rounding is more appropriate: it is the closest we can get to honouring
a request to "show this at time t". But yes, I agree the wanted
behaviour may depend on the use case.

> As far as i'm concerned i'd tell my users that if they want it to work 
> well enough on a dynamic refresh display, they'd have to provide 
> different target timestamps to accommodate the different display 
> technology - in these cases ones that are optimized for your "rounding" 
> semantics.
> So i think your proposal is fine. My concern is mostly to not get any 
> surprises on regular common fixed refresh rate displays, so that user 
> code written many years ago doesn't suddenly behave in weird and 
> backwards incompatible ways, even on exactly the same hardware, just 
> because they run it on Wayland instead of X11 or on some other operating 
> system. Having to rewrite dozens of their scripts wouldn't be the 
> behavior my users would expect after upgrading to some future 
> distribution with Wayland as display server ;-)

Keeping an eye on such concerns is exactly where I need help. :-)

> >> 3. I'm not sure if this applies to Wayland or if it is already covered
> >> by some other part of the Wayland protocol, as i'm not yet really
> >> familiar with Wayland, so i'm just asking: As part of the present
> >> feedback it would be very valuable for my kind of applications to know
> >> how the present was done. Which backend was used? Was the presentation
> >> performed by a kms pageflip or something else, e.g., some partial buffer
> >> copy? The INTEL_swap_event extension for X11/GLX exposes this info to
> >> classic X11/GLX clients. The reason is that drm/kms-pageflips have very
> >> reliable and precise timestamps attached, so i know those timestamps are
> >> trustworthy for very timing sensitive applications like mine. If page
> >> flipping isn't used, but instead something like a (partial) framebuffer
> >> copy/blit, at least on drm the returned timestamps are then not
> >> reliable/trustworthy enough, as they could be off by up to 1 video
> >> refresh duration. My software, upon detecting anything but a
> >> page-flipped presentation, logs errors or even aborts a session, as
> >> wrong timestamps or presentation timestamping could be a disaster for
> >> those applications if it went unnoticed.
> >>
> >> So this kind of feedback about how the present was executed would be
> >> useful to us.
> >
> > How presentation was executed is not exposed in the Wayland protocol in
> > any direct nor reliable way.
> >
> > But rather than knowning how presentation was executed, it sounds like
> > you would really want to know how much you can trust the feedback,
> > right?
> >
> Yes. Which means i also need to know how much i can trust the feedback 
> about how much i can trust the feedback ;-).
> Therefore i prefer low level feedback about what is actually happening, 
> so i can look at the code myself and also sometimes perform testing to 
> find out, which presentation methods are trustworthy and which not. Then 
> have some white-lists or black-lists of what is ok and what is not.
> > In Weston's case, there are many ways the presentation can be executed.
> >
> > Weston-on-DRM, regardless of using the GL (hardware) or Pixman
> > (software) renderer, will possibly composite and always flip, AFAIU. So
> > this is both vsync'd and accurate by your standards.
> >
> That's good. This is the main use case, so knowing i use the drm backend 
> would basically mean everything's fine and trustworthy.
> Most of my users' applications run as unoccluded, undecorated fullscreen 
> windows, just filling one or multiple outputs, so they use page-flipping 
> on X11 as well. For windowed presentation i couldn't make any 
> guarantees, but if Wayland always composes a full framebuffer and flips 
> in such a case then timestamps should be trustworthy.


> > Weston-on-fbdev (only Pixman renderer) is not vsync'd, and the
> > presentation timing is based on reading a software clock after an
> > arbitrary delay. IOW, the feedback and presentation is rubbish.
> >
> > Weston-on-X11 is basically as useless as Weston-on-fbdev, in its current
> > implementation.
> >
> Probably both not relevant for me.
> > Weston-on-rpi (rpi-renderer) uses vsync'd flips to the best of our
> > knowledge, but can and will fall back to "firmware compositing" when
> > the scene is too complex for its "hardware direct scanout engine". We
> > do not know when it falls back. The presentation time is recorded by
> > reading a software clock (clock_gettime) in a thread that gets called
> > asynchronously by the userspace driver library on rpi. It should be
> > better than fbdev, but for your purposes still a black box of rubbish I
> > guess.
> >
> There were a few requests for Raspberry Pi, but its hardware is mostly 
> too weak for our purposes. So also not a big deal for my purpose.

Well, it depends. If you have only ever seen the X desktop on RPi, then
you have not seen what it can actually do.

DispmanX (the proprietary display API) can easily let you flip full-HD
buffers at full framerate, provided you have enough memory to preload
them. There is a surprising amount of power for the "size" of the
machine, but it's all behind proprietary interfaces.

FWIW, Weston on rpi uses DispmanX, and compared to X, it flies.

> > It seems like we could have several flags describing the aspects of
> > the presentation feedback or presentation itself:
> >
> > 1. vsync'd or not
> > 2. hardware or software clock, i.e. DRM/KMS ioctl reported time vs.
> >     compositor calls clock_gettime() as soon as it is notified the screen
> >     update is done (so maybe kernel vs. userspace clock rather?)
> > 3. did screen update completion event exist, or was it faked by an
> >     arbitrary timer
> > 4. flip vs. copy?
> >
> Yes, that would be sufficient for my purpose. I can always get the 
> OpenGL renderer/vendor string to do some basic checks for which kms 
> driver is in use, and your flags would give me all needed dynamic 
> information to judge reliability.

Except that the renderer info won't help you if the machine has more
than one GPU. It is quite possible to render on one GPU and scan out
on another.

> > Now, I don't know what implements the "buffer copy" update method you
> > referred to with X11, or what it actually means, because I can think of
> > two different cases:
> > a) copy from application buffer to the current framebuffer on screen,
> >     but timed so that it won't tear, or
> > b) copy from application buffer to an off-screen framebuffer, which is
> >     later flipped to screen.
> >
> > Case b) is the normal compositing case that Weston-on-DRM does in GL and
> > Pixman renderers.
> >
> Case a) was the typical case on classic X11 for swaps of non-fullscreen 
> windows. Case b) is fine. On X11 + old compositors there was no suitable 
> protocol in place. The Xserver/ddx would copy to some offscreen buffer 
> and the compositor would do something with that offscreen buffer, but 
> timestamps would refer to when the offscreen copy was scheduled to 
> happen, not when the final composition showed on the screen. Thereby the 
> timestamps were overly optimistic and useless for my purpose, only good 
> enough for some throttling.
> > Weston can also flip the application buffer directly on screen if the
> > buffer is suitable and the scene graph allows; a decision it does on a
> > frame-by-frame basis. However, this flip does not imply that *all*
> > compositing would have been skipped. Maybe there is a hardware overlay
> > that was just right for one wl_surface, but other things are still
> > composited, i.e. copied.
> Given that my app mostly renders to opaque fullscreen windows, i'd 
> expect Weston to probably just flip its buffer to the screen.

Indeed, if you use EGL. At the moment we have no other portable ways to
post suitable buffers.

> >
> > I'm not sure what we should be distinguishing here. No matter how
> > weston updates the screen, the flags 1-3 would still be applicable and
> > fully describe the accuracy of presentation and feedback, I think.
> > However, the copy case a) is something I don't see accounted for. So
> > would the flag 4 be actually "copy case a) or not"?
> >
> Yes, if Weston uses page-flipping in the end on drm/kms and the returned 
> timestamp is the page-flip completion timestamp from the kernel, then it 
> doesn't matter if copies were involved somewhere.
> > The problem is that I do not know why the X11 framebuffer copy case you
> > describe would have inaccurate timestamps. Do you know where it comes
> > from, and would it make sense to describe it as a flag?
> >
> That case luckily doesn't seem to exist on Weston + drm/kms, if i 
> understand you correctly - and the code in drm_output_repaint(), if i 
> look at the right bits?

You're right on Weston. It still does not prevent someone else writing
a compositor that does the just-in-time copy into a live framebuffer
trick, but if someone does that, I would expect them to give accurate
presentation feedback nevertheless. There is no excuse to give
anything else.

FYI, while Weston is the reference compositor, there is no "the
server". Every major DE is writing their own compositor-servers. Most
will run on DRM/KMS directly, and I think one has chosen to run only on
top of another Wayland compositor.

> Old compositors like, e.g., Compiz, would do things like for partial 
> screen updates only recompose the affected part of the backbuffer and 
> then use functions like glXCopySubBufferMESA() or glCopyPixels() to copy 
> the affected backbuffer area to the frontbuffer, maybe after waiting for 
> vblank before executing the command.
> Client calls glXSwapBuffer/glXSwapBufferMscOML -> server/ddx queues 
> vblank event in kernel -> kernel delivers vblank event to server/ddx at 
> target msc count -> ...random delay... -> ddx copies client window 
> backbuffer pixmap to offscreen surface and posts damage and returns the 
> vblank timestamp of the triggering vblank event as swap completion 
> timestamp to the client -> ...random delay... -> compositor does its 
> randomly time consuming thing which eventually leads to an update of the 
> display.

I regret I asked. ;-)

> So in any case the returned timestamp from the vblank event delivered by 
> the kernel was always an (overly) optimistic one, also because all the 
> required rendering and copy operations would be queued through the gpu 
> command stream where they might get delayed by other pending rendering 
> commands, or maybe the gpu waiting for the scanout being outside the 
> target area for a copy to avoid tearing.
> Without a compositor it worked the same for windowed windows, just that 
> the ddx queued some blit (i think via exa or uxa or maybe now sna on 
> intel iirc) into the gpu's command stream, with the same potential 
> delays due to queued commands in the command stream.
> Only if the compositor was unredirecting fullscreen windows, or no 
> compositor was there at all + fullscreen window, would the client get 
> kms page-flipping and the reliable page flip completion timestamps from 
> the kernel.
> > This system of flags was the first thing that came to my mind. Using
> > any numerical values like "the error in presentation timestamp is
> > always between [min, max]" seems very hard to implement in a
> > compositor, and does not even convey information like vsync or not.
> >
> Indeed. I would have trust issues with that.
> > I suppose a single flag "timestamp is accurate" does not cut it? :-)
> >
> > What do you think?
> >
> Yes, the flags sound good to me. As long as i know i'm on a drm/kms 
> backend they should be all i need.

That... is a problem. :-)

From your feedback so far, I think you have requested two additional
features:
- the ability to subscribe to a stream of vblank-like events
- a do-not-skip flag for queued updates

I will think about those.

