[RFC v2] Wayland presentation extension (video protocol)

Mon Feb 17 05:12:30 PST 2014

On Mon, 17 Feb 2014 01:25:07 +0100
Mario Kleiner <mario.kleiner.de at gmail.com> wrote:

> Hello Pekka,
> 
> i'm not yet subscribed to wayland-devel, and a bit short on time atm., 
> so i'll take a shortcut via direct e-mail for some quick feedback for 
> your Wayland presentation extension v2.

Hi Mario,

I'm very happy to hear from you! I have seen your work fly by on the
dri-devel@ mailing list (IIRC), and when I was writing the RFCv2 email,
I was wondering whether I should CC you personally. Sorry I didn't. I
will definitely include you on v3.

I hope you don't mind me adding wayland-devel@ to CC, your feedback is
much appreciated and backs up my design nicely. ;-)

> I'm the main developer of a FOSS toolkit called Psychtoolbox-3 
> (www.psychtoolbox.org) which is used a lot by neuro-scientists for 
> presentation of visual and auditory stimuli, mostly for basic and 
> clinical brain research. As you can imagine, very good audio-video sync 
> and precisely timed and time stamped visual stimulus presentation is 
> very important to us. We currently use X11 + GLX + OML_sync_control 
> extension to present OpenGL rendered frames precisely on the classic 
> X-Server + DRI2 + DRM/KMS. We need frame accurate presentation and 
> sub-millisecond accurate presentation timestamps. For this reason i was 
> quite a bit involved in development and testing of DRI2/DRM/KMS 
> functionality related to page-flipping and time stamping.
> 
> One of the next steps for me will be to implement backends into 
> Psychtoolbox for DRI3/Present and for Wayland, so i was very happy when 
> i stumbled over your RFC for the Wayland presentation extension. Core 
> Wayland protocol seems to only support millisecond accurate 32-Bit 
> timestamps, which are not good enough for many neuro-science 
> applications. So first off, thank you for your work on this!

Right, and those 32-bit timestamps are not even guaranteed to be
related to any particular output refresh cycle wrt. what you have
presented before.

> Anyway, i read the mailing list thread and had some look over the 
> proposed patches in your git branch, and this mostly looks very good to 
> me :). I haven't had time to test any of this, or so far to play even 
> with the basic Wayland/Weston, but i thought i give you some quick 
> feedback from the perspective of a future power-user who will certainly 
> use this extension heavily for very timing sensitive applications.

Excellent!

> 1. Wrt. an additional "preroll_feedback" request 
> <http://lists.freedesktop.org/archives/wayland-devel/2014-January/013014.html>, 
> essentially the equivalent of glXGetSyncValuesOML(), that would be very 
> valuable to us.
> 
> To answer the question you had on #dri-devel on this: DRM/KMS implements 
> an optional driver hook that, if implemented by a KMS driver, allows to 
> get the vblank count and timestamps at any time with almost microsecond 
> precision, even if vblank irq's were disabled for extended periods of 
> time, without the need to wait for the next vblank irq. This is 
> implemented on intel-kms and radeon-kms for all Intel/AMD/ATI gpu's 
> since around October 2010 or Linux 2.6.35 iirc. It will be supported for 
> nouveau / NVidia desktop starting with the upcoming Linux 3.14. I have 
> verified with external high precision measurement equipment that those 
> timestamps for vblank and kms-pageflip completion are accurate down to 
> better than 20 usecs on those drivers.
> 
> If a kms driver doesn't implement the hook, as is currently the case for 
> afaik all embedded gpu drivers, then the vblank count will be reported 
> instantaneously, but the vblank timestamp will be reported as zero until 
> the first vblank irq has happened if irq's were turned off.
> 
> For reference:
> 
> http://lxr.free-electrons.com/source/drivers/gpu/drm/drm_irq.c#L883
> 

Indeed, the "preroll_feedback" request was modeled to match
glXGetSyncValuesOML.

Do you need to be able to call GetSyncValues at any time and have it
return ASAP? Do you call it continuously, and even between frames?

Or would it be enough for you to present a dummy frame, and just wait
for the presentation feedback as usual? Since you are asking, I guess
this is not enough.

We could take it even further if you need to monitor the values
continuously. We could add a protocol interface, where you could
subscribe to an event, per-surface or maybe per-output, whenever the
vblank counter increases. If the compositor is not in a continuous
repaint loop, it could use a timer to approximately sample the values
(ask DRM), or whatever. The benefit would be that we would avoid a
roundtrip for each query, as the compositor is streaming the events to
your app. Do you need the values so often that this would be worth it?

For special applications this would be ok, as in those cases power
consumption is not an issue.

> 2. As far as the decision boundary for your presentation target 
> timestamps. Fwiw, the only case i know of NV_present_video extension, 
> and the way i expose it to user code in my toolkit software, is that a 
> target timestamp means "present at that time or later, but never 
> earlier" ie., at the closest vblank with tVblank >= tTargetTime.
> 
> E.g., NV_present_video extension:
> https://www.opengl.org/registry/specs/NV/present_video.txt
> 
> Iow. presentation will be delayed somewhere between 0 and 1 refresh 
> duration wrt. the requested target time stamp tTargetTime, or on average 
> half a refresh duration late if one assumes totally uniform distribution 
> of target times. If i understand your protocol correctly, i would need 
> to add half a refresh duration to achieve the same semantic, ie. never 
> present earlier than requested. Not a big deal.
> 
> In the end it doesn't matter to me, as long as the behavior is well 
> specified in the protocol and consistent across all Wayland compositor 
> implementations.

Yes, you are correct, and we could define it either way. The
differences would arise when the refresh duration is not a constant.

Wayland presentation extension (RFC) requires the compositor to be able
to predict the time of presentation when it is picking updates from the
queue. The implicit assumption is that the compositor either makes its
predicted presentation time or misses it, but never manages to present
earlier than the prediction.

For dynamic refresh displays, I guess that means the compositor's
prediction must be optimistic: the earliest possible time, even if the
hardware usually takes longer.

Does it sound reasonable to you?

I wrote it this way, because I think the compositor would be in a
better position to guess the point of time for this screen update,
rather than clients working on old feedback.

Ah, but the choice between "not before" and "rounded to" is orthogonal
to whether the compositor predicts the time of the next screen update.

I assume in your use cases the refresh duration will be required to be
constant for predictability, so it does not really matter whether we
pick "not before" or "rounding" semantics for the frame target time.

But for dynamic refresh, can you think of reasons to prefer one or the
other?

My reason for picking the "rounding" semantics is that an app can queue
many frames in advance, but it will only know the one current or
previous refresh duration. Therefore subtracting half of that from the
actually intended presentation times may become incorrect if the
duration changes in the future. When the compositor does the equivalent
extrapolation for each framebuffer flip it schedules, it has better
accuracy as it can use the most current refresh duration estimate for
each pick from the queue.
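
To make the difference between the two semantics concrete, here is a
minimal sketch. It assumes a constant refresh duration and integer
nanosecond timestamps; the function names are made up for illustration
and are not part of the RFC or of any compositor.

```python
def pick_vblank_not_before(target_ns, last_vblank_ns, refresh_ns):
    """'Not before' semantics: first future vblank with time >= target."""
    delta = target_ns - last_vblank_ns
    n = -(-delta // refresh_ns)   # integer ceiling division
    n = max(n, 1)                 # cannot present earlier than the next vblank
    return last_vblank_ns + n * refresh_ns


def pick_vblank_rounded(target_ns, last_vblank_ns, refresh_ns):
    """'Rounding' semantics: future vblank closest to the target time."""
    delta = target_ns - last_vblank_ns
    n = (delta + refresh_ns // 2) // refresh_ns   # round to nearest
    n = max(n, 1)
    return last_vblank_ns + n * refresh_ns


REFRESH_NS = 10_000_000  # hypothetical 100 Hz output, for easy numbers

# Target at 23 ms, last vblank at 0 ms:
late = pick_vblank_not_before(23_000_000, 0, REFRESH_NS)   # vblank at 30 ms
near = pick_vblank_rounded(23_000_000, 0, REFRESH_NS)      # vblank at 20 ms

# A client that wants "not before" from a rounding compositor shifts its
# target by half a refresh duration, which only works as long as the
# duration it assumed is still the one in effect when the frame is picked.
shifted = pick_vblank_rounded(23_000_000 + REFRESH_NS // 2, 0, REFRESH_NS)
```

With a constant refresh duration the shifted target lands on the same
vblank as the "not before" pick (except for targets exactly on a
vblank). With a changing duration, the half-period the client used may
already be stale.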

> 3. I'm not sure if this applies to Wayland or if it is already covered 
> by some other part of the Wayland protocol, as i'm not yet really 
> familiar with Wayland, so i'm just asking: As part of the present 
> feedback it would be very valuable for my kind of applications to know 
> how the present was done. Which backend was used? Was the presentation 
> performed by a kms pageflip or something else, e.g., some partial buffer 
> copy? The INTEL_swap_event extension for X11/GLX exposes this info to 
> classic X11/GLX clients. The reason is that drm/kms-pageflips have very 
> reliable and precise timestamps attached, so i know those timestamps are 
> trustworthy for very timing sensitive applications like mine. If page 
> flipping isn't used, but instead something like a (partial) framebuffer 
> copy/blit, at least on drm the returned timestamps are then not 
> reliable/trustworthy enough, as they could be off by up to 1 video 
> refresh duration. My software, upon detecting anything but a 
> page-flipped presentation, logs errors or even aborts a session, as 
> wrong timestamps or presentation timestamping could be a disaster for 
> those applications if it went unnoticed.
> 
> So this kind of feedback about how the present was executed would be 
> useful to us.

How presentation was executed is not exposed in the Wayland protocol in
any direct or reliable way.

But rather than knowing how presentation was executed, it sounds like
you would really want to know how much you can trust the feedback,
right?

In Weston's case, there are many ways the presentation can be executed.

Weston-on-DRM, regardless of using the GL (hardware) or Pixman
(software) renderer, will possibly composite and always flip, AFAIU. So
this is both vsync'd and accurate by your standards.

Weston-on-fbdev (only Pixman renderer) is not vsync'd, and the
presentation timing is based on reading a software clock after an
arbitrary delay. IOW, the feedback and presentation are rubbish.

Weston-on-X11 is basically as useless as Weston-on-fbdev, in its current
implementation.

Weston-on-rpi (rpi-renderer) uses vsync'd flips to the best of our
knowledge, but can and will fall back to "firmware compositing" when
the scene is too complex for its "hardware direct scanout engine". We
do not know when it falls back. The presentation time is recorded by
reading a software clock (clock_gettime) in a thread that gets called
asynchronously by the userspace driver library on rpi. It should be
better than fbdev, but for your purposes still a black box of rubbish I
guess.

It seems like we could have several flags describing the aspects of
the presentation feedback or presentation itself:

1. vsync'd or not
2. hardware or software clock, i.e. DRM/KMS ioctl reported time vs.
   compositor calls clock_gettime() as soon as it is notified the screen
   update is done (so maybe kernel vs. userspace clock rather?)
3. did screen update completion event exist, or was it faked by an
   arbitrary timer
4. flip vs. copy?
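
Just to illustrate the idea, the four aspects could be modelled as a
bitmask along these lines. This is only a sketch: the flag names, the
values, and the "trustworthy" rule are hypothetical and not part of the
RFC, and the two example backends are my reading of the descriptions
above.

```python
from enum import IntFlag


class FeedbackFlags(IntFlag):
    """Hypothetical flags describing one presentation feedback."""
    VSYNC = 0x1             # 1. the update was synchronized to vblank
    KERNEL_CLOCK = 0x2      # 2. timestamp from DRM/KMS, not clock_gettime()
    COMPLETION_EVENT = 0x4  # 3. a real completion event, not a faked timer
    FLIP = 0x8              # 4. flip rather than copy


# Flags 1-3 are the ones that describe timestamp accuracy in this sketch.
ACCURATE = (FeedbackFlags.VSYNC
            | FeedbackFlags.KERNEL_CLOCK
            | FeedbackFlags.COMPLETION_EVENT)


def is_trustworthy(flags):
    """True if all accuracy-related flags are set."""
    return (flags & ACCURATE) == ACCURATE


# Illustrative guesses only:
DRM_LIKE = ACCURATE | FeedbackFlags.FLIP   # vsync'd flip, kernel timestamp
FBDEV_LIKE = FeedbackFlags(0)              # no vsync, software clock, timer
```

A timing-sensitive client could then check is_trustworthy() on every
feedback and log an error or abort when it returns False.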

Now, I don't know what implements the "buffer copy" update method you
referred to with X11, or what it actually means, because I can think of
two different cases:
a) copy from application buffer to the current framebuffer on screen,
   but timed so that it won't tear, or
b) copy from application buffer to an off-screen framebuffer, which is
   later flipped to screen.

Case b) is the normal compositing case that Weston-on-DRM does in GL and
Pixman renderers.

Weston can also flip the application buffer directly on screen if the
buffer is suitable and the scene graph allows; a decision it makes on a
frame-by-frame basis. However, this flip does not imply that *all*
compositing would have been skipped. Maybe there is a hardware overlay
that was just right for one wl_surface, but other things are still
composited, i.e. copied.

I'm not sure what we should be distinguishing here. No matter how
Weston updates the screen, flags 1-3 would still be applicable and
fully describe the accuracy of presentation and feedback, I think.
However, the copy case a) is something I don't see accounted for. So
would flag 4 actually be "copy case a) or not"?

The problem is that I do not know why the X11 framebuffer copy case you
describe would have inaccurate timestamps. Do you know where it comes
from, and would it make sense to describe it as a flag?

This system of flags was the first thing that came to my mind. Using
any numerical values like "the error in presentation timestamp is
always between [min, max]" seems very hard to implement in a
compositor, and does not even convey information like vsync or not.

I suppose a single flag "timestamp is accurate" does not cut it? :-)

What do you think?

Thanks,
pq