[RFC v2] Wayland presentation extension (video protocol)

Thu Feb 20 21:40:02 PST 2014

On 20/02/14 12:07, Pekka Paalanen wrote:
> Hi Mario,
>

Ok, now i am magically subscribed. Thanks to the moderator!

> I have replies to your comments below, but while reading what you said,
> I started wondering whether Wayland would be good for you after all.
>
> It seems that your timing sensitive experiment programs and you and
> your developers put a great deal of effort into
> - detecting the hardware and drivers,
> - determining how the display server works, so that you can
> - try to make it do exactly what you want, and
> - detect if it still does not do exactly like you want it and bail,
>    while also
> - trying to make sure you get the right timing feedback from the kernel
>    unmangled.
>

Yes. It's "trust but verify". If i know that the api / protocol is well 
defined and suitable for my purpose and have verified that at least the 
reference compositor implements the protocol correctly then i can at 
least hope that all other compositors are also implemented correctly, so 
stuff should work as expected. And i can verify that at least some 
subset of compositors really works, and try to submit bug reports or 
patches if they don't.

> Sounds like the display server is a huge source of problems to you, but
> I am not quite sure how running on top of a display server benefits you.
> Your experiment programs want to be in precise control, get accurate
> timings, and they are always fullscreen. Your users / test subjects
> never switch away from the program while it's running, you don't need
> windowing or multi-tasking, AFAIU, nor any of the application
> interoperability features that are the primary features of a display
> server.
>

They are fullscreen and timing sensitive in probably 95% of all typical 
application cases during actual "production use" while experiments are 
run. But some applications need the toolkit to present in regular 
windows and GUI thingys, a few even need compositing to combine my 
windows with windows of other apps. Some setups run multi-display, where 
some displays are used for fullscreen stimulus presentation to the 
tested person, but another display may be used for control/feedback or 
during debugging by the experimenter, in which case the regular desktop 
GUI and UI of the scripting environment is needed on that display. One 
popular case during debugging is having a half-transparent fullscreen 
window for stimulus presentation, but behind that window the whole 
regular GUI with the code editor and debugger in the background, so one 
can set breakpoints etc. - The window is made transparent for mouse and 
keyboard input, so users can interact with the editor.

So in most cases i need a display server running, because i sometimes 
need compositing and i often need a fully functional GUI during the 
at least 50% of the work time where users are debugging and testing 
their code and also don't want to be separated from their e-mail clients 
and web browsers etc. during that time.

> Why not take the display server completely out of the equation?
>
> I understand that some years ago, it would probably not have been
> feasible and X11 was the de facto interface to do any graphics.
>
> However, it seems you are already married to DRM/KMS so that you get
> accurate timing feedback, so why not port your experiment programs
> (the framework) directly on top of DRM/KMS instead of Wayland?
>

Yes and no. DRM/KMS will be the most often used one and is the best bet 
if i need timing control and it's the one i'm most familiar with. I also 
want to keep the option of running on other backends if timing is not of 
much importance, or if it can be improved on them, should the need arise.

> With Mesa EGL and GBM, you can still use hardware accelerated openGL if
> you want to, but you will also be in explicit control of when you
> push that rendered buffer into KMS for display. Software rendering by
> direct pixel poking is also possible and at the end you just push that
> buffer to KMS as usual too. You do not need any graphics card specific
> code, it is all abstracted in the public APIs offered by Mesa and
> libdrm, e.g. GBM. The new libinput should make hooking into input
> devices much less painful, etc. All this thanks to Wayland, because on
> Wayland, there is no single "the server" like the X.org X server is.
> There will be lots of servers and each one needs the same
> infrastructure you would need to run without a display server.
>
> No display server obfuscating your view to the hardware, no
> compositing manager fiddling with your presentation, and most likely no
> random programs hogging the GPU at random times. Would the trade-off
> not be worth it?
>

I thought about EGL/GBM etc. as a last resort for especially demanding 
cases, timing-wise. But given that the good old X-Server was good enough 
for almost everything so far, i'd expect Wayland to perform as well as 
or better timing-wise. If that turns out to be true, it would be good 
enough for hopefully almost all use cases, with all the benefits of 
compositing and GUI support when needed, and not having to reimplement 
my own display server. E.g., i'm also using GStreamer as media backend 
(still 0.10 though) and while there is Wayland integration, i doubt 
there will be Psychtoolbox integration. Or things like Optimus-style 
hybrid graphics laptops with one gpu rendering, the other gpu 
displaying. I assume Wayland will tackle this stuff at some point, 
whereas i'm not too keen to potentially learn and tackle all the ins and 
outs of rendernodes or dmabuf juggling.

> Of course your GUI tools and apps could continue using a display server
> and would probably like to be ported to be Wayland compliant, I'm just
> suggesting this for the sensitive experiment programs. Would this be
> possible for your infrastructure?
>

Not impossible when absolutely needed, just inconvenient for the user + 
yet another display backend to maintain and test for me. Psychtoolbox is 
a set of extensions for both GNU Octave and Mathworks Matlab - both use 
the same scripting language, so users can choose which to use and have 
portable code between them. Matlab uses a GUI based on Java/AWT/Swing on 
top of X11, whereas Octave just gained a Qt-based GUI beginning this 
year. You can run both apps in a terminal or console, but my 
users are mostly psychologists/neuro-biologists/physicians, and most of 
them only have basic programming skills and are somewhat frightened by 
command line environments. Most would touch a non-GUI environment only 
in moments of highest despair, let alone learn how to switch between a 
GUI environment and a console.

>
> On Thu, 20 Feb 2014 04:56:02 +0100
> Mario Kleiner <mario.kleiner.de at gmail.com> wrote:
>
>> On 17/02/14 14:12, Pekka Paalanen wrote:
>>> On Mon, 17 Feb 2014 01:25:07 +0100
>>> Mario Kleiner <mario.kleiner.de at gmail.com> wrote:
>>>
>>>> Hello Pekka,
>>>>
>>>> i'm not yet subscribed to wayland-devel, and a bit short on time atm.,
>>>> so i'll take a shortcut via direct e-mail for some quick feedback for
>>>> your Wayland presentation extension v2.
>>>
>>> Hi Mario,
>>>
>>> I'm very happy to hear from you! I have seen your work fly by on
>>> dri-devel@ (IIRC) mailing list, and when I was writing the RFCv2 email,
>>> I was thinking whether I should personally CC you. Sorry I didn't. I
>>> will definitely include you on v3.
>>>
>>> I hope you don't mind me adding wayland-devel@ to CC, your feedback is
>>> much appreciated and backs up my design nicely. ;-)
>>>
>>
>> Hi again,
>>
>> still not subscribed, but maybe the server accepts the e-mail to
>> wayland-devel anyway as i'm subscribed to other xorg lists? I got an
>> "invalid captcha" error when trying to subscribe, although no captcha was
>> ever presented to me? You may need to cc wayland-devel again for me.
>
> I guess the web interface is still down. Emailing
> wayland-devel-subscribe at lists.freedesktop.org should do, if I recall the
> address correctly. And someone might process the moderation queue, too.
>
> No worries anyway, I won't cut anything from you out, so it's all
> copied below.
>
>>>> 1. Wrt. an additional "preroll_feedback" request
>>>> <http://lists.freedesktop.org/archives/wayland-devel/2014-January/013014.html>,
>>>> essentially the equivalent of glXGetSyncValuesOML(), that would be very
>>>> valuable to us.
>>>>
>> ...
>>>
>>> Indeed, the "preroll_feedback" request was modeled to match
>>> glXGetSyncValuesOML.
>>>
>>> Do you need to be able to call GetSyncValues at any time and have it
>>> return ASAP? Do you call it continuously, and even between frames?
>>>
>>
>> Yes, at any time, even between frames, with a return asap. This is
>> driven by the specific needs of user code, e.g., to poll for a vblank,
>> or to establish a baseline of current (msc, ust) for relative timing.
>> Psychtoolbox is an extension to a scripting language, so
>> usercode often decides how this is used.
>>
>> Internally to the toolkit these calls are used on X11/GLX to translate
>> target timestamps into target vblank counts for glXSwapBuffersMscOML(),
>> because OML_sync_control is based on vblank counts, not absolute system
>> time, as you know, but ptb exposes an api where usercode specifies
>> target timestamps, like in your protocol proposal. The query is also
>> needed to work around some problems with the blocking nature of the X11
>> protocol when one tries to swap multiple independent windows with
>> different rates and uses glXWaitForSbcOML to wait for swap completion.
>> E.g., what doesn't work on X11 is using different x-display connections
>> - one per window - and create GLX contexts which share resources across
>> those connections, so if you need multi-window operation you have to
>> create all GLX contexts on the same x-display connection and run all glx
>> calls over that connection. If you run multiple independent animations
>> in different windows you have to avoid blocking that x-connection, so i
>> use glXGetSyncValuesOML to poll current msc and ust to find out when it
>> is safe to do a blocking call.
>>
>> I hope that the way Wayland's protocol works will make some of these
>> hacks unnecessary, but user script code can call glXGetSyncValues at
>> any time.
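
The translation from a target timestamp to a target vblank count described above can be sketched as follows. This is a minimal illustration only, not the toolkit's actual code: the function name is hypothetical, UST is assumed to be in microseconds, the refresh duration is assumed constant, and the (msc, ust) baseline is assumed to come from a query such as glXGetSyncValuesOML.

```python
def target_msc(target_ust_us, base_msc, base_ust_us, refresh_us):
    """Return the first vblank count whose start time is >= target_ust_us.

    (base_msc, base_ust_us) is a baseline pair: vblank number base_msc
    started at time base_ust_us. All names here are hypothetical.
    """
    delta = target_ust_us - base_ust_us
    # Ceiling division: number of whole refresh cycles needed to reach
    # or pass the target time ("never earlier than requested").
    cycles = -(-delta // refresh_us)
    # The baseline vblank has already happened, so the earliest vblank
    # that can still be targeted is the one after it.
    return base_msc + max(cycles, 1)


# Example: 100 Hz display, baseline vblank 1000 started at t = 5.000000 s.
print(target_msc(5_025_000, 1000, 5_000_000, 10_000))
```

The result would then be passed as the target count to the swap call; a target in the past simply maps to the next vblank after the baseline.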
>
> Wayland rules out using different connections even more firmly,
> because there simply are no sharable references to protocol objects. But
> OTOH, Wayland only provides you a low-level direct async interface to
> the protocol as a library, so you will be doing all the blocking in your
> app or GUI-toolkit.
>

Yes. If i understood correctly, with Wayland, rendering is separated 
from the presentation, all rendering and buffer management client-side, 
only buffer presentation server-side. So i hope i'll be able to untangle 
everything rendering related (like how many OpenGL contexts to have and 
what resources they should or should not share) from everything 
presentation related. If the interface is async and has thread-safety in 
mind, that should hopefully help a lot on the presentation side.

The X11 protocol and at least Xlib are not async and not thread-safe by 
default, and at least under DRI2 the x-server controlled and owned the 
back- and front-buffers, so you have lots of roundtrips and blocking 
behavior on the single connection to make sure no backbuffer is touched 
too early (while a swap is still pending), and many hacks to get around 
that, and some race conditions in DRI2 around drawable invalidation.

So i'd be really surprised if Wayland wouldn't be an improvement for me 
for multi-threaded or multi-window operations.

> Sounds like we will need the "subscribe to streaming vblank events
> interface" then. The recommended usage pattern would be to subscribe
> only when needed, and unsubscribe ASAP.
>
> There is one detail, though. Presentation timestamp is defined as "turns
> to light", not vblank. If the compositor knows about monitor latency, it
> will add this time to the presentation timestamp. To keep things
> consistent, we'd need to define it as a stream of turned-to-light
> events.
>

Yes, makes sense. The drm/kms timestamps are defined as "first pixel of 
frame leaves the graphics cards output connector" aka start of active 
scanout. In the time of CRT monitors, that was ~ "turns to light". In my 
field CRT monitors are still very popular and actively hunted for 
because of that very well defined timing behaviour. Or very expensive 
displays which have custom made panel or dlp controllers with well 
defined timing specifically for research use.

If the compositor knew the precise monitor latency it could add that as 
a constant offset to those timestamps. Do you know of reliable ways to 
get this info from any common commercial display equipment? Apple's OSX 
has API in their CoreVideo framework for getting that number, and i 
implement it in the OSX backend of my toolkit, but i haven't ever seen 
that function returning anything other than "undefined" from any display?

>>> Or would it be enough for you to present a dummy frame, and just wait
>>> for the presentation feedback as usual? Since you are asking, I guess
>>> this is not enough.
>>
>> Wouldn't be enough in all use cases. Especially if i want to synchronize
>> other modalities like sound or digital i/o, it is nice to not add
>> overhead or complexity on the graphics side for things not related to
>> rendering.
>>
>>>
>>> We could take it even further if you need to monitor the values
>>> continuously. We could add a protocol interface, where you could
>>> subscribe to an event, per-surface or maybe per-output, whenever the
>>> vblank counter increases. If the compositor is not in a continuous
>>> repaint loop, it could use a timer to approximately sample the values
>>> (ask DRM), or whatever. The benefit would be that we would avoid a
>>> roundtrip for each query, as the compositor is streaming the events to
>>> your app. Do you need the values so often that this would be worth it?
>>>
>>> For special applications this would be ok, as in those cases power
>>> consumption is not an issue.
>>
>> I could think of cases where this could be useful to have per-output if
>> it is somewhat accurate. E.g., driving shutter glasses or similar
>> mechanisms in sync with refresh, or emitting digital triggers or network
>> packets at certain target vblank counts. Neuro-Scientists often have to
>> synchronize stimulation or recording hardware to visual stimulus
>> presentation or simply log certain events, e.g., start of refresh
>> cycles, so it is possible to synchronize the timing of different devices. I
>> hoped to use glXWaitForMscOML() for this in the past, but it didn't work
>> out because of the constraint that everything would have to go over one
>> x-display connection and that connection wasn't allowed to block for
>> more than fractions of a millisecond.
>
> I really do hope something like driving shutter glasses would be done
> by the kernel drivers, but I guess we're not there yet. Otherwise it'd
> be the compositor, but we don't have Wayland protocol for it yet,
> either. I have seen some discussion fly by about TV 3D modes.
>

I'd hope that as well, long-term. Until then i have to do hacks in my 
client app...

>> On drm/kms, queuing a vblank event would do the trick without the need
>> for Wayland to poll for vblanks.
>>
>> Of course i don't know how much use would there be for such
>> functionality outside of my slightly exotic application domain, except
>> maybe for "things to do with stereo shutter-glasses", like NVidia's
>> NVision stuff.
>>
>>>> 2. As far as the decision boundary for your presentation target
>>>> timestamps. Fwiw, the only case i know of NV_present_video extension,
>>>> and the way i expose it to user code in my toolkit software, is that a
>>>> target timestamp means "present at that time or later, but never
>>>> earlier" ie., at the closest vblank with tVblank >= tTargetTime.
>>>>
>>>> E.g., NV_present_video extension:
>>>> https://www.opengl.org/registry/specs/NV/present_video.txt
>>>>
>>>> Iow. presentation will be delayed somewhere between 0 and 1 refresh
>>>> duration wrt. the requested target time stamp tTargetTime, or on average
>>>> half a refresh duration late if one assumes totally uniform distribution
>>>> of target times. If i understand your protocol correctly, i would need
>>>> to add half a refresh duration to achieve the same semantic, ie. never
>>>> present earlier than requested. Not a big deal.
>>>>
>>>> In the end it doesn't matter to me, as long as the behavior is well
>>>> specified in the protocol and consistent across all Wayland compositor
>>>> implementations.
>>>
>>> Yes, you are correct, and we could define it either way. The
>>> differences would arise when the refresh duration is not a constant.
>>>
>>> Wayland presentation extension (RFC) requires the compositor to be able
>>> to predict the time of presentation when it is picking updates from the
>>> queue. The implicit assumption is that the compositor either makes its
>>> predicted presentation time or misses it, but never manages to present
>>> earlier than the prediction.
>>>
>>
>> Makes sense.
>>
>>> For dynamic refresh displays, I guess that means the compositor's
>>> prediction must be optimistic: the earliest possible time, even if the
>>> hardware usually takes longer.
>>>
>>> Does it sound reasonable to you?
>>>
>>
>> Yes, makes sense to me. For my use case, presenting a frame too late is
>> ok'ish, as that is the expected behaviour - on time or delayed, but
>> never earlier than wanted. So an optimistic guess for dynamic refresh
>> would be the right thing to do when picking a frame for presentation
>> from the queue.
>>
>> But i don't have any experience with dynamic refresh displays, so i
>> don't know how they would behave in practice or where you'd find them? I
>> read about NVidia's G-Sync stuff lately and about AMD's Freesync
>> variant, but are there any other solutions available? I know that some
>> (Laptop?) panels have the option for a lower freq. self-refresh? How do
>> they behave when there's suddenly activity from the compositor?
>
> You know roughly as much as I do. You even recalled the name
> Freesync. :-)
>
> I heard Chrome(OS?) is doing some sort of dynamic scanout rate
> switching based on update rates or something, so that would count as
> dynamic refresh too, IMO.
>
>>> I wrote it this way, because I think the compositor would be in a
>>> better position to guess the point of time for this screen update,
>>> rather than clients working on old feedback.
>>>
>>> Ah, but the "not before" vs. "rounded to" is orthogonal to
>>> predicting or not the time of the next screen update.
>>>
>>> I assume in your use cases the refresh duration will be required to be
>>> constant for predictability, so it does not really matter whether we
>>> pick "not before" or "rounding" semantics for the frame target time.
>>>
>>
>> Yes. Then i can always tweak my target timestamps to get the behavior i
>> want. For your current proposal i'd add half a refresh duration to what
>> usercode provides me, because the advice for my users is to provide
>> target timestamps which are half a refresh before the target vblank for
>> maximum robustness. This way your "rounding" semantics would turn into
>> the "not before" semantics i need to expose to my users.
>
> Exactly, and because you aim for constant refresh rate outputs, there
> should be no problems. Your users depend on constant refresh rate
> already, so that they actually can compute the timestamp according to
> your advice.
>
>> I can also always feed just single frames into your presentation queue
>> and wait for present_feedback for those single frames, so i can be sure
>> the "proper" frame was presented.
>
> Making that work so that you can actually hit every scanout cycle with
> a new image is something the compositor should implement, yes, but the
> protocol does not guarantee it. I suspect it would work in practise
> though.
>

It works well enough on the X-Server, so i'd expect it to work on 
Wayland as well. This link...

<https://github.com/Psychtoolbox-3/Psychtoolbox-3/blob/master/Psychtoolbox/PsychDocumentation/ECVP2010Poster_VisualTimingPrecision.pdf?raw=true>

...points to some results of tests i did a couple of years ago. Quite 
outdated by now, but Linux + X11 came out as more reliable wrt. timing 
precision than Windows and OSX, especially when realtime scheduling or 
even a realtime kernel was used, with both proprietary graphics drivers 
and the open-source drivers. I was very pleased with that :) - The other 
thing i learned there is how much dynamic power management on the gpu 
can bite you if the rendering/presentation behavior of your app doesn't 
match the expectations of the algorithms used to control up/downclocking 
on the gpu. That would be another topic related to a presentation 
extension btw., having some sort of hints to the compositor about what 
scheduling priority to choose, or if gpu power management should be 
somehow affected by the timing needs of clients...

> But, this might be a reason to have the "frame callback" be queueable.
> We have just had lots of discussion about what do we do with the frame
> callback, what it means, and how it should work. It's not clear yet.
>
>> Ideally i could select when queuing a frame, if the compositor will skip
>> the frame if it is late, that is, when frames with a later target
>> presentation time are already queued, which are closer to the predicted
>> presentation time. Your current design is optimized for video playback,
>> where you want to drop late frames to maintain audio-video sync. I have
>> both cases. For video playback and periodic/smooth animations i want to
>> drop frames which are late, but for some other purposes i need to be
>> sure that although no frame is ever presented too early, all frames are
>> presented in sequence without dropping any, even if that increases the
>> mean error of target time vs. real presentation time.
>>
>> Things like this stuff:
>> http://link.springer.com/article/10.3758/s13414-013-0605-z
>>
>> Such studies need to present sequences of potentially dozens of images
>> in rapid succession, without dropping a single frame. Duplicating a
>> frame is not great, but not as bad as dropping one, because certain
>> frames in such sequences may have a special meaning.
>>
>> This can also be achieved by my client just queuing one frame for
>> presentation and waiting for present feedback before queuing the next
>> one. I would lose the potential robustness advantage of queuing in the
>> server instead of one round trip away in my client. But that's how it's
>> done atm. with X11/GLX/DRI2 and on other operating systems because they
>> don't have presentation queues.
>
> A flag for "do not skip", that sounds simpler than I thought. I've
> heard that QtQuick would want no-skip for whatever reason, too, and so
> far thought posting one by one rather than queueing in advance would do
> it, since I didn't understand the use case. You gave another one.
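
The difference between skipping late frames and a "do not skip" flag can be illustrated with a small queue model. This is only a sketch under assumptions: `Frame`, `no_skip` and `pick_frame` are hypothetical names, the queue is assumed sorted by target time, and a frame is treated as due when its target is no later than half a refresh after the predicted presentation time (the "rounding" rule). It is not the compositor's actual algorithm.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    name: str
    target_ns: int
    no_skip: bool = False  # hypothetical per-frame "do not skip" flag


def pick_frame(queue, predicted_ns, half_refresh_ns):
    """Remove and return the frame to show at predicted_ns, or None."""
    # Count the frames at the head of the queue that are already due.
    due = 0
    while due < len(queue) and queue[due].target_ns <= predicted_ns + half_refresh_ns:
        due += 1
    if due == 0:
        return None
    # Default: show the latest due frame and drop the older ones.
    chosen = due - 1
    # A no-skip frame stops the skipping: it is shown, even if late.
    for i in range(due):
        if queue[i].no_skip:
            chosen = i
            break
    frame = queue[chosen]
    del queue[:chosen + 1]
    return frame


ms = 1_000_000
q = [Frame("A", 0), Frame("B", 10 * ms, no_skip=True), Frame("C", 20 * ms)]
print(pick_frame(q, 20 * ms, 5 * ms).name)  # B is shown late, A is dropped
print(pick_frame(q, 30 * ms, 5 * ms).name)  # C follows one refresh later
```

With the flag unset on all frames the same call would return C immediately and drop both A and B, which is the video-playback behaviour.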
>
>>> But for dynamic refresh, can you think of reasons to prefer one or the
>>> other?
>>>
>>> My reason for picking the "rounding" semantics is that an app can queue
>>> many frames in advance, but it will only know the one current or
>>> previous refresh duration. Therefore subtracting half of that from the
>>> actually intended presentation times may become incorrect if the
>>> duration changes in the future. When the compositor does the equivalent
>>> extrapolation for each framebuffer flip it schedules, it has better
>>> accuracy as it can use the most current refresh duration estimate for
>>> each pick from the queue.
>>>
>>
>> The "rounding" semantics has the problem that if the client doesn't want
>> it, as in my case, it needs to know the refresh duration to counteract
>> your rounding by shifting the timestamps. On a dynamic refresh display i
>> wouldn't know how to do that. But this is basically your thinking
>> backwards, for a different definition of robust.
>
> Right. When you have to set time based on an integer frame counter, it
> is hard to reason about when it actually is. The only possibility is to
> use the "not before" definition, because you just cannot subtract half
> with integers, and the counter ticks when the buffer flips, basically.
>
> But with nanosecond timestamps we actually have a choice.
>
> Since the definition is turns-to-light and not vblank, I still think
> rounding is more appropriate, since it is the closest we can do for a
> request to "show this at time t". But yes, I agree the wanted behaviour
> may depend on the use case.
>

Agreed.
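
For a fixed refresh rate the two semantics, and the half-refresh shift that maps one onto the other, can be written down in a few lines. This is a sketch with hypothetical function names; it assumes vblanks at v0 + k * refresh, integer nanosecond timestamps, and that ties in the "rounding" rule go to the later vblank. A target that falls exactly on a vblank depends on that tie rule, so the equivalence is only claimed for targets strictly between vblanks.

```python
def vblank_not_before(target_ns, v0_ns, refresh_ns):
    """First vblank at or after the target ("never earlier")."""
    k = -(-(target_ns - v0_ns) // refresh_ns)  # ceiling division
    return v0_ns + k * refresh_ns


def vblank_rounded(target_ns, v0_ns, refresh_ns):
    """Vblank closest to the target (ties go to the later vblank)."""
    k = (target_ns - v0_ns + refresh_ns // 2) // refresh_ns
    return v0_ns + k * refresh_ns


REFRESH = 10_000_000  # 100 Hz, in nanoseconds
HALF = REFRESH // 2

target = 21_000_000  # user wants "at 21 ms or later"
print(vblank_not_before(target, 0, REFRESH))      # vblank at 30 ms
print(vblank_rounded(target, 0, REFRESH))         # vblank at 20 ms, too early
print(vblank_rounded(target + HALF, 0, REFRESH))  # vblank at 30 ms again
```

On a dynamic refresh display `REFRESH` is not a constant, which is exactly why the client-side shift stops being reliable there.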

>>
>> As far as i'm concerned i'd tell my users that if they want it to work
>> well enough on a dynamic refresh display, they'd have to provide
>> different target timestamps to accommodate the different display
>> technology - in these cases ones that are optimized for your "rounding"
>> semantics.
>>
>> So i think your proposal is fine. My concern is mostly to not get any
>> surprises on regular common fixed refresh rate displays, so that user
>> code written many years ago doesn't suddenly behave in weird and
>> backwards incompatible ways, even on exactly the same hardware, just
>> because they run it on Wayland instead of X11 or on some other operating
>> system. Having to rewrite dozens of their scripts wouldn't be the
>> behavior my users would expect after upgrading to some future
>> distribution with Wayland as display server ;-)
>
> Keeping an eye on such concerns is exactly where I need help. :-)
>
>>>> 3. I'm not sure if this applies to Wayland or if it is already covered
>>>> by some other part of the Wayland protocol, as i'm not yet really
>>>> familiar with Wayland, so i'm just asking: As part of the present
>>>> feedback it would be very valuable for my kind of applications to know
>>>> how the present was done. Which backend was used? Was the presentation
>>>> performed by a kms pageflip or something else, e.g., some partial buffer
>>>> copy? The INTEL_swap_event extension for X11/GLX exposes this info to
>>>> classic X11/GLX clients. The reason is that drm/kms-pageflips have very
>>>> reliable and precise timestamps attached, so i know those timestamps are
>>>> trustworthy for very timing sensitive applications like mine. If page
>>>> flipping isn't used, but instead something like a (partial) framebuffer
>>>> copy/blit, at least on drm the returned timestamps are then not
>>>> reliable/trustworthy enough, as they could be off by up to 1 video
>>>> refresh duration. My software, upon detecting anything but a
>>>> page-flipped presentation, logs errors or even aborts a session, as
>>>> wrong timestamps or presentation timestamping could be a disaster for
>>>> those applications if it went unnoticed.
>>>>
>>>> So this kind of feedback about how the present was executed would be
>>>> useful to us.
>>>
>>> How presentation was executed is not exposed in the Wayland protocol in
>>> any direct nor reliable way.
>>>
>>> But rather than knowing how presentation was executed, it sounds like
>>> you would really want to know how much you can trust the feedback,
>>> right?
>>>
>>
>> Yes. Which means i also need to know how much i can trust the feedback
>> about how much i can trust the feedback ;-).
>>
>> Therefore i prefer low level feedback about what is actually happening,
>> so i can look at the code myself and also sometimes perform testing to
>> find out, which presentation methods are trustworthy and which not. Then
>> have some white-lists or black-lists of what is ok and what is not.
>>
>>> In Weston's case, there are many ways the presentation can be executed.
>>>
>>> Weston-on-DRM, regardless of using the GL (hardware) or Pixman
>>> (software) renderer, will possibly composite and always flip, AFAIU. So
>>> this is both vsync'd and accurate by your standards.
>>>
>>
>> That's good. This is the main use case, so knowing i use the drm backend
>> would basically mean everything's fine and trustworthy.
>> Most of my users' applications run as unoccluded, undecorated fullscreen
>> windows, just filling one or multiple outputs, so they use page-flipping
>> on X11 as well. For windowed presentation i couldn't make any
>> guarantees, but if Wayland always composes a full framebuffer and flips
>> in such a case then timestamps should be trustworthy.
>
> Yes.
>
>>> Weston-on-fbdev (only Pixman renderer) is not vsync'd, and the
>>> presentation timing is based on reading a software clock after an
>>> arbitrary delay. IOW, the feedback and presentation is rubbish.
>>>
>>> Weston-on-X11 is basically as useless as Weston-on-fbdev, in its current
>>> implementation.
>>>
>>
>> Probably both not relevant for me.
>>
>>> Weston-on-rpi (rpi-renderer) uses vsync'd flips to the best of our
>>> knowledge, but can and will fall back to "firmware compositing" when
>>> the scene is too complex for its "hardware direct scanout engine". We
>>> do not know when it falls back. The presentation time is recorded by
>>> reading a software clock (clock_gettime) in a thread that gets called
>>> asynchronously by the userspace driver library on rpi. It should be
>>> better than fbdev, but for your purposes still a black box of rubbish I
>>> guess.
>>>
>>
>> There were a few requests for Raspberry Pi, but its hardware is mostly
>> too weak for our purposes. So also not a big deal for my purpose.
>
> Well, it depends. If you have only ever seen the X desktop on RPi, then
> you have not seen what it can actually do.
>
> DispmanX (the proprietary display API) can easily allow you to flip
> full-HD buffers at framerate, if you first have enough memory to
> preload them. There is a surprising amount of power compared to the
> "size", but it's all behind proprietary interfaces.
>
> FWIW, Weston on rpi uses DispmanX, and compared to X, it flies.
>

I've seen a couple of impressive things with those in the lab. The 
limitations for my toolkit are more wrt. RAM size, not wrt. gpu. Model 
B's 512 MB RAM are the minimum i'd probably need, but i should probably 
play with one and see how far i get.

>>> It seems like we could have several flags describing the aspects of
>>> the presentation feedback or presentation itself:
>>>
>>> 1. vsync'd or not
>>> 2. hardware or software clock, i.e. DRM/KMS ioctl reported time vs.
>>>      compositor calls clock_gettime() as soon as it is notified the screen
>>>      update is done (so maybe kernel vs. userspace clock rather?)
>>> 3. did screen update completion event exist, or was it faked by an
>>>      arbitrary timer
>>> 4. flip vs. copy?
>>>
>>
>> Yes, that would be sufficient for my purpose. I can always get the
>> OpenGL renderer/vendor string to do some basic checks for which kms
>> driver is in use, and your flags would give me all needed dynamic
>> information to judge reliability.
>
> Except that the renderer info won't help you, if the machine has more
> than one GPU. It is quite possible to render on one GPU and scan out
> on another.
>

Yes, but i already use libpciaccess to enumerate gpu's on the bus, and 
other more scary things, so i guess there will be a few more scary and 
shady low level things to add ;-)
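
On the client side, the four proposed feedback flags could feed a simple trust check like the one below. This is a sketch only: the flag names and bit values are hypothetical and not part of any protocol, and the policy (require vsync, kernel clock and a real completion event, reject copy case a) is just one possible white-list.

```python
from enum import IntFlag


class FeedbackFlags(IntFlag):
    # Hypothetical names for the four aspects listed above.
    VSYNC = 1             # 1. update was vsync'd
    KERNEL_CLOCK = 2      # 2. timestamp reported by DRM/KMS, not userspace
    COMPLETION_EVENT = 4  # 3. a real completion event existed, not a timer
    COPY_TO_FRONT = 8     # 4. copy case a): blit into the on-screen buffer


def timestamps_trustworthy(flags):
    """True if presentation feedback is good enough for timing-critical use."""
    required = (FeedbackFlags.VSYNC
                | FeedbackFlags.KERNEL_CLOCK
                | FeedbackFlags.COMPLETION_EVENT)
    has_required = (flags & required) == required
    copied_to_front = bool(flags & FeedbackFlags.COPY_TO_FRONT)
    return has_required and not copied_to_front


drm_flip = (FeedbackFlags.VSYNC | FeedbackFlags.KERNEL_CLOCK
            | FeedbackFlags.COMPLETION_EVENT)
fbdev = FeedbackFlags(0)
print(timestamps_trustworthy(drm_flip))  # page-flipped presentation on DRM
print(timestamps_trustworthy(fbdev))     # software clock, no vsync
```

A session could log an error or abort whenever the check fails, as is done today for non-page-flipped swaps on X11.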

>>> Now, I don't know what implements the "buffer copy" update method you
>>> referred to with X11, or what it actually means, because I can think of
>>> two different cases:
>>> a) copy from application buffer to the current framebuffer on screen,
>>>      but timed so that it won't tear, or
>>> b) copy from application buffer to an off-screen framebuffer, which is
>>>      later flipped to screen.
>>>
>>> Case b) is the normal compositing case that Weston-on-DRM does in GL and
>>> Pixman renderers.
>>>
>>
>> Case a) was the typical case on classic X11 for swaps of non-fullscreen
>> windows. Case b) is fine. On X11 + old compositors there was no suitable
>> protocol in place. The Xserver/ddx would copy to some offscreen buffer
>> and the compositor would do something with that offscreen buffer, but
>> timestamps would refer to when the offscreen copy was scheduled to
>> happen, not when the final composition showed on the screen. Thereby the
>> timestamps were overly optimistic and useless for my purpose, only good
>> enough for some throttling.
>>
>>> Weston can also flip the application buffer directly on screen if the
>>> buffer is suitable and the scene graph allows; a decision it makes on a
>>> frame-by-frame basis. However, this flip does not imply that *all*
>>> compositing would have been skipped. Maybe there is a hardware overlay
>>> that was just right for one wl_surface, but other things are still
>>> composited, i.e. copied.
>>
>> Given that my app mostly renders to opaque fullscreen windows, i'd
>> expect Weston to probably just flip its buffer to the screen.
>
> Indeed, if you use EGL. At the moment we have no other portable ways to
> post suitable buffers.
>
>>>
>>> I'm not sure what we should be distinguishing here. No matter how
>>> weston updates the screen, the flags 1-3 would still be applicable and
>>> fully describe the accuracy of presentation and feedback, I think.
>>> However, the copy case a) is something I don't see accounted for. So
>>> would the flag 4 be actually "copy case a) or not"?
>>>
>>
>> Yes, if Weston uses page-flipping in the end on drm/kms and the returned
>> timestamp is the page-flip completion timestamp from the kernel, then it
>> doesn't matter if copies were involved somewhere.
>>
>>> The problem is that I do not know why the X11 framebuffer copy case you
>>> describe would have inaccurate timestamps. Do you know where it comes
>>> from, and would it make sense to describe it as a flag?
>>>
>>
>> That case luckily doesn't seem to exist on Weston + drm/kms, if i
>> understand you correctly - and the code in drm_output_repaint(), if i
>> look at the right bits?
>
> You're right on Weston. It still does not prevent someone else writing
> a compositor that does the just-in-time copy into a live framebuffer
> trick, but if someone does that, I would expect them to give accurate
> presentation feedback nevertheless. There is no excuse to give
> anything else.
>
> FYI, while Weston is the reference compositor, there is no "the
> server". Every major DE is writing their own compositor-servers. Most
> will run on DRM/KMS directly, and I think one has chosen to run only on
> top of another Wayland compositor.
>

Yes, that will be a whole lot of extra testing "fun", once i have a 
Wayland backend ;-) -- In the end i'll have to focus on maybe two or 
three compositors like weston, kwin, gnome-shell wrt. timing.

>> Old compositors like, e.g., Compiz, would, for partial screen updates,
>> recompose only the affected part of the backbuffer and then use
>> functions like glXCopySubBufferMESA() or glCopyPixels() to copy the
>> affected backbuffer area to the frontbuffer, maybe after waiting for
>> vblank before executing the command.
>>
>> Client calls glXSwapBuffers/glXSwapBuffersMscOML -> server/ddx queues
>> vblank event in kernel -> kernel delivers vblank event to server/ddx at
>> target msc count -> ...random delay... -> ddx copies client window
>> backbuffer pixmap to offscreen surface and posts damage and returns the
>> vblank timestamp of the triggering vblank event as swap completion
>> timestamp to the client -> ...random delay... -> compositor does its
>> randomly time consuming thing which eventually leads to an update of the
>> display.
>
> I regret I asked. ;-)
>
>> So in any case the returned timestamp from the vblank event delivered by
>> the kernel was always an (overly) optimistic one, also because all the
>> required rendering and copy operations would be queued through the gpu
>> command stream where they might get delayed by other pending rendering
>> commands, or maybe the gpu waiting for the scanout being outside the
>> target area for a copy to avoid tearing.
>>
>> Without a compositor it worked the same for non-fullscreen windows,
>> just that the ddx queued some blit (i think via exa or uxa, or maybe
>> now sna on intel, iirc) into the gpu's command stream, with the same
>> potential delays due to queued commands in the command stream.
>>
>> Only if the compositor was unredirecting fullscreen windows, or no
>> compositor was there at all + fullscreen window, would the client get
>> kms page-flipping and the reliable page flip completion timestamps from
>> the kernel.
>>
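
To put a number on "overly optimistic", here is a deliberately simplified model of the composited path described above. It is only a sketch: the struct, the function name and the delay parameters are hypothetical, and it assumes the compositor's result becomes visible at the first vblank after all queued copy and composition work has finished.

```c
#include <assert.h>
#include <stdint.h>

/* All times in microseconds. Hypothetical names, simplified model. */
struct swap_timing {
    uint64_t reported_us; /* timestamp handed back to the client */
    uint64_t onscreen_us; /* earliest vblank at which the pixels can show */
};

static struct swap_timing
model_composited_swap(uint64_t vblank_us, uint64_t refresh_us,
                      uint64_t copy_delay_us, uint64_t compositor_delay_us)
{
    struct swap_timing t;
    uint64_t total_delay_us = copy_delay_us + compositor_delay_us;
    uint64_t periods;

    assert(refresh_us > 0);

    /* Whole refresh periods needed to cover the delays, rounded up. */
    periods = (total_delay_us + refresh_us - 1) / refresh_us;

    /* The client is told the time of the triggering vblank ... */
    t.reported_us = vblank_us;
    /* ... but the content cannot show before a later vblank. */
    t.onscreen_us = vblank_us + periods * refresh_us;
    return t;
}
```

In this model any nonzero delay pushes the real onscreen time at least one full refresh period past the reported timestamp, while the reported value never moves. With zero delay the two coincide, which corresponds to the page-flip case.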
>>> This system of flags was the first thing that came to my mind. Using
>>> any numerical values like "the error in presentation timestamp is
>>> always between [min, max]" seems very hard to implement in a
>>> compositor, and does not even convey information like vsync or not.
>>>
>>
>> Indeed. I would have trust issues with that.
>>
>>> I suppose a single flag "timestamp is accurate" does not cut it? :-)
>>>
>>> What do you think?
>>>
>>
>> Yes, the flags sound good to me. As long as i know i'm on a drm/kms
>> backend they should be all i need.
>
> That... is a problem. :-)
>
> From your feedback so far, I think you have only requested additional
> features:
> - ability to subscribe to a stream of vblank-like events
> - do-not-skip flag for queued updates
>

Yes, and the present_feedback flags.
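
As a rough sketch of how a client could consume such flags: the names below are made up for illustration and are not part of any protocol, only the four meanings come from the list quoted earlier. The check follows the reasoning above that flip vs. copy stops mattering once the timestamp is a vsync'd, kernel-reported completion timestamp.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag names; nothing like this is defined in the protocol yet. */
enum feedback_flags {
    FEEDBACK_VSYNC         = 1u << 0, /* 1. update was synchronized to vblank */
    FEEDBACK_HW_CLOCK      = 1u << 1, /* 2. kernel (DRM/KMS) timestamp, not a
                                       *    userspace clock_gettime() call */
    FEEDBACK_HW_COMPLETION = 1u << 2, /* 3. real completion event, not faked
                                       *    by an arbitrary timer */
    FEEDBACK_FLIP          = 1u << 3, /* 4. flipped rather than copied */
};

/* True if the timestamp can be trusted for timing-sensitive work.
 * Flag 4 is informational only and deliberately not required here. */
static bool feedback_is_trustworthy(uint32_t flags)
{
    const uint32_t required =
        FEEDBACK_VSYNC | FEEDBACK_HW_CLOCK | FEEDBACK_HW_COMPLETION;

    return (flags & required) == required;
}
```

A client would call feedback_is_trustworthy() on the flags of each feedback event and bail out, or at least warn, as soon as one of the three required bits is missing.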

One more useful flag for me could be to know if the presented frame was 
composited - together with some other content - or if my own buffer was 
just flipped onscreen / no composition took place. More specifically - 
and maybe there's already some other event in the protocol for that - 
i'd like to know if my presented surface was obscured by something else, 
e.g., some kind of popup window like "system updates available", "you 
have new mail", a user ALT+Tabbing away from the window, etc. On X11 i find 
out indirectly about such unwanted visual disruptions because the 
compositor would fall back to compositing instead of simple 
page-flipping. On Wayland something similar would be cool if it doesn't 
already exist.

Not sure if that still belongs in a present_feedback extension though.

> I will think about those.
>

Cool, thanks!
-mario