[RFC v2] Wayland presentation extension (video protocol)

Thu Jan 30 07:35:17 PST 2014

Hi,

it's time for a take two on the Wayland presentation extension.

		1. Introduction

The v1 proposal is here:
http://lists.freedesktop.org/archives/wayland-devel/2013-October/011496.html

In v2 the basic idea is the same: you can queue frames with a
target presentation time, and you can get accurate presentation
feedback. All the details are new, though. The re-design started
from the wish to handle resizing better, preferably without
clearing the buffer queue.

All the changed details are probably too much to describe here,
so it is maybe better to look at this as a new proposal. It
still builds on Frederic's work and on the input of everyone who
commented on it. Special thanks to Axel Davy for his
counter-proposal and fighting with me on IRC. :-)

Some highlights:

- Accurate presentation feedback is also possible without
  queueing.

- You can also queue EGL-based rendering, and get presentation
  feedback if you want. EGL can do this internally, too, as long
  as EGL and the app do not try to use queueing at the same time.

- More detailed presentation feedback to better allow predicting
  future display refreshes.

- If wl_viewport is used, neither video resolution changes nor
  surface (window) size changes alone require clearing the queue.
  Video can continue playing even during resizes.

The protocol interfaces are arranged as

	global.method(wl_surface, ...)

just for brevity. We could as well use the factory approach:

	o = global.get_presentation(wl_surface)
	o.method(...)

Or if we wanted to make it a mandatory part of the Wayland core
protocol, we could just extend wl_surface itself:

	wl_surface.method(...)

and put the clock_id event in wl_compositor. That all is still
open and fairly uninteresting, so let's concentrate on the other
details.

The proposal refers to wl_viewport.set_source and
wl_viewport.destination requests, which do not yet exist in the
scaler protocol extension. These are just the wl_viewport.set
arguments split into separate src and dst requests.

Here is the new proposal; some design rationale follows. Please
do ask why something is designed the way it is if it puzzles
you. I have a load of notes I couldn't clean up for this email.
This is not even intended to solve all XWayland needs completely,
but for everything native on Wayland I hope it is sufficient.

		2. The protocol specification

<?xml version="1.0" encoding="UTF-8"?>
<protocol name="presentation_timing">

  <copyright>
    Copyright © 2013-2014 Collabora, Ltd.

    Permission to use, copy, modify, distribute, and sell this
    software and its documentation for any purpose is hereby granted
    without fee, provided that the above copyright notice appear in
    all copies and that both that copyright notice and this permission
    notice appear in supporting documentation, and that the name of
    the copyright holders not be used in advertising or publicity
    pertaining to distribution of the software without specific,
    written prior permission.  The copyright holders make no
    representations about the suitability of this software for any
    purpose.  It is provided "as is" without express or implied
    warranty.

    THE COPYRIGHT HOLDERS DISCLAIM ALL WARRANTIES WITH REGARD TO THIS
    SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
    FITNESS, IN NO EVENT SHALL THE COPYRIGHT HOLDERS BE LIABLE FOR ANY
    SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
    WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN
    AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
    ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF
    THIS SOFTWARE.
  </copyright>

  <interface name="presentation" version="1">
    <description summary="timed presentation related wl_surface requests">
      The main features of this interface are accurate presentation
      timing feedback, and queued wl_surface content updates to ensure
      smooth video playback while maintaining audio/video
      synchronization. Some features use the concept of a presentation
      clock, which is defined in the presentation.clock_id event.

      Requests 'feedback' and 'queue' can be regarded as additional
      wl_surface methods. They are part of the double-buffered
      surface state update mechanism, where other requests first set
      up the state and then wl_surface.commit atomically applies the
      state into use. In other words, wl_surface.commit submits a
      content update.

      Interface wl_surface has requests to set surface related state
      and buffer related state, because there is no separate interface
      for buffer state alone. Queueing requires separating the surface
      from buffer state, and buffer state can be queued while surface
      state cannot.

      Buffer state includes the wl_buffer from wl_surface.attach, the
      state assigned by wl_surface requests frame,
      set_buffer_transform and set_buffer_scale, and any
      buffer-related state from extensions, for instance
      wl_viewport.set_source. This state is inherent to the buffer
      and the content update, rather than the surface.

      Surface state includes all other state associated with
      wl_surfaces, like the x,y arguments of wl_surface.attach, input
      and opaque regions, damage, and extension state like
      wl_viewport.destination. In general, anything expressed in
      surface local coordinates is better as surface state.

      The standard way of posting new content to a surface using the
      wl_surface requests damage, attach, and commit is called
      immediate content submission. This happens when a
      presentation.queue request has not been sent since the last
      wl_surface.commit.

      The new way of posting a content update is a queued content
      update submission. This happens on a wl_surface.commit when a
      presentation.queue request has been sent since the last
      wl_surface.commit.

      Queued content updates do not get applied immediately in the
      compositor but are pushed to a queue on receiving the
      wl_surface.commit. The queue is ordered by the submission target
      timestamp. Each item in the queue contains the wl_buffer, the
      target timestamp, and all the buffer state as defined above. All
      the queued state is taken from the pending wl_surface state at
      the time of the commit, exactly like an immediate commit would
      have taken it.

      For instance on a queueing commit, the pending buffer is queued
      and no buffer is pending afterwards. The stored values of the
      x,y parameters of wl_surface.attach are reset to zero, but they
      also are not queued; queued content updates do not carry the
      attach offsets. All other surface state (which is not queued),
      e.g. damage, is neither applied nor reset.

      Issuing a queueing commit without a wl_surface.attach is
      undefined. However, queueing a commit with explicitly attached
      NULL wl_buffer works; when and if the content update is
      executed, the surface content is removed as defined for
      wl_surface.attach.

      If a queued content update has been submitted, and the wl_buffer
      used in the update is destroyed before the wl_buffer.release
      event, the results are undefined. The compositor may or may not
      have executed the update, therefore the surface contents become
      undefined as explained in wl_surface.attach. Whether any
      presentation feedback or frame callbacks occur is undefined.

      For each surface, the compositor maintains an association to a
      single output that is considered as the main output for the
      surface. Queued content updates are synchronized to the
      surface's main output, to provide a consistent and meaningful
      definition of the moment the update is displayed to the user.
      When a compositor updates an output, it processes only the
      queues of the surfaces whose main output is the one being
      updated. The queues of other surfaces, even if they are part of
      the redrawing, are not processed at that time.

      When a compositor chooses to update an output, it must predict
      the presentation clock value when the display update will occur.
      For the definition of the moment of display update, see
      presentation_feedback.presented. Therefore if the prediction is
      absolutely perfect, presentation_feedback.presented will carry
      the same clock value.

      For each surface with queued content updates and matching main
      output, the compositor picks the update with the highest
      timestamp no later than a half frame period after the predicted
      presentation time. The intent is to pick the content update
      whose target timestamp as rounded to the output refresh period
      granularity matches the same display update as the compositor is
      targeting, while not displaying any content update more than a
      half frame period too early. If all the updates in the queue are
      already late, the highest timestamp update is taken regardless
      of how late it is. Once an update in a queue has been chosen,
      all remaining updates with an earlier timestamp in the queue are
      discarded.

      The compositor applies the chosen update to the wl_surface,
      regardless of possible wl_subsurface.set_sync mode. This allows
      e.g. a video to continue running in a sub-surface also during
      window resizing. It is assumed that buffer state updates do not
      cause visual disruption to the window like surface state updates
      can. Support for wl_viewport is needed for glitch-free resizing
      if the resizing involves changing the (sub-)surface size.

      When the chosen update is applied, the associated frame
      callbacks are sent. Damage for the whole surface is assumed,
      as damage is not explicitly queued with buffer state.

      When the final realized presentation time is available, e.g.
      after a framebuffer flip completes, the requested
      presentation_feedback.presented events are sent. The final
      presentation time can differ from the compositor's predicted
      display update time and the update's target time, especially
      when the compositor misses its target vertical blanking period.

      When updates from the queue are discarded, the
      presentation_feedback.discarded event is delivered if feedback
      was requested. Also the associated frame callbacks are sent.

      An immediate content update with an attach request automatically
      discards the whole queue just before the update gets applied. If
      wl_surface.attach has not been sent for an immediate content
      submission, the queue is not discarded, and the content update
      applies only the surface state, but no buffer state.

      If a wl_surface has queued content updates when it is destroyed,
      the whole queue is implicitly discarded as if
      presentation.discard_queue was sent immediately prior to the
      destroy request.
    </description>

    <request name="destroy" type="destructor">
      <description summary="unbind from the presentation interface">
        Informs the server that the client will not be using this
        protocol object anymore. This does not affect any content
        update queues nor existing objects created by this interface.
      </description>
    </request>

    <request name="feedback">
      <description summary="request presentation feedback information">
        With this request, presentation feedback will be provided for
        the current content submission of the given surface. A new
        presentation_feedback object is created, and that object will
        deliver the information once. The object is tied to this
        content submission only. Multiple presentation_feedback objects
        may be created for the same submission, and they will all
        return the same information.

        For details on what information is returned, see the
        presentation_feedback interface.
      </description>

      <arg name="surface" type="object" interface="wl_surface"
           summary="target surface"/>
      <arg name="callback" type="new_id" interface="presentation_feedback"
           summary="new feedback object"/>
    </request>

    <request name="queue">
      <description summary="queue the buffer instead of immediate presentation">
        This request changes the behaviour of the very next
        wl_surface.commit of the given wl_surface and that commit
        only. Instead of immediately applying the pending wl_surface
        state as defined in wl_surface.commit, the commit will queue a
        new content update, using the pending buffer state only.

        For a more detailed description and the definition of buffer
        state, see the documentation for the presentation interface.

        The value of the target timestamp is in the presentation clock
        domain, see presentation.clock_id.

        If a queue request has already been sent for the unfinished
        content update submission on the given wl_surface, a new queue
        request will override the previous one.
      </description>

      <arg name="surface" type="object" interface="wl_surface"
           summary="target surface"/>
      <arg name="tv_sec" type="uint"
           summary="seconds part of the target timestamp"/>
      <arg name="tv_nsec" type="uint"
           summary="nanoseconds part of the target timestamp"/>
    </request>

    <request name="discard_queue">
      <description summary="discard the whole queue of the given surface">
        This request discards the whole remaining content update queue
        of the given wl_surface. Once the compositor has processed
        this request, no more queued updates will happen on the
        surface until the client queues new updates. Discard_queue is
        processed immediately when the compositor dispatches the
        request.

        A client can issue a wl_display.sync after this, and once the
        sync returns, the client has received all
        presentation_feedback.discarded events resulting from the
        discard_queue. However, presentation_feedback.presented
        events may arrive later if the compositor executed a queued
        content update before the discard_queue.
      </description>

      <arg name="surface" type="object" interface="wl_surface"
           summary="target surface"/>
    </request>

    <event name="clock_id">
      <description summary="clock ID for timestamps">
        This event tells the client in which clock domain the
        compositor interprets the timestamps used by the presentation
        extension. This clock is called the presentation clock.

        The compositor sends this event when the client binds to the
        presentation interface. The presentation clock does not change
        during the lifetime of the client connection.

        The clock identifier is platform dependent. Clients must be
        able to query the current clock value directly, not by asking
        the compositor.

        On Linux/glibc, the identifier value is one of the clockid_t
        values accepted by clock_gettime(). clock_gettime() is defined
        by POSIX.1-2001.

        Compositors should prefer a clock that does not jump and is
        not slewed e.g. by NTP. The absolute value of the clock is
        irrelevant. Precision of one millisecond or better is
        recommended.

        Timestamps in this clock domain are expressed as tv_sec,
        tv_nsec pairs, each component being an unsigned 32-bit value.
        Whole seconds are in tv_sec, and the additional fractional
        part in tv_nsec as nanoseconds. Hence, for valid timestamps
        tv_nsec must be in [0, 999999999].

        Note that clock_id applies only to the presentation clock,
        and implies nothing about e.g. the timestamps used in the
        Wayland core protocol input events.
      </description>

      <arg name="clk_id" type="uint" summary="platform clock identifier"/>
    </event>

  </interface>

  <interface name="presentation_feedback" version="1">
    <description summary="presentation time feedback event">
      A presentation_feedback object returns the feedback information
      about a wl_surface content update becoming visible to the user.
      One object corresponds to one content update submission
      (wl_surface.commit), queued or immediate. There are two possible
      outcomes: the content update may be presented to the user, in
      which case the presentation timestamp is delivered. Otherwise,
      the content update is discarded, and the user never had a chance
      to see it before it was superseded or the surface was destroyed.

      Once a presentation_feedback object has delivered an event, it
      becomes inert, and should be destroyed by the client.
    </description>

    <request name="destroy" type="destructor">
      <description summary="destroy presentation feedback object">
        The object is destroyed. If a feedback event had not been
        delivered yet, it is cancelled.
      </description>
    </request>

    <event name="sync_output">
      <description summary="presentation synchronized to this output">
        As presentation can be synchronized to only one output at a
        time, this event tells which output it was. This event is only
        sent prior to the presented event.

        As clients may bind to the same global wl_output multiple
        times, this event is sent for each bound instance that matches
        the synchronized output. If a client has not bound to the
        right wl_output global at all, this event is not sent.
      </description>

      <arg name="output" type="object" interface="wl_output"
           summary="presentation output"/>
    </event>

    <event name="presented">
      <description summary="the content update was displayed">
        The associated content update was displayed to the user at the
        indicated time (tv_sec, tv_nsec). For the interpretation of the
        timestamp, see presentation.clock_id event.

        The timestamp corresponds to the time when the content update
        turned into light the first time on the surface's main output.
        Compositors may approximate this from the framebuffer flip
        completion events from the system, and the latency of the
        physical display path if known.

        This event is preceded by all related sync_output events
        telling which output's refresh cycle the feedback corresponds
        to, i.e. the main output for the surface. Compositors are
        recommended to choose the output containing the largest part
        of the wl_surface, or to keep the output they previously
        chose. Having a stable presentation output association helps
        clients to predict future output refreshes (vblank).

        Argument 'refresh' gives the compositor's prediction of how
        many nanoseconds after tv_sec, tv_nsec the very next output
        refresh may occur. This is to further aid clients in
        predicting future refreshes, i.e., estimating the timestamps
        targeting the next few vblanks. If such prediction cannot
        usefully be done, the argument is zero.

        The 64-bit value combined from seq_hi and seq_lo is the value
        of the output's vertical retrace counter when the content
        update was first scanned out to the display. This value must
        be compatible with the definition of MSC in
        GLX_OML_sync_control specification. Note that if the display
        path has a non-zero latency, the time instant specified by
        this counter may differ from the timestamp's.

        If the output does not have a constant refresh rate, explicit
        video mode switches excluded, then the refresh argument must
        be zero.

        If the output does not have a concept of vertical retrace or a
        refresh cycle, or the output device is self-refreshing without
        a way to query the refresh count, then the arguments seq_hi
        and seq_lo must be zero.
      </description>

      <arg name="tv_sec" type="uint"
           summary="seconds part of the presentation timestamp"/>
      <arg name="tv_nsec" type="uint"
           summary="nanoseconds part of the presentation timestamp"/>
      <arg name="refresh" type="uint" summary="nanoseconds till next refresh"/>
      <arg name="seq_hi" type="uint"
           summary="high 32 bits of refresh counter"/>
      <arg name="seq_lo" type="uint"
           summary="low 32 bits of refresh counter"/>
    </event>

    <event name="discarded">
      <description summary="the content update was not displayed">
        The content update was never displayed to the user.
      </description>
    </event>
  </interface>

</protocol>
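
To illustrate the update selection rule from the presentation
interface description, here is a minimal sketch in C. It is not
part of the proposal. The struct and function names are made up
for this example, and it assumes the queue is kept sorted by
ascending target timestamp with all times already converted to
nanoseconds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical queue entry. A real compositor would also hold the
 * wl_buffer and the rest of the queued buffer state here. */
struct queued_update {
	uint64_t target_ns;	/* target timestamp, presentation clock */
};

/*
 * Pick the update to apply for a refresh predicted at predicted_ns.
 * queue[] must be sorted by ascending target_ns.
 *
 * Returns the index of the chosen update, or -1 if the queue is
 * empty or every update is still more than half a refresh period
 * in the future. Entries before the returned index are the ones
 * that get discarded.
 */
static int
pick_update(const struct queued_update *queue, size_t len,
	    uint64_t predicted_ns, uint64_t refresh_ns)
{
	uint64_t limit = predicted_ns + refresh_ns / 2;
	int chosen = -1;
	size_t i;

	for (i = 0; i < len; i++) {
		/* The sketch relies on the queue being sorted. */
		assert(i == 0 ||
		       queue[i - 1].target_ns <= queue[i].target_ns);

		if (queue[i].target_ns > limit)
			break;
		chosen = (int)i;
	}

	return chosen;
}
```

Note that the "all updates are already late" case needs no
special handling here: every late update is below the limit, so
the loop ends on the one with the highest timestamp.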

		3. Why UST all the way?

	3.1. UST and MSC pros and cons

Unadjusted System Time (UST) and graphics Media Stream Counter
(MSC) are defined by the GLX_OML_sync_control specification. UST
is basically a stable wall clock with a tick rate close to the
"universal true time", i.e. the real time, while MSC is a frame
or refresh cycle counter and not a clock.

Should we use UST or MSC, or maybe allow both, for queueing
buffers and measuring the presentation time?

MSC pro: it is what all graphics systems that I know of seem to
be using currently, and it matches how past and most current
hardware works.

UST con: Would be a new concept to adapt to for frame counter
based algorithms. For actual hardware operations, needs to be
converted to MSC in most cases.

MSC con: tick rate depends on the output device currently in use
for the window, and can also change with video mode switches.

UST pro: tick rate is guaranteed to be constant.

MSC con: for an output device that is not based on periodic
refresh cycles, e.g. on-demand refresh or variable rate, it has
no predictable correspondence to the "universal true time".

UST pro: once you establish the relationship to "universal true
time", it holds practically indefinitely. This means you can
reliably relate UST to other proper clocks and maintain e.g.
audio/video sync.

MSC pro: corresponds exactly to when a monitor refreshes,
provided that the monitor uses periodic refreshing. The MSC
increment between two consecutive vblanks is 1.

UST con: to hit a vblank, you have to estimate and compute the
right UST value based on feedback data. The UST increment
between two consecutive vblanks may not be an integer
(nanoseconds).

MSC con: an application cannot read the current MSC value on its
own, it needs to ask the display server about it, which is a
protocol roundtrip to another process. Determining current MSC
also involves determining which window or output it should
relate to.

UST pro: an application can query the current UST value directly
with a system call (clock_gettime).
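
As a rough sketch of that last point, a client could read the
presentation clock like this on Linux/glibc. The function names
are made up for this example. The clk_id argument stands for the
value received in presentation.clock_id, and the sketch assumes
it is a valid clockid_t.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L	/* for clock_gettime() */
#endif

#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Convert a protocol (tv_sec, tv_nsec) pair to nanoseconds. */
static uint64_t
timestamp_to_ns(uint32_t tv_sec, uint32_t tv_nsec)
{
	/* Valid timestamps have tv_nsec in [0, 999999999]. */
	assert(tv_nsec < 1000000000u);

	return (uint64_t)tv_sec * 1000000000ull + tv_nsec;
}

/*
 * Read the current presentation clock value in nanoseconds.
 * Returns 0 on success and -1 if clock_gettime() fails, e.g.
 * because clk_id is not a clock this system knows about.
 */
static int
read_presentation_clock(uint32_t clk_id, uint64_t *now_ns)
{
	struct timespec ts;

	if (clock_gettime((clockid_t)clk_id, &ts) < 0)
		return -1;

	*now_ns = (uint64_t)ts.tv_sec * 1000000000ull +
		  (uint64_t)ts.tv_nsec;
	return 0;
}
```

No roundtrip to the compositor is involved, which is the whole
point of the comparison with MSC above.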

	3.2. Conclusions:

	MSC alone cannot be used to achieve reliable A/V sync,
because its relationship to any other clock (e.g. audio clock)
can change: MSC may suddenly start to tick faster or slower.
Only UST is reliably synchronizable to other clocks, therefore
UST should be the "common language" in the Wayland protocol.
Using MSC would always require some context, like the window or
output.

	Clients need to estimate the UST values when vblanks
happen, so that they can schedule presentation for certain
monitor refresh cycles and to reduce jitter and latency. Also
compositors need to convert presentation UST timestamps into
hardware MSC values or equivalent, since the update can only
happen during vblanks. However, this applies only to periodic
scanout style hardware, and not to variable refresh rate or
on-demand monitors. Therefore UST, while being more complicated
to use, is the more future-proof concept.
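
For the client side, the estimation could look roughly like the
sketch below. It is only an illustration. The function names are
made up, and it assumes a constant refresh rate, so that the
'refresh' argument of presentation_feedback.presented can be used
as the refresh period.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ull

/*
 * Estimate the first output refresh strictly after now_ns, based
 * on the latest presented event: presented_ns is its timestamp and
 * refresh_ns its 'refresh' argument.
 *
 * Returns 0 if no prediction can be made (refresh_ns is zero).
 */
static uint64_t
next_refresh_ns(uint64_t presented_ns, uint32_t refresh_ns,
		uint64_t now_ns)
{
	uint64_t periods;

	if (refresh_ns == 0)
		return 0;

	/* Feedback always describes a presentation in the past. */
	assert(now_ns >= presented_ns);

	periods = (now_ns - presented_ns) / refresh_ns + 1;

	return presented_ns + periods * refresh_ns;
}

/* Split nanoseconds into the pair that presentation.queue takes. */
static void
ns_to_timestamp(uint64_t ns, uint32_t *tv_sec, uint32_t *tv_nsec)
{
	*tv_sec = (uint32_t)(ns / NSEC_PER_SEC);
	*tv_nsec = (uint32_t)(ns % NSEC_PER_SEC);
}
```

To target a refresh further in the future, a client would add
more whole periods to the result before converting it.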

	If the compositor can affect when the monitor is
refreshed, using an MSC as presentation target time would not
give any clue of when the presentation should actually happen,
leaving variable refresh rate display hardware with basically no
advantage.

		4. Bits and pieces

	4.1. Damage

Do not queue damage, because damage is in surface coordinates,
and syncing it from the queue is hard. Do not queue anything
that is in surface coordinates. Doing so would require
discarding the whole queue whenever the surface size changes.
The design explicitly allows changing surface and buffer sizes
asynchronously, if wl_viewport is available.

	4.2. Moment of presentation

Presentation time is defined as "turns into light", because
modern TVs may have significant latency between a pixel going
onto the wire and the TV turning it into light. This should not
conflict with the definition in GLX_OML_sync_control, which was
presumably written in the era of CRT monitors, when the
turn-to-light latency was insignificant.

	4.3. Swap buffer count

SBC does not require any protocol nor server side support,
because the client is in complete control of the swaps to the
wl_surface or gets the needed feedback via presentation_feedback
and frame callbacks, and no other client can access the same
wl_surface.

	4.4. Intentional tearing

Implementing GLX_EXT_swap_control_tear would require the Wayland
compositor to cause tearing on purpose. Hence it is not
considered here.

	4.5. The frame callback and swap interval

The frame callback needs to be with the buffer state, so it gets
queued. If a client makes e.g. EGL's commits queued, EGL may
still rely on frame callbacks for blocking apps properly, and
that is related to presenting the buffer, not just the very next
output refresh. EGL may also internally use queueing and
feedback to implement swap interval > 1.

	4.6. Interlacing

Supporting interlaced material and displays is punted to a
later extension. Presumably the protocol supporting interlaced
content would be as simple as having an extra wl_surface-like
request to say on which of the two fields the content should be
displayed first. The field designation would be an additional
restriction on when a content update should initially hit the
screen. I.e. if both field and target timestamp are given, both
conditions must pass. This means that giving a field may delay
the presentation by one output refresh cycle, assuming the
output scans out alternating fields. Additionally there should
be an extension to inform the client which field the top-most
scanline of the buffer will hit, or equivalent information. This
assumes that the even scanlines in a buffer correspond to one
field, and the odd scanlines correspond to the other field,
regardless of how these terms are defined.

		5. X11 Present and XWayland

This is a comparison between X11 Present and Wayland with the
presentation extension, or rather a sketch of how to map one to
the other. It is meant to give some confidence that Present
could be supported on XWayland.

http://cgit.freedesktop.org/xorg/proto/dri3proto/tree/dri3proto.txt
http://cgit.freedesktop.org/xorg/proto/presentproto/tree/presentproto.txt

The fundamental difference between X11 Present and Wayland
(without XWayland specific extensions) is that Present supports
scheduled copy operations, which in the pathological cases
cannot easily be done in advance. Wayland requires complete
buffers, but Present may imply blits as a part of posting window
content to display.

The workhorse of X11 Present is the PresentPixmap request. Its
arguments with their likely corresponding Wayland concepts are:

- window: wl_surface

- pixmap: wl_buffer?

- serial: a new presentation_feedback object

- valid-area: N/A, always assumed to be the whole buffer. If not
the whole buffer, the X server must create a copy.

- update-area: wl_surface.damage

- x-off, y-off: N/A, non-zero values imply a copy

- wait-fence: N/A, Wayland does not have an explicit rendering
fence protocol. It is assumed that the underlying driver stack
prevents reading from a buffer whose rendering is still on-going
or pending.

- idle-fence/PresentIdleNotify: wl_buffer.release, but since
wl_buffer.release is poorly suited to hardware buffers
(implies CPU/GPU synchronization), we may want real sync objects
that can be off-loaded to GPUs, especially with dma-buf and
multi-GPU.

- target-msc: presentation.queue with UST target time, requires
computing the desired UST from the given MSC and historical
feedback data

- divisor: ignored
- remainder: ignored

- target-crtc: N/A

- options:
	* PresentOptionAsync: ignored, presentation is always synced to
	vblank
	* PresentOptionCopy: N/A, must be implemented as a copy in the X
	server
	* PresentOptionUST: presentation.queue with UST target time

PresentPixmap corresponds to wl_surface.commit.
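
The target-msc conversion mentioned above could be sketched as
follows. This is only an illustration with made-up names. It
assumes a constant refresh rate and a single feedback sample,
whereas a real implementation would likely filter over several
samples.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one presentation_feedback.presented. */
struct feedback_sample {
	uint64_t ust_ns;	/* from tv_sec, tv_nsec */
	uint64_t msc;		/* from seq_hi, seq_lo */
	uint32_t refresh_ns;	/* from 'refresh' */
};

/* Combine seq_hi and seq_lo into the 64-bit refresh counter. */
static uint64_t
combine_seq(uint32_t seq_hi, uint32_t seq_lo)
{
	return ((uint64_t)seq_hi << 32) | seq_lo;
}

/*
 * Estimate the UST target time for a Present target-msc.
 * Returns 0 and stores the estimate on success, or -1 if the
 * sample carries no refresh prediction (refresh_ns is zero).
 * Overflow for absurdly distant targets is not handled.
 */
static int
target_msc_to_ust(const struct feedback_sample *last,
		  uint64_t target_msc, uint64_t *target_ust_ns)
{
	assert(last != NULL && target_ust_ns != NULL);

	if (last->refresh_ns == 0)
		return -1;

	if (target_msc <= last->msc) {
		/* Target already passed: ask for "as soon as possible". */
		*target_ust_ns = last->ust_ns;
		return 0;
	}

	*target_ust_ns = last->ust_ns +
		(target_msc - last->msc) * last->refresh_ns;
	return 0;
}
```

The result would then be split into tv_sec and tv_nsec for
presentation.queue.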

There are differences between Wayland and X11 when the target
timestamp is already in the past when the server processes it.
I am unsure whether queueing from the XWayland X server works out.

PresentNotifyMSC would require an XWayland extension to get a callback
event when the output's MSC reaches a certain value. Alternatively, if
the XWayland X server can estimate the UST from the target-msc, it can
simply use a timer. It probably wants the extension, really.

PresentQueryCapabilities:
- PresentCapabilityAsync is never true
- PresentCapabilityFence: the Wayland protocol needed for it is missing
- PresentCapabilityUST implies that the display can refresh on-demand;
enable this always so that X11 apps would prefer UST timestamps instead
of MSC? Or add xwayland extension to ask the compositor?

PresentCompleteNotify as reply to PresentPixmap:
both presentation_feedback.presented/discarded and wl_buffer.release to
determine 'mode'
- PresentCompleteModeSkip if discarded and release
- PresentCompleteModeFlip if presented and no release
- PresentCompleteModeCopy if presented and release
- discarded and no release should not happen if xwayland does not submit
the same wl_buffer to multiple surfaces or queue multiple times.
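
The mode selection above is small enough to write down as code.
A minimal sketch, using made-up enum names rather than the real
Present protocol constants, where 'released' is assumed to mean
that wl_buffer.release has been received for the buffer:

```c
#include <assert.h>
#include <stdbool.h>

/* Which presentation_feedback event arrived. */
enum feedback_result {
	FEEDBACK_PRESENTED,
	FEEDBACK_DISCARDED
};

/* Stand-ins for the PresentCompleteMode* values. */
enum complete_mode {
	COMPLETE_MODE_SKIP,
	COMPLETE_MODE_FLIP,
	COMPLETE_MODE_COPY,
	COMPLETE_MODE_INVALID	/* discarded and no release */
};

static enum complete_mode
complete_mode_from_feedback(enum feedback_result result, bool released)
{
	assert(result == FEEDBACK_PRESENTED ||
	       result == FEEDBACK_DISCARDED);

	if (result == FEEDBACK_PRESENTED)
		return released ? COMPLETE_MODE_COPY : COMPLETE_MODE_FLIP;

	return released ? COMPLETE_MODE_SKIP : COMPLETE_MODE_INVALID;
}
```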

PresentIdleNotify:
triggered by wl_buffer.release with 'idle-fence' as None, or if using a
fence, then wl_buffer.release signals the fence while PresentIdleNotify
is delivered ahead of time.

Thanks,
pq