[RFC] wl_surface video protocol extension

Mon Oct 21 21:45:35 CEST 2013

Hi,

On 18 October 2013 11:52, James Courtier-Dutton <james.dutton at gmail.com> wrote:
> On 18 October 2013 08:30, Pekka Paalanen <ppaalanen at gmail.com> wrote:
>> On Thu, 17 Oct 2013 21:34:02 +0100
>> James Courtier-Dutton <james.dutton at gmail.com> wrote:
>> > All its prediction calculations will be based on a fixed scanout rate.
>>
>> Why only fixed scanout rate? Why can't you dynamically adapt to the
>> latency between requested and realized presentation time, over time?
>
> Conceptually, the player is not requesting a presentation time. It can't.
> The scanout does not change to match the player.

No, of course it can't.  But when the player generates one particular
video buffer, it knows it must be displayed at a particular time to
match the audio which goes with the frame.

> The player predicts that a particular frame scanout will happen at time X,
> and the player then decides what it wants displayed for that scanout.
> So what we need is:
> 1) Predictable scanout times.  <- We can get this from the historical
> callback timestamps as you mentioned.
> 2) A Deterministic way to ensure that the frame the player wants to appear
> at a particular scanout does actually appear then.
> It is this second point that I am not sure we have yet.

This is what we're working towards with the extension.

>> I don't think drivers predict anything, do they? Right now, drivers
>> only tell you what happened (the presentation timestamp), if even that.
>> AFAIK, most popular drivers do not currently even have queues. You can
>> schedule *one* page flip at a time, and you cannot even cancel that
>> later if you wanted to, from what I've heard.
>
> The current situation is not ideal, so players can only achieve best effort.

Short of a real-time OS with a totally unchanging UI and a wholly
predictable stream of data coming in, it's never going to be ideal.

> But when you can get deterministic frame display, the player can do a much
> better job.
> The difference is noticeable to the person watching. The artifact that is most
> noticeable is jitter, caused by missing a scanout or the wrong frame
> appearing on the wrong scanout.

Yes, I agree.  Well, that and bad A/V sync, but at least the latter's constant.

>> In the future I'm hoping we would have drivers where you could actually
>> queue buffers for hardware overlays, so the buffer queue does not need
>> to live completely in the userspace at the mercy of scheduling hiccups
>> and system load. But even then I can't imagine drivers doing any
>> predictions, they would just try to meet requested presentation times.
>> If drivers predict, they will do it internally only to make sure the
>> requested presentation time is realized. *How* it is realized is a good
>> question (never show before the requested time? or make sure it is on
>> screen at the requested time? or...)
>
> For low power applications, like smart phones with a separate APU and GPU,
> there is a possibility to send say 10 frames from the APU to the GPU. The
> APU can then sleep for 10 frames, while the GPU outputs the frames at all
> the correct scanouts. The APU then wakes up to gather the next 10 frames.
> So, in this scenario, the API between the APU and GPU would need to be able
> to request the previous 10 frame scanout timestamps, because an individual
> frame callback would not work because the APU was sleeping.

The proposed frame-time feedback API (which should be sent out as an
RFC quite soon) doesn't actually look like this:
    wl_surface.attach()
    wl_surface.commit()
    wl_surface.attach()
    wl_surface.commit()
    [...]
    wl_surface.request_last_presentation_times(10)

But rather like this:
    wl_surface.attach()
    presentation_time = wl_presentation_request.presentation_time()
    wl_surface.commit()
    [compositor fires event on presentation_time]

Having a blocking reply-based API is totally unworkable, since the
only thing it actually does is introduce jitter, when your entire
pipeline is blocked on a reply back from the compositor.

>> In any case, I do not see the compositor being responsible for
>> predicting times for clients' buffers. In my view, it is clients'
>> responsibility to predict in any way they want, based on hard facts of
>> the past given by the compositor and drivers. The client (player) is the
>> only one knowing everything relevant, like the original video frame rate
>> and audio position, and whether it wants to prefer showing frames early,
>> late, when to skip, etc.
>
> If you are mentioning the compositor, a nice feature would be to be able to
> send the video stream to the compositor, then also an overlay to be mixed
> with the video stream just before display. Both the video stream and the
> overlay need a way to make sure they are displayed at the correct scanout.
> There is also a need for another overlay to not need to be deterministic.
> E.g. 1 Video stream. One overlay stream for subtitles. <- need to be in sync
> for eg. karaoke.
> second overlay for non-deterministic item such as menus, program guide
> overlay on video etc.

The wl_subsurface extension does this already.  But the reasoning
reinforces what I was saying above: imagine you have a YUV video
stream being displayed directly, which is able to be displayed on an
overlay.  Now you open a menu with an ARGB subtitle texture atop it,
and you have to route it through the GPU.  Because it's a relatively
slow mobile GPU, the texture setup time (and thus your latency) blows
out by about 16ms.

This is why we want to have events, so clients can start adjusting as
soon as possible.

>> If you are concerned about video start, where you do not yet have
>> measurements available to predict from, then yeah, all you have is the
>> current monitor framerate. But after the first frame has been
>> presented, you get the first presentation timestamp, which should allow
>> you to adapt to the phase of the scanout cycle. Right?
>
> So long as there has not been a frame rate change recently, historical data
> should be available.
> If you are changing the frame rate at the very start of the video, you can
> normally wait a few frames before starting the video, allowing the display
> to settle.

No, having startup time be that long fails requirements set by many CE
products (e.g. time to switch channels on a TV, time to start a video
on a phone).  I really want to start displaying frames absolutely as
soon as I have them, and then calibrate the latency during those first
few frames.  Sure, they won't be perfect, but not only are the users
less likely to notice as the UI is changing around the video, but the
first few seconds of video are unlikely to have meaningful content
requiring strict sync.  And it takes a little while for the human
brain to settle in anyway.

If players have historical data of their own from similar situations,
they can of course guess that the latency will be similar and start
with that bias pre-applied, until they get feedback from the
compositor.
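
To make that concrete, here is a rough sketch of how a player might
track latency. It is only an illustration: the class name, the
exponential moving average and the alpha value are my own assumptions,
not anything in the proposed protocol. The player seeds the estimate
with whatever bias it remembers from similar situations, then corrects
it each time the compositor reports when a frame really hit the screen.

```python
class LatencyEstimator:
    """Hypothetical helper: tracks submission-to-display latency in ms."""

    def __init__(self, initial_guess_ms=0.0, alpha=0.25):
        # initial_guess_ms is the pre-applied bias from earlier sessions;
        # alpha controls how quickly new feedback overrides it.
        self.estimate_ms = initial_guess_ms
        self.alpha = alpha

    def update(self, submitted_ms, displayed_ms):
        """Fold one piece of compositor feedback into the estimate."""
        sample = displayed_ms - submitted_ms
        self.estimate_ms += self.alpha * (sample - self.estimate_ms)
        return self.estimate_ms

    def submit_time(self, target_ms):
        """When to submit so the frame is expected on screen at target_ms."""
        return target_ms - self.estimate_ms
```

With a large alpha the estimate settles within the first few frames,
at the cost of reacting more nervously to a single late frame.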

>> While the presentation loop is running, you could theoretically adjust
>> the gap between requested and realized presentation timestamps. If the
>> gap gets too small and the frames start to miss their intended scanout
>> cycle, you see it in the realized timestamps, and can back off. It is a
>> trial and error, but I'm hoping it will take only a few frames in the
>> beginning to reach a steady state and lock the gap length, after which
>> the predictor in the player should be in sync.
>
> In ALSA (Linux sound) there is a concept of "delay".
> This is a measure of "If I submit the sound sample now, when will it be
> played?" I.e. It will be played in 20ms. The returned value depends on the
> amount of sound samples already queued, but not yet "SoundOut".
> For Video, this would be "If I submit a frame now, which scanout will it be
> displayed on?". Can we get an answer for this question through some API?

Well, you could try, but it depends.  If you have a buffer ready for
presentation (but aren't actually presenting it for some reason), you
could of course force a dummy submission and see how long it takes
through the compositor.  That would at least answer the question of
whether it's going through an overlay or a GPU.  And if it's an
overlay, you could mostly guess from historical data, just as clients
could.  But if it's a GPU, you don't know how long it's actually going
to take to display: how heavily loaded is the GPU?

If it were possible to get an upfront answer in the form of one simple
number, we'd just do that.

>> Yes, it will, if you mean clock domains. There needs to be a way to have
>> all timestamps related to frames in the same clock domain, of course. An
>> idea we also already discussed but didn't mention yet, is the
>> compositor to tell clients which clock domain it uses for all frame
>> timing purposes. Then clients can ask the current time directly from
>> the kernel if they need it, and relate it to the audio clock.
>
> Will the graphics card have its own clock, being used for scanout clocking?
> Can this be read by the user? Can it use that clock to timestamp the
> presented frames?

Sometimes yes, sometimes no - in which case the only useful answer is
for the GPU driver to convert to a common base like CLOCK_MONOTONIC.
In this case, it's either using its own calculations, or just taking
the simple approach of checking the clock when it gets a vblank IRQ.

> One of the problems with PCs is that the graphics card has its own clock,
> the sound card has its own clock, and the kernel has its own clock.
> The software then has to somehow match them all up.
> Embedded systems are generally better and can use the same hardware based
> clocking for graphics card and sound card and kernel clock.

Actually, they're much worse, since the clock domains are absolutely
huge, not necessarily static, and are constantly changing in reaction
to load, thermal constraints, etc.

Either way, I can't remember seeing one with the GPU and display
controller sharing a clock, let alone display controller and audio.

>> It is a good question, what does the requested presentation time
>> actually mean. Should the system guarantee, that the frame is already
>> on screen at that time, or that it is not shown before that time?
>> Luckily, the realized presentation time can be defined exactly, e.g.
>> the time when the hardware starts to scan out the new image. The
>> realized time is the one that matters, and the specific meaning of
>> requested time is not so important, since a client can adapt. No?
>
> The client can adapt so long as it knows which clock domain each of the
> timestamps are in.

It's a slightly more subtle semantic question than that, though:
whether the extension is proposing 'not before' or 'not after'.
Personally, I'd shoot for the former.
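
As a sketch of what the two readings would mean in practice, assuming
a perfectly regular scanout period and integer microsecond timestamps
that are all in the same clock domain (the function and parameter
names are made up for illustration):

```python
def pick_scanout(requested_us, last_scanout_us, period_us, policy="not_before"):
    """Return the scanout time a frame would land on under each policy."""
    delta = requested_us - last_scanout_us
    if policy == "not_before":
        # First scanout at or after the requested time (ceiling division).
        cycles = -(-delta // period_us)
    elif policy == "not_after":
        # Last scanout at or before the requested time (floor division).
        cycles = delta // period_us
    else:
        raise ValueError("unknown policy: %r" % (policy,))
    return last_scanout_us + cycles * period_us
```

With a 16667us period, a last scanout at 0 and a request for 20000us,
'not before' lands on 33334us and 'not after' on 16667us, so the two
readings differ by a whole frame unless the request falls exactly on a
scanout.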

>> Some GStreamer people have told me that they actually prefer to use the
>> audio clock as the master clock, and infer everything else from that.
>> Anyway, yes, clock domains are important.
>
> This method is not always best. Some sound cards are very bad at providing
> an audio clock.
> xine provides several different sync/clock methods so the user can select
> the best one for their hardware.

Either way, it doesn't affect the design of the Wayland extension.

>> I would like to know if I have understood something wrong in this whole
>> video thing.
>
> I think how wayland can help provide the "deterministic" angle is not yet
> clear to me.

Nothing is deterministic.  Is your kernel scheduling your application
in time? And the compositor? Is your application actually in memory,
or are you frantically paging everything in? Has someone just blown
out the cache? Is your data actually streaming OK, or are you waiting
on I/O? Has someone just submitted a huge job to the GPU (which can be
as simple as texture setup) which is going to hold up all its other
work? Did
a notification with an overlay just trigger and demote you from an
overlay to the GPU? Did the display controller hiccup for a frame
during a voltage/clock change? Did your constant frame submission
earlier run the system so hot it's throttled its clock down so far you
can no longer meet your frame deadlines?

There's no plan to provide an absolute hard ironclad guarantee on
scheduling, because all of these things are out of our control, and
most of them are fundamentally unsolvable.  What we're suggesting is a
two-part solution: before submission, allow applications to queue
(multiple) frames along with a list of target times which the
compositor will attempt (not guarantee) to meet; after submission,
tell the application when the frame was actually displayed, and allow
them to adjust their submission times accordingly.
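
A toy model of that two-part flow might look like the following. None
of this is the actual protocol: the class, the method names, the 'not
before' reading of target times and the choice to drop superseded
frames are all assumptions made purely for illustration.

```python
from collections import deque


class ToyCompositor:
    """Hypothetical model: queued frames in, presentation feedback out."""

    def __init__(self):
        # (target_time, frame_id) pairs, assumed queued in ascending
        # target-time order.
        self.queue = deque()

    def queue_frame(self, frame_id, target_time):
        """Part one: the client queues a frame with a target time."""
        self.queue.append((target_time, frame_id))

    def scanout(self, now):
        """Part two: at each scanout, report what happened to due frames.

        Returns a list of (frame_id, target_time, presented_at) tuples;
        presented_at is None for frames that were superseded by a newer
        frame that was also due, and so were never shown.
        """
        due = []
        while self.queue and self.queue[0][0] <= now:
            due.append(self.queue.popleft())
        events = [(frame_id, target, None) for target, frame_id in due[:-1]]
        if due:
            target, frame_id = due[-1]
            events.append((frame_id, target, now))
        return events
```

The client would compare presented_at against target_time in each
event and shift its later target times accordingly; a None tells it a
frame was skipped altogether.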

Cheers,
Daniel