[RFC] wl_surface video protocol extension

Fri Oct 18 09:30:16 CEST 2013

Hi James,

I'm not a video expert, so excuse me if my view is too simplistic. What
I write is what I'd call common sense, and I'm sure the literature has
specialized algorithms for exactly the things we need, like
prediction.

But I don't want to tie the protocol to any particular algorithm. The
protocol should be as simple as possible while carrying all the
hard facts. Below I argue that I do not see any facts we could provide
other than the ones already mentioned, and I believe those facts
are sufficient for a sophisticated algorithm to estimate everything
necessary in a client.

On Thu, 17 Oct 2013 21:34:02 +0100
James Courtier-Dutton <james.dutton at gmail.com> wrote:

> The key point I was trying to make is that the media player needs to be
> able to predict when frames will be displayed.

Yes, the *player* needs to be able to predict, and we aim to give it the
historical information to do exactly that.

> All its prediction calculations will be based on a fixed scanout rate.

Why only fixed scanout rate? Why can't you dynamically adapt to the
latency between requested and realized presentation time, over time?

There are lots of ways to predict based on past measurements while
taking uncertainty into account. A Kalman filter is the first one that
comes to my mind, but I guess there are better methods for
deadline/cycle kinds of scenarios.
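
As a rough illustration of predicting from past measurements, here is a
minimal sketch of a scalar Kalman filter that tracks the latency between
requested and realized presentation time. It is not part of any proposed
protocol. The class name, the random-walk model and the noise values are
all assumptions made up for this example.

```python
class LatencyEstimator:
    """Scalar Kalman filter for (realized - requested) presentation latency.

    Hypothetical helper: the latency is modelled as a slowly drifting
    constant (random walk), which is an assumption, not a measured fact.
    """

    def __init__(self, initial_latency, initial_variance=1e6,
                 process_noise=1.0, measurement_noise=25.0):
        self.latency = float(initial_latency)    # current estimate
        self.variance = float(initial_variance)  # uncertainty of the estimate
        self.process_noise = process_noise       # how fast latency may drift
        self.measurement_noise = measurement_noise  # jitter of one sample

    def update(self, requested, realized):
        """Feed one (requested, realized) timestamp pair; return the estimate."""
        # Predict step: the estimate stays put, the uncertainty grows.
        self.variance += self.process_noise
        # Update step: blend in the new measurement by the Kalman gain.
        measured = realized - requested
        gain = self.variance / (self.variance + self.measurement_noise)
        self.latency += gain * (measured - self.latency)
        self.variance *= 1.0 - gain
        return self.latency
```

A player would call update() once per presentation feedback and subtract
the returned estimate from the time it wants the frame to hit the
screen. All timestamps must be in the same unit and clock domain.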

> If the scanout rate changes, all bets are off, and we have to start again
> from scratch trying to predict the scanouts again.
> So, with regards to the scanout rate changing, we will need an event to
> tell us that so we can redo the prediction calculations.

Yes, we have thought about such an event, and it seems a good idea.

> So, if the scanouts are T=100 and then T=200, we would have to have somehow
> predicted that, and as a result set the
> buffer1, T=100
> buffer2, T=200
> or in actual fact, to allow for inaccuracy the media player would set:
> buffer1, T = 80
> buffer2, T = 180
> This will then ensure that buffer1 would be displayed at T=100, and buffer2
> would be displayed at T=200.
> 
> But we would only know to set T=80 for buffer1 if we had been able to
> predict when the actual frame will be.
> So, I guess I am still asking for an API that would return these predicted
> values from the graphics driver.

I don't think drivers predict anything, do they? Right now, drivers
only tell you what happened (the presentation timestamp), if even that.
AFAIK, most popular drivers do not currently even have queues. You can
schedule *one* page flip at a time, and you cannot even cancel that
later if you wanted to, from what I've heard.

In the future I'm hoping we will have drivers where you can actually
queue buffers for hardware overlays, so the buffer queue does not need
to live completely in userspace at the mercy of scheduling hiccups
and system load. But even then I can't imagine drivers doing any
predictions; they would just try to meet requested presentation times.
If drivers predict, they will do it internally only to make sure the
requested presentation time is realized. *How* it is realized is a good
question (never show before the requested time? or make sure it is on
screen at the requested time? or...)

The compositor is similar to drivers in my opinion. If it tries to
predict something, it will only do that to be able to meet the
requested presentation times. Who knows, maybe predictions in the
compositor and drivers would actually make prediction in the clients
worse by introducing more variables.

In any case, I do not see the compositor being responsible for
predicting times for clients' buffers. In my view, it is the clients'
responsibility to predict in any way they want, based on hard facts of
the past given by the compositor and drivers. The client (player) is the
only one knowing everything relevant, like the original video frame rate
and audio position, and whether it prefers showing frames early or
late, when to skip, etc.

If you are concerned about video start, where you do not yet have
measurements available to predict from, then yeah, all you have is the
current monitor framerate. But after the first frame has been
presented, you get the first presentation timestamp, which should allow
you to adapt to the phase of the scanout cycle. Right?

And of course repeat that "coarse calibration" every time scanout rate
(monitor refresh rate) changes.
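
A minimal sketch of that coarse calibration, assuming the client knows
the refresh period and has one realized presentation timestamp in the
same clock domain. The function names are hypothetical, and integer
timestamps (e.g. nanoseconds) are assumed so the arithmetic stays exact.

```python
def next_scanout(last_presented, refresh_period, now):
    """Earliest predicted scanout strictly after `now`.

    `last_presented` anchors the phase of the scanout cycle; every later
    scanout is assumed to be a whole number of periods away from it.
    """
    cycles = (now - last_presented) // refresh_period + 1
    return last_presented + cycles * refresh_period


def upcoming_scanouts(last_presented, refresh_period, now, count):
    """Predicted times of the next `count` scanouts after `now`."""
    first = next_scanout(last_presented, refresh_period, now)
    return [first + i * refresh_period for i in range(count)]
```

With the numbers from the quoted example (a scanout realized at T=100,
period 100), calling next_scanout(100, 100, 150) predicts T=200. When
the refresh rate changes, the client would throw away last_presented and
start over from the next realized timestamp.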

While the presentation loop is running, you could theoretically adjust
the gap between requested and realized presentation timestamps. If the
gap gets too small and the frames start to miss their intended scanout
cycle, you see it in the realized timestamps, and can back off. It is
trial and error, but I'm hoping it will take only a few frames in the
beginning to reach a steady state and lock the gap length, after which
the predictor in the player should be in sync.
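
The trial and error could look roughly like the sketch below. Here the
gap is the lead time a client subtracts from the scanout it is aiming
for (T=80 requested for a scanout at T=100 is a gap of 20). The class,
the step size and the "back off by two steps, then lock" policy are
assumptions for illustration only; a real player would likely want
something smarter.

```python
class GapTuner:
    """Shrink the lead time until a frame misses, then back off and lock."""

    def __init__(self, gap, step):
        self.gap = gap        # current lead time before the target scanout
        self.step = step      # how much to change the gap per frame
        self.locked = False   # True once a safe gap has been found

    def feedback(self, target, realized, refresh_period):
        """Process one realized timestamp; return the gap for the next frame.

        `target` is the scanout the frame was aimed at, `realized` is the
        presentation timestamp reported back for it.
        """
        # A frame counts as missed if it landed at least half a period late,
        # i.e. it was shown on a later scanout cycle than intended.
        missed = 2 * (realized - target) >= refresh_period
        if missed:
            self.gap += 2 * self.step   # back off past the last good gap
            self.locked = True
        elif not self.locked:
            self.gap = max(self.step, self.gap - self.step)
        return self.gap
```

Starting from gap=20 and step=5 with a period of 100, two hits shrink
the gap to 10, a miss then pushes it back to 20 and locks it there. A
refresh rate change would be a reason to unlock and search again.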

> Also, If buffer2 has T=200. What is the 200 being compared to in order to
> decide to display the buffer or not?
> This will not be the timestamp on the presentation callback will it?

Yes, it will, if you mean clock domains. There needs to be a way to have
all timestamps related to frames in the same clock domain, of course. An
idea we have already discussed but not yet mentioned is for the
compositor to tell clients which clock domain it uses for all frame
timing purposes. Then clients can ask the kernel directly for the
current time if they need it, and relate it to the audio clock.
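
On the client side that could be as simple as the sketch below. How the
compositor would announce its clock is not defined anywhere yet, so
PRESENTATION_CLOCK is a stand-in, and CLOCK_MONOTONIC is only an
assumed example value. The audio mapping assumes both clocks tick at the
same rate, which ignores drift. time.clock_gettime_ns() is the Python
wrapper for the POSIX clock_gettime() call and is Unix-only.

```python
import time

# Hypothetical: the clock id the compositor would advertise to clients.
PRESENTATION_CLOCK = time.CLOCK_MONOTONIC


def now_ns(clock_id=PRESENTATION_CLOCK):
    """Current time in the presentation clock domain, in nanoseconds."""
    return time.clock_gettime_ns(clock_id)


def audio_to_presentation(audio_pos_ns, sampled_at_ns, audio_target_ns):
    """Map a position on the audio timeline to the presentation clock.

    `audio_pos_ns` is the audio position that was observed when the
    presentation clock read `sampled_at_ns`. Assumes no drift between
    the two clocks since that sample.
    """
    return sampled_at_ns + (audio_target_ns - audio_pos_ns)
```

A player would sample the audio position together with now_ns(), then
convert each video frame's audio-timeline deadline into the time it
requests for presentation.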

What the requested presentation time actually means is a good
question. Should the system guarantee that the frame is already
on screen at that time, or that it is not shown before that time?
Luckily, the realized presentation time can be defined exactly, e.g.
as the time when the hardware starts to scan out the new image. The
realized time is the one that matters, and the specific meaning of
the requested time is not so important, since a client can adapt. No?

> There is also the matter of clock sync. Is T=100 referenced to the system
> monotonic clock or some other clock.
> Video and Audio sync is not achieved by comparing Video and Audio time
> stamps.
> You have a global system monotonic clock, and sync Video to the system
> clock, and sync Audio to the system clock.
> A good side effect of this is that Audio and Video are then in sync.
> The advantage of this syncing to the system monotonic clock is that you can
> then run Video and Audio in separate thread, different CPUs, whenever, but
> they will always be in sync.

Some GStreamer people have told me that they actually prefer to use the
audio clock as the master clock, and infer everything else from that.
Anyway, yes, clock domains are important.

I would like to know if I have misunderstood something in this whole
video thing.

Thanks,
pq

PS. please use reply-to-all.