[PATCH 2/6] compositor-drm: Allow instant start of repaint loop.

Mon Apr 13 03:57:31 PDT 2015

On Sun, 12 Apr 2015 23:56:57 +0200
Mario Kleiner <mario.kleiner.de at gmail.com> wrote:

> On 04/08/2015 11:14 AM, Pekka Paalanen wrote:
> > On Wed, 08 Apr 2015 03:12:25 +0200
> > Mario Kleiner <mario.kleiner.de at gmail.com> wrote:
> >
> >> On 04/07/2015 09:34 AM, Pekka Paalanen wrote:
> >>> On Sat, 04 Apr 2015 19:45:10 +0200
> >>> Mario Kleiner <mario.kleiner.de at gmail.com> wrote:
> >>>
> >>>> On 04/02/2015 01:37 PM, Pekka Paalanen wrote:
> >>>>> On Thu,  2 Apr 2015 07:10:50 +0200
> >>>>> Mario Kleiner <mario.kleiner.de at gmail.com> wrote:
> >>>>>
> >>>>>> drm_output_start_repaint_loop() incurred a delay of
> >>>>>> one refresh cycle by using a no-op page-flip to get
> >>>>>> an accurate vblank timestamp as reference. This causes
> >>>>>> unwanted lag whenever Weston exited its repaint loop, e.g.,
> >>>>>> whenever an application wants to repaint with less than
> >>>>>> full video refresh rate but still minimum lag.
> >>>>>>
> >>>>>> Try to use the drmWaitVblank ioctl to get a proper
> >>>>>> timestamp instantaneously without lag. If that does
> >>>>>> not work, fall back to the old method of idle page-flip.
> >>>>>>
> >>>>>> This optimization should work on any drm/kms driver
> >>>>>> which supports high precision vblank timestamping.
> >>>>>> As of Linux 4.0 these would be intel, radeon and
> >>>>>> nouveau on all supported gpu's.
> >>>>>>
> >>>>>> Signed-off-by: Mario Kleiner <mario.kleiner.de at gmail.com>
> >>>>>> ---
> >>>>>>     src/compositor-drm.c | 41 ++++++++++++++++++++++++++++++++++++++++-
> >>>>>>     1 file changed, 40 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/src/compositor-drm.c b/src/compositor-drm.c
> >>>>>> index fe59811..4a7baa1 100644
> >>>>>> --- a/src/compositor-drm.c
> >>>>>> +++ b/src/compositor-drm.c
> >>>>>> @@ -225,6 +225,9 @@ static const char default_seat[] = "seat0";
> >>>>>>     static void
> >>>>>>     drm_output_set_cursor(struct drm_output *output);
> >>>>>>
> >>>>>> +static void
> >>>>>> +drm_output_update_msc(struct drm_output *output, unsigned int seq);
> >>>>>> +
> >>>>>>     static int
> >>>>>>     drm_sprite_crtc_supported(struct drm_output *output, uint32_t supported)
> >>>>>>     {
> >>>>>> @@ -704,6 +707,12 @@ err_pageflip:
> >>>>>>     	return -1;
> >>>>>>     }
> >>>>>>
> >>>>>> +static int64_t
> >>>>>> +timespec_to_nsec(const struct timespec *a)
> >>>>>> +{
> >>>>>> +	return (int64_t)a->tv_sec * 1000000000000LL + a->tv_nsec;
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> are you sure this cannot overflow? I think tv_sec could be a int64_t.
> >>>>>
> >>>>> That no-one gets the idea of initializing the clock in the kernel to
> >>>>> some huge value just to fish out this kind of overflows?
> >>>>>
> >>>>
> >>>> I almost literally copied it from compositor.c, your repaint delay
> >>>> handling. I think it can't overflow the way it is used here to only
> >>>> remap from kernel timestamps. The kernel delivers its timestamps in
> >>>> "struct timeval", so tv_sec (== long int) from kernel could be a 64-bit
> >>>> int on 64-Bit kernels, but the values put into it are always derived
> >>>> from ktime_t which is "nanoseconds stored in 64 bit signed integers". So
> >>>> at least with the current kernel implementation it can't overflow. And
> >>>> 32-Bit kernels would hit such overflows before user space as their
> >>>> struct timeval seems to have 32 bit int for tv_sec. So i think with the
> >>>> current kernel ioctl interfaces it should be safe.
> >>>
> >>> Hi,
> >>>
> >>> that's nice to know about the kernel. Isn't tv_sec actually a time_t? I
> >>> once or twice tried to find out what was guaranteed about time_t and I
> >>> couldn't really find much, except that it may be even 64-bit or it
> >>> maybe be less.
> >>>
> >>
> >> Yes. Trying to follow the userspace headers creates headache due to so
> >> many indirections. Looking at the kernel side of timestamps returned by
> >> the DRM ends up with "long" on 32-Bit or 64-Bit kernels = int32_t on
> >> 32-Bit kernel, int64_t on 64-Bit kernel. For the x32 interface it is
> >> defined as "long long" = int64_t with the comment "should match the size
> >> of the regular 64-Bit kernel".
> >>
> >> So tv_sec could be 32-Bit or 64-Bit. But the DRM does all its timestamp
> >> calculations internally as ktime_t which is always Nanoseconds inside
> >> signed int64_t and only converts into struct timeval at the end. That's
> >> why i think it can't overflow because the "kernel payload" we get out of
> >> our struct timespec can't be more than int64_t nanoseconds. The kernel
> >> would overflow things internally before they would reach us, although
> >> with CLOCK_MONOTONIC as timebase that would require an uptime of almost
> >> 300 years, so no immediate danger here.
> >>
> >>> It's hard for me to draw the line between trusting only specs and
> >>> trusting the implementation I have at hand.
> >>>
> >>> Btw. I always try to do a timespec_to_nsec() on a timespec that already
> >>> is a difference, never an absolute time. Below you are subtracting
> >>> nsecs instead of timespecs, so that got me wondering.
> >>>
> >>
> >> I could change it if you want, but then i'd need to duplicate more
> >> helper functions in compositor-drm.c and as you said you'll have some
> >> shared helpers prepared soon anyway i would just wait until those are
> >> merged.
> >
> > Yeah, you don't have to do that. :-)
> > I'm not sure when I'd get to adding those functions, but if/when I do,
> > it's up to me to convert the then current code, so no worries.
> >
> >
> >>>>
> >>>>>> +}
> >>>>>> +
> >>>>>>     static void
> >>>>>>     drm_output_start_repaint_loop(struct weston_output *output_base)
> >>>>>>     {
> >>>>>> @@ -711,7 +720,13 @@ drm_output_start_repaint_loop(struct weston_output *output_base)
> >>>>>>     	struct drm_compositor *compositor = (struct drm_compositor *)
> >>>>>>     		output_base->compositor;
> >>>>>>     	uint32_t fb_id;
> >>>>>> -	struct timespec ts;
> >>>>>> +	struct timespec ts, tnow;
> >>>>>> +	int ret;
> >>>>>> +	drmVBlank vbl = {
> >>>>>> +		.request.type = DRM_VBLANK_RELATIVE,
> >>>>>> +		.request.sequence = 0,
> >>>>>> +		.request.signal = 0,
> >>>>>> +	};
> >>>>>>
> >>>>>>     	if (output->destroy_pending)
> >>>>>>     		return;
> >>>>>> @@ -721,6 +736,30 @@ drm_output_start_repaint_loop(struct weston_output *output_base)
> >>>>>>     		goto finish_frame;
> >>>>>>     	}
> >>>>>>
> >>>>>> +	/* Try to get current msc and timestamp via instant query */
> >>>>>> +	vbl.request.type |= drm_waitvblank_pipe(output);
> >>>>>> +	ret = drmWaitVBlank(compositor->drm.fd, &vbl);
> >>>
> >>> Just to be sure I understood right: this call attempts to get the exact
> >>> time the latest vblank ended?
> >>>
> >>
> >> Yes. In the kernel it accesses the same (vblank counter, timestamp) info
> >> that is used for pageflip completion events for a given vblank,
> >> delivering the same timestamp it would have provided for a pageflip
> >> completing during that vblank, so it is the same as if you had queued a
> >> pageflip a frame earlier and then received the pageflip completion event
> >> for that flip.
> >
> > Excellent.
> >
> >> So this is just a faster way to get a vblank reference timestamp for
> >> restarting the repaint loop, useable for scheduling new repaints for the
> >> next vblank, without need to wait for a frame - and thereby delaying a
> >> frame.
> >>
> >>>>>> +
> >>>>>> +	/* Error return or zero timestamp means failure to get valid timestamp */
> >>>>>> +	if ((ret == 0) && (vbl.reply.tval_sec > 0)) {
> >>>>>
> >>>>> No need to check tval_usec?
> >>>>
> >>>> I don't think it is necessary for this quick crude early-out check and
> >>>> would complicate the check to be almost as expensive as the check it is
> >>>> trying to skip.
> >>>
> >>> I don't think checking one more value is going to be noticeable...
> >>>
> >>
> >> You're right, i was thinking wrong there. I'll change it for clarity.
> >>
> >>>> On kernels <= Linux 3.16 a zero time value tval_sec=tval_usec=0 meant
> >>>> "invalid timestamp, please retry again later". This would happen on kms
> >>>> drivers which don't support high precision instant vblank timestamping
> >>>> if vblank interrupts were disabled (currently all but intel, radeon and
> >>>> nouveau, so essentially all SoC's), because the only way to get a valid
> >>>> timestamp there is to wait for a vblank irq and collect the timestamp
> >>>> there, and we didn't want the "non-blocking" bits of drmWaitVblank ioctl
> >>>> to block for up to a whole video refresh cycle.
> >
> > This [+].
> >
> >>>>
> >>>> The tval_sec > 0 check detects these invalid timestamps and immediately
> >>>> skips to the pageflip fallback below. It would also skip to fallback if
> >>>> weston would somehow manage to run within the first second of kernel
> >>>> boot, but that's very unlikely and all it would cause is 1 extra frame
> >>>> of lag during that second.
> >>>
> >>> Makes perfect sense, yeah, the corner case being essentially irrelevant
> >>> indeed.
> >>>
> >>>> The reason we need the check for stale timestamps at all is because we
> >>>> accidentally removed that "timestamp 0 == invalid ts" signalling during
> >>>> some improvements to the vblank handling in kernel 3.17 which were
> >>>> needed for the atomic modesetting stuff. At the moment timestamps will
> >>>> be always correct and instantaneous on intel/radeon/nouveau-kms, but the
> >>>> kernel would deliver stale timestamps for <= 1 video refresh duration
> >>>> after vblank irqs were turned off and on again on the various SoC
> >>>> display drivers.
> >
> > This [+].
> >
> >>>
> >>> Oh, I didn't know that. Interesting.
> >>>
> >>>> I'll try to prepare a kernel drm patch to fix that signalling, but now
> >>>> we have to deal with at least some broken kernels anyway.
> >>>
> >>> Agreed, need to deal with them for a long time, I believe.
> >>>
> >>>> I think it would be a good idea to find ways to implement the
> >>>> instantaneous high precision vblank timestamping also on more kms
> >>>> drivers for SoC's, as it would be not only good for timing precision,
> >>>> but also for lag reduction and power saving. For any SoC that allows to
> >>>> query the current crtc scanout position via some register this could be
> >>>> easily done using the same shared drm helper routines we already use for
> >>>> intel/radeon/nouveau. Or if they had some hardware vblank timestamp
> >>>> register that could be mapped to CLOCK_MONOTONIC time.
> >>>>
> >>>>>> +		ts.tv_sec = vbl.reply.tval_sec;
> >>>>>> +		ts.tv_nsec = vbl.reply.tval_usec * 1000;
> >>>>>> +
> >>>>>> +		/* Valid timestamp for most recent vblank - not stale? Stale ts could
> >>>>>> +		 * happen on Linux 3.17+, so make sure it is not older than 1 refresh
> >>>>>> +		 * duration since now.
> >>>>>> +		 */
> >>>>>> +		weston_compositor_read_presentation_clock(&compositor->base, &tnow);
> >>>>>> +		if (timespec_to_nsec(&tnow) - timespec_to_nsec(&ts) <
> >>>>>
> >>>>> I'd write this difference in terms of timespec first and convert to
> >>>>> nsec afterwards, but I can do that also as a follow-up when I happen to
> >>>>> visit this part the next time. I have quite some timespec helpers lying
> >>>>> around that I should collect into one place at some point.
> >>>>>
> >>>>
> >>>> If you wanted to introduce some proper timespec helpers anyway then i'd
> >>>> leave it to you when you get around doing that. I went for the easy
> >>>> route here because i didn't want to duplicate too much helper code from
> >>>> other places like compositor.c or open-code this stuff just for one
> >>>> single check.
> >>>
> >>> Cool, sure.
> >>>
> >>>>>> +			(1000000000000LL / output_base->current_mode->refresh)) {
> >>>>>> +			drm_output_update_msc(output, vbl.reply.sequence);
> >>>>>> +			weston_output_finish_frame(output_base, &ts,
> >>>>>> +									   PRESENTATION_FEEDBACK_INVALID);
> >>>
> >>> (*)
> >>>
> >>> Hmm... could this cause weston_output_finish_frame() to be called twice
> >>> with the same timestamp? Up to now I have considered that to be a bug.
> >>
> >> How could that happen? My understanding of the code so far is that
> >> either the repaint loop was active and a output repaint was needed and
> >> therefore a pageflip was queued for a given vblank - and the repaint
> >> loop will be kept active for one more frame - and then pageflip
> >> completion will call weston_output_finish_frame() with the timestamp of
> >> that vblank of completion. Or the repaint loop is exited due to
> >> inactivity for at least 1 frame, in which case there wasn't any pageflip
> >> completing for the last (most recent) vblank, so our query in start
> >> repaint loop will return the so far unique timestamp of that so far
> >> untouched "idle" vblank, so our call to weston_output_finish_frame()
> >> will have a unique timestamp.
> >>
> >> Do i miss some other case?
> >
> > I was thinking of the following:
> >
> > 1. Previously submitted page flip completes,
> >     weston_output_finish_frame() gets called on vblank N.
> >
> > 2. Repaint delay timer starts.
> >
> > 3. Repaint delay timer fires. There are no repaints scheduled, so the
> >     repaint loop deactivates (does not call weston_output_repaint()
> >     anymore at this time.)
> >
> > 4. A client posts new damage during the 7 millisecond window (repaint
> >     window) that exists between step 3 and vblank N+1.
> >
> > 5. Client damage causes Weston to schedule repaint. Since repaint loop
> >     is inactive, start_repaint_loop() gets called.
> >
> > 6. Start_repaint_loop() is called before vblank N+1, which causes
> >     drmWaitVblank to return the timestamp for vblank N.
> >
> > 7. weston_output_finish_frame() gets called with a timestamp for vblank
> >     N a second time (first time in step 1.)
> >
> > Even if this happens, I don't think it's going to be a problem really,
> > like I said.
> >
> 
> You are right, that sequence could happen.
> 
> > Hmm, but if that is possible, then there is a funny side-effect. If we
> > do not force waiting a vblank before repainting, a client that is a
> > little too late might trigger a repaint and still hit the very next
> > vblank. But, if there are multiple such clients, their timings (hitting
> > the very next or the one ofter vblank) would become random depending on
> > who gets to spin the compositor repaint function first.
> >
> > I think that might be a problem, causing timing sensitive clients to
> > fight each other without being able to properly lock on to the
> > compositor repaint cycle.
> >
> > I'm not sure how to fix that while still having the opportunity to do
> > quick repaint activation (without waiting for an extra vblank first)
> > when there is no such race - after an idle time of more than one frame
> > period.
> >
> 
> With this extra hunk we could do it:
> 
> > index 4de8fbf..85eabdf 100644
> > --- a/src/compositor.c
> > +++ b/src/compositor.c
> > @@ -2423,6 +2423,13 @@ weston_output_finish_frame(struct weston_output *output,
> >                 msec = 0;
> >         }
> >
> > +       /* Called from restart_repaint_loop and restart happens already after
> > +        * the deadline given by repaint_msec? In that case we delay until
> > +        * the deadline of the next frame, to give clients a more predictable
> > +        * timing of the repaint cycle to lock on. */
> > +       if (presented_flags == PRESENTATION_FEEDBACK_INVALID && msec < 0)
> > +               msec += refresh_nsec / 1000000;
> > +
> >         if (msec < 1)
> >                 output_repaint_timer_handler(output);
> >         else
> 
> PRESENTATION_FEEDBACK_INVALID allows us to detect that 
> weston_output_finish_frame() was called due to a restart of the repaint 
> loop, iow. due to some client posting damage during a vblank when the 
> loop was stopped. The 'msec' tells us if this happens before or after 
> the repaint window deadline. If msec < 0 we know this is the case where 
> racing between multiple such late clients would happen, so we simply 
> setup the repaint timer to delay until the start of the repaint window 
> of the following frame. This should give the same behaviour as before 
> for late clients - instead of the first late one winning the race, all 
> of them get handled later. Early enough clients still benefit from the 
> fast restart of the repaint loop. Works in my tests, so maybe i should 
> add that to my patch?

Yes, it sounds plausible. I would still like a test case in Weston
where we could see with Wesgr what exactly happens in this
situation. I'd also like to see what happens when Weston's own rendering
misses the vblank.

Maybe it would be best if adding this hunk is a follow-up patch.
Another patch would also modify e.g. presentation-shm demo to have a
test mode that triggers this special case. I suppose this could be
achieved with an additional command line switch to pretend that
rendering takes the given number of milliseconds before attach+commit.
This switch would be applicable to all existing test modes, so it would
be an orthogonal parameter. That way we could test the behaviour before
and after the changes.

With that, we could examine how the time taken by a client to draw
affects the compositor behaviour in all the possible client timing
scenarios: frame callback based, presentation feedback based, and from
idle.

> Another issue separate from this patch is if/how we could provide some 
> kind of "unredirect fullscreen window" behaviour like on X11 compositors 
> when only 1 client is displaying on an output? That's something i'd 
> really like to have on top of this for timing sensitive / graphics 
> intense clients. If we know there's only one client displaying via zero 
> copy pageflips and no time for composition es needed, maybe we could 
> drive the output repaint directly by that client without the need for a 
> repaint_msec margin? Or reduce that margin to 1 msec? At the moment with 
> a repaint window of 7 msecs, one has to give up almost half a frame 
> duration of rendering time. On a 120 Hz display that would be most of a 
> refresh cycle.

You mean reduce the window automatically to a very small one, which
mean the compositor would wait almost a full frame period to submit a
KMS flip just in time?

The other option being to submit a KMS flip as soon as a frame comes
from the client and presume the client manages the timings well.

FWIW, that 7 milliseconds is just something I drew from a hat, and it's
configurable, but not dynamic. It has been proposed to make it dynamic,
but we don't yet have all the necessary support in the gfx stack to
know how long composition actually took and to avoid being delayed by
client GPU jobs.

Well, composition is not necessary, when all visible surfaces can be put
into KMS planes. (I think I have a patch somewhere to make Weston avoid
compositing at all if there is no primary damage this frame.) So, we'd
need to know when that is possible, and it is probed during
weston_output_repaint(), in the assign_planes hook.

The problem is that without probing, we don't know if it is possible.
It's not really enough to check that window geometries and stacking
etc. doesn't change, a client could submit a buffer that suddenly is
not scan-out able. Or the KMS could for unrelated reason stop being
able to scan out from all the surfaces it did before.

We could cache the assign_planes probing results, and if WM state
doesn't change, re-use it assuming it still works. But what to do when
it fails and it's too late to composite?

This is a topic for investigation, yes, but it doesn't look easy to me.
I'd probably suggest to postpone it at least after we have a decent
atomic KMS support in Weston and can use all hw overlays and cursor
planes that exist in a unified fashion. I see atomic support as more
important, and if we had the "unredirect" feature in first, it might
make implementing atomic harder.

Things to implement would be:
- detecting when composition is not necessary, carrying it as a hint
- the assign_planes caching
- some framework to invalidate the cache
- attempt to use cached planes info, and fall back to normal repaint if
  it fails
- ???

> The most elegant solution would be the presentation_queue extension to 
> simply tell the compositor when one needs stuff onscreen, instead of 
> trying to lock onto the fixed cycle, which may be different for each 
> wayland compositor implementation, and hoping for the best. What are the 
> plans for moving presentation_queue forward?

The queueing work has been stalled for a good while.
https://bugs.freedesktop.org/show_bug.cgi?id=83092
It depends on a couple of more fundamental bugs to be fixed first, see
the blockers. None of this seems to be a priority at the moment.

Thanks,
pq