[PATCH weston 1/1] compositor: Abort on bad page flip timestamps

Fri Nov 7 01:36:30 PST 2014

On Fri, 07 Nov 2014 07:04:16 +0100
Mario Kleiner <mario.kleiner.de at gmail.com> wrote:

> On 06/11/14 17:42, Pekka Paalanen wrote:
> > On Thu, 06 Nov 2014 16:08:56 +0900
> > Michel Dänzer <michel at daenzer.net> wrote:
> >
> >> On 06.11.2014 03:06, Frederic Plourde wrote:
> >>> Many features, like animations, hardly depend on page flip timestamps
> >>> to work properly, but some DRM drivers do not correctly support page flip
> >>> timestamps (or not at all) and in that case, things start to go wrong.
> >>>
> >>> This patch adds sanity check to weston_output_finish_frame. By solely
> >>> verifying that page flip timestamps are monotonically increasing, we
> >>> make sure that :
> >>>
> >>> 1) Underlying driver is not throwing zeroed-out timestamp series at us.
> >>> 2) We have not mistakenly jumped backwards because of integer overflow.
> >>>
> >>> If a pathological case is detected, we gracefully exit Weston
> >>> with an appropriate exit code to help developers debug their drivers.
> >>
> >> That seems a bit harsh. IIRC, zero can be returned for the timestamp
> >> intermittently if no accurate timestamp value can be determined, e.g.
> >> because the CRTC is disabled. At the very least, I'd recommend
> >> double-checking this with Mario Kleiner (Cc'd) and the dri-devel mailing
> >> list.
> >
> > Can that really happen if we are not doing stupid things like
> > attempting to flip on a disabled crtc or output?
> >
> 
> I don't think it could happen for non-stupid regular use and would 
> indicate a driver bug.
> 
> > Or can it happen, if we schedule a flip, then disable the crtc
> > before the flip completes? Or maybe when an output is hot-unplugged?
> >
> 
> On kernels <= 3.17 disabling vblank irqs will clear the cached timestamp 
> to zero, so a waitvblank ioctl() for a pure query of msc/ust could 
> return zero as a signal of "invalid/undefined timestamp" on some drivers 
> under some circumstances. Basically a -EAGAIN error code.
> 
> But that shouldn't ever happen for kms-pageflip completion timestamps, 
> because vblank irqs don't get disabled while a pageflip is pending, as 
> the pending pageflip keeps the vblank reference count > 0.
> 
> There are also some new patches into Linux 3.18rc which should prevent 
> returning zero timestamps even for pure waitvblank ioctl() queries if 
> vblank irqs get disabled for one reason or the other, cfe.
> 
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/gpu/drm/drm_irq.c?id=844b03f27739135fe1fed2fef06da0ffc4c7a081
> 
> kms drivers should usually use the drm_send_vblank_event() helper in 
> their pageflip completion routine. That helper will ideally send out the 
> cached vblank count and high precision timestamps which were computed 
> during the most recent vblank interrupt for the crtc that pageflipped. 
> There's also some fallback there which, if no crtc is specified (crtc == 
> -1), will simply return a msc of zero and the current system time as 
> timestamp. The fallback is for gpus with weird flip completion 
> behaviour. Currently nouveau uses it for old < NV-50 gpu's where we 
> couldn't find a better way to make pageflip completion behave 
> properly/reliably due to some hardware weirdness.
> 
> > Is zero a special timestamp that simply cannot be produced during
> > normal operations, like due to clock wrap-around?
> >
> 
> The timestamps are CLOCK_MONOTONIC uint32 seconds.microseconds since 
> bootup, so a wraparound would ony happen after 2^32 seconds or 136 
> years, so normal operation shouldn't cause a zero timestamp.
> 
> So i think observing zero pageflip timestamps would be a sign that the 
> kms-driver needs fixing.

Cool, thank you for that. So zero timestamp is basically a driver bug.

But, is the assumption of monotonic timestamps similarly valid? Should
we consider getting the same timestamp more than once from a page-flip
completion event to be a driver bug or not? I'm not quite sure as you
mentioned the sending of cached timestamps.

If we still feel that exiting Weston on the first timestamp failure
(zero or not increasing) is too harsh, we could simply log a warning
and add again some heuristics on when things are too far wrong to
continue.

E.g. getting a single timestamp failure during the first 5 seconds of
compositor uptime warrants an exit, but later probably not. We want to
catch cases where the compositor cannot reliably work to begin with,
but killing the user's session much later is pretty blunt if it already
worked somewhat.

Would that be good?

I would prefer to not fall back to clock_gettime() for timestamps on
DRM, so that there would be more pressure to fix the kernel drivers if
any are broken.

Thanks,
pq