[BUG] etnaviv: broken timeouts

Russell King - ARM Linux linux at armlinux.org.uk
Mon Nov 6 15:01:47 UTC 2017


On Mon, Nov 06, 2017 at 03:43:37PM +0100, Arnd Bergmann wrote:
> On Mon, Nov 6, 2017 at 3:17 PM, Arnd Bergmann <arnd at arndb.de> wrote:
> > On Mon, Nov 6, 2017 at 3:04 PM, Russell King - ARM Linux
> > <linux at armlinux.org.uk> wrote:
> >> On Mon, Nov 06, 2017 at 01:42:10PM +0100, Arnd Bergmann wrote:
> >>> On Mon, Nov 6, 2017 at 11:35 AM, Lucas Stach <l.stach at pengutronix.de> wrote:
> >> Could you look at the patches I proposed, which have been tested,
> >> instead of recreating my work?
> >>
> >> If you want to recreate my work, we're going to have to wait another
> >> 50 odd days before we can say whether they work or not, because that's
> >> how long it takes for the current code to go wrong, and I've already
> >> rebooted the platforms.
> >
> > I'll have a look, sure.
> 
> I found your patches in the web archive (the timespec and timespec64
> based ones) now, and they look correct regarding the INITIAL_JIFFIES
> and NTP clock drift issues, as well as the overflow of the jiffies.
> 
> However, the comment you introduce is a bit misleading in the 'timespec'
> version:
> 
> > +/*
> > + * Etnaviv timeouts are specified wrt CLOCK_MONOTONIC, not jiffies.
> > + * We need to calculate the timeout in terms of number of jiffies
> > + * between the specified timeout and the current CLOCK_MONOTONIC time.
> > + * Note: clock_gettime() is 32-bit on 32-bit arch.  Using 64-bit
> > + * timespec math here just means that when a wrap occurs, the
> > + * specified timeout goes into the past and we can't request a
> > + * timeout in the future: IOW, the code breaks.
> > + */
> 
> We already know that we will fix clock_gettime() to return 64-bit
> seconds in the future, and at that point the comment no longer
> makes any sense.
> 
> We also know that CLOCK_MONOTONIC doesn't overflow 32-bit
> interfaces in practice, and we rely on that in lots of places in the
> kernel, but we still use ktime_t or timespec64 with clock_monotonic
> for consistency with the future fixed user space.

If CLOCK_MONOTONIC doesn't overflow the 31-bit value (so doesn't go
negative) we still end up with times in the future via the 64-bit
timeout API, because what userspace will do is something like this:

foo(struct drm_etnaviv_timespec *ts, unsigned ms)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);

	ts->tv_sec = now.tv_sec + timeout / 1000;
	ts->tv_nsec = now.tv_nsec + (timeout % 1000) * 1000 * 1000;
	if (ts->tv_nsec > 1000 * 1000 * 1000) {
		ts->tv_nsec -= 1000 * 1000 * 1000;
		ts->tv_sec += 1;
	}
}

What that code ends up doing is quite subtle, and probably not what
is obvious.  ts->tv_sec is a unsigned 64-bit variable.  now.tv_sec
is a 32-bit signed variable.

What we actually end up with is ts->tv_sec being assigned a 32-bit
unsigned quantity which is the addition of those two values.

This means that as now.tv_sec approaches the 68 year limit, the
timespec remains in the future, and as time progresses, up to 136
years, it would still appear to be correct, but as soon as the
timeout causes the 32-bit addition to wrap things will go wrong
in terms of the number passed via the drm_etnaviv_timespec.

However, if we use the timespec version in the kernel, then the
68 year limit still applies - although tv_sec in drm_etnaviv_timespec
is a positive number, when casted down to a timespec, it becomes a
negative number.  So, things go wrong until the 32-bit ktime_get()
also wraps to a negative number.  This means the failure is temporary
and self-recovers.

With the 64-bit timespec, we maintain the idea of timestamps between
68 and 136 years being in the future, but once the 136 year wrap
occurs, the drm_etnaviv_timespec will permanently be in the past.
In other words, the failure becomes a permanent failure which can
only be recovered by a reboot.

I'm not saying that one or other is better, just pointing out the
effects of the approaches.

Note that the man page for clock_gettime() doesn't define what
happens to CLOCK_MONOTONIC when it gets to the wrap point.  Neither
does the open group spec issue 7.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


More information about the etnaviv mailing list