[BUG] etnaviv: broken timeouts

Arnd Bergmann arnd at arndb.de
Mon Nov 6 12:42:10 UTC 2017


On Mon, Nov 6, 2017 at 11:35 AM, Lucas Stach <l.stach at pengutronix.de> wrote:
> +CC Arnd, who seems to care about timeout and year 2038 issues.
>
> Am Freitag, den 03.11.2017, 18:41 +0000 schrieb Russell King - ARM
> Linux:
>> On Fri, Nov 03, 2017 at 07:07:44PM +0100, Christian Gmeiner wrote:
>> > If you send this change as patch you can add my
>> > Reviewed-by: Christian Gmeiner <christian.gmeiner at gmail.com>
>> >
>> > Hopefully Lucas will pick it up your patch soon.
>>
>> I would really like to hear from Thomas on it due to the problem
>> that remains - that is that timespec_compare() will return negative
>> if timeout->tv_sec is numerically less than ts.tv_sec.
>>
>> So, when we get close to 2037, and our timeout overflows the wrap,
>> timeout->tv_sec becomes negative, and is immediately numerically
>> less than ts.tv_sec, so this function starts failing.
>>
>> It's not just "current time close to 2038" - if we're reading the
>> current time and adding (eg) (u32)~0 milliseconds to it (as my
>> DDX driver does for "infinite") we'll start seeing problems about
>> 50 days prior to the roll over.
>>
>> Using the 64-bit math version in the kernel may help push the date
>> up to the roll-over date, but doesn't solve the problem.
>>
>> I think we have a few years before we need to be concerned about it,
>> but I'd like to have a solution.
>>
>> Kind'a feels like we need some way to ask the kernel for an
>> drm_etnaviv_timespec that is (u64) X nsec into the future - that'd
>> also eliminate the "which clock_gettime() clockid should we be using"
>> question.
>
> There is a bunch of drivers using absolute timeouts and we probably
> need to deal with 32bit systems for a while still. I would really like
> to see a solution for this which isn't specific to etnaviv, but adds a
> core helper for those situations which has the blessing of the clock
> guys. Hopefully Arnd or Thomas can provide some advise.

As far as I can tell, the driver does not suffer from an overflow, since
all the timeouts that get passed from user space are in
CLOCK_MONOTONIC time, which never overflows a 32-bit seconds
value unless you run the kernel uninterrupted for 136 years without
rebooting.

There is also no problem with the user space ABI, since the
drm_etnaviv_wait_fence structure uses fixed-length fields
rather than 'time_t' or 'struct timespec' that will get redefined by
glibc to use 64-bit timestamps in the next few years. (thanks
a lot to whoever was paying attention while defining the ioctls!)

However, there are a few other problems with the code in question:

- If I understand the earlier comments right, it makes the
  assumption that 'jiffies' and  'CLOCK_MONOTONIC_RAW'  have
  the same time base. Generally speaking, 'jiffies' is an implementation
  detail of the kernel, and user space interfaces should not rely on that.
  If user space actually uses CLOCK_MONOTONIC, it would be
  subject to drift, for CLOCK_REALTIME we would never expire,
  and there is the problem of the INITIAL_JIFFIES constant that
  I think puts CLOCK_MONOTONIC_RAW 5 minutes ahead of
  jiffies.
  I think we should replace references to the global 'jiffies' variable
  in etnaviv with ktime_get_*() calls to make it safer.  I don't actually
  recall if we apply the INITIAL_JIFFIES to CLOCK_MONTONIC,
  if not, then the timeouts are always 5 minutes off...

- The etnaviv_timeout_to_jiffies() function doesn't work if someone
  passes a timeout that is more than LONG_MAX jiffies in the
  future, e.g. if someone was to pass tv_sec=S64_MAX from
  user space, there is a 50% chance that it returns 0, while the
  other 50% of time we get a random 32-bit number of jiffies
  to wait for, depending on how long the system has been running.
  The expected result would be to wait indefinitely.

- Using 'timespec' in drivers is generally a bad idea because of the
  overflow that Russell mentioned. We have already converted
  hundreds of drivers to use either 'timespec64' or ktime_t instead,
  which is generally a safer option. I hope to get rid of all uses of
  timespec in the kernel at some point, and I see that as a prerequisite
  for shipping products that want to live beyond 2038.

- less of a problem but still relevant here, my recommendation is
  to define new user space interfaces for timestamps and timeouts
  only in terms of 64-bit CLOCK_MONOTONIC nanoseconds,
  anything else mainly adds complexity but doesn't help much
  otherwise.

I have prototyped a patch to address all of the above, will send
that later today after doing a bit of build testing. I hope you can
give it some testing, and clarify what the timebase is that
user space expects to put into the timeout (MONOTONIC,
MONOTONIC_RAW, or REALTIME).

       Arnd


More information about the etnaviv mailing list