[BUG] etnaviv: broken timeouts

Russell King - ARM Linux linux at armlinux.org.uk
Mon Oct 23 15:21:57 UTC 2017


On Mon, Oct 23, 2017 at 03:57:39PM +0100, Russell King - ARM Linux wrote:
> On Thu, Aug 31, 2017 at 02:49:27PM +0200, Lucas Stach wrote:
> > Hi Russell,
> > 
> > Am Donnerstag, den 31.08.2017, 12:18 +0100 schrieb Russell King - ARM
> > Linux:
> > > I've just stumbled on a bug related to the way we handle fence
> > > timeouts.
> > > 
> > > For DRM_ETNAVIV_WAIT_FENCE, we have:
> > > 
> > > struct drm_etnaviv_wait_fence {
> > >         __u32 pipe;           /* in */
> > >         __u32 fence;          /* in */
> > >         __u32 flags;          /* in, mask of ETNA_WAIT_x */
> > >         __u32 pad;
> > >         struct drm_etnaviv_timespec timeout;   /* in */
> > > };
> > > 
> > > where timeout is:
> > > 
> > > struct drm_etnaviv_timespec {
> > >         __s64 tv_sec;          /* seconds */
> > >         __s64 tv_nsec;         /* nanoseconds */
> > > };
> > > 
> > > The timeout is with respect to the monotonic clock.  If the timeout is
> > > specified far enough in the future, eg:
> > > 
> > > 9088652.2192296410 now 4793684.242296410
> > > 
> > > then rather than waiting, the function returns almost immediately with
> > > ETIMEDOUT.  The requested timeout is equivalent to (uint32_t)~0
> > > milliseconds.
> > > 
> > > In the kernel, we take the drm_etnaviv_timespec, and stick it into a
> > > struct timespec via the TS() macro.  This gets passed to
> > > etnaviv_gpu_wait_fence_interruptible(), which uses
> > > etnaviv_timeout_to_jiffies() to convert to jiffies.  I suspect that
> > > the conversion to jiffies in timespec_to_jiffies() results in a
> > > jiffy value that time_after() believes to be before the current time,
> > > resulting in ultimately a zero jiffy timeout.
> > > 
> > > Merely stracing the X server, or adding a fprintf() is enough to
> > > avoid the problem.
> > > 
> > > If you hit this problem, you'll see "fence finish failed" in the Xorg
> > > log.
> > > 
> > > I think doing the time_after() dance after converting to jiffies is
> > > wrong: if we're going to have an API that accepts absolute time, then
> > > we should handle times that are beyond the ability for us to schedule
> > > the wait correctly.
> > > 
> > > It looks like other APIs that take a timespec or timeval (eg, ppoll(),
> > > select(), pselect()) convert the timespec to a ktime value, which
> > > limits to KTIME_MAX (see time*_to_ktime() and ktime_set()), which is
> > > a much nicer behaviour than that which the etnaviv DRM driver is
> > > currently giving us.
> > > 
> > Are you going to provide a patch for this, or should I take a look at
> > fixing this?
> 
> Okay, the good news is that it's finally triggered again, and I now have
> debugging in both the kernel and X server to work out what's going on.
> 
> There's a whole host of things wrong here.
> 
> Here's the debug from one event:
> 
> X: timed out: to=8868123.581486415 start=4573156.286486415 end=4573156.286602102
> 
> What this is telling me is that CLOCK_MONOTONIC was reporting a time of
> 4573156.286486415s, which we used as the basis for calculating the
> timeout.  The requested timeout was VIV_WAIT_INDEFINITE (32-bit ~0 ms).
> Hence, the calculated timeout was 8868123.581486415s.  However, the
> kernel reported timeout, and CLOCK_MONOTONIC reported 4573156.286602102s
> immediately after - barely 200us later.
> 
> From the kernel side, we have:
> 
> [4573105.970932] timed out: timeout=8868123.581486415 jiffies=1143214071
>  remaining=0 fence=1056827139:1056827138
> 
> The timeout target value corresponds.  This is a 250Hz kernel, so the
> jiffies value corresponds with 1143214071/250s, or 4572856.284s.
> 
> The way etnaviv_timeout_to_jiffies() is currently written, it first
> tries to convert the timeout to a number of jiffies.  The first step
> in that is to convert the timeout to a timespec64, and then to jiffies
> using timespec64_to_jiffies().
> 
> The maximum value that timespec64_to_jiffies() will convert is 4294966s
> on my kernel, gathered by reading the disassembly:
> 
> 00000268 <timespec64_to_jiffies>:
> ...
>      274:       e1c020d0        ldrd    r2, [r0]
>      278:       e3a01000        mov     r1, #0
> ...
>      280:       e3080936        movw    r0, #35126      ; 0x8936
>      284:       e3400041        movt    r0, #65 ; 0x41
>      288:       e1510003        cmp     r1, r3
>      28c:       01500002        cmpeq   r0, r2
>      290:       2a000002        bcs     2a0 <timespec64_to_jiffies+0x38>
>      294:       e30f0fb6        movw    r0, #65462      ; 0xffb6
>      298:       e3430fff        movt    r0, #16383      ; 0x3fff
>      29c:       e89dadf0        ldm     sp, {r4, r5, r6, r7, r8, sl, fp, sp, pc}
> 
> and this returns a maximum of 0x3fffffb6 jiffies, or 1073741750.
> Since the requested timeout is past the maximum number of seconds,
> we return this as "timeout_jiffies".
> 
> start_jiffies is 1143214071, and we can see immediately that
> time_after(1143214071, 1073741750) is going to be true.
> 
> This causes remaining_jiffies to be zero, as reported in the debug
> printk as "remaining".
> 
> What's more is that we can see that CLOCK_MONOTONIC reported about
> 4573156.286486415s, and the jiffies value was 300s in the past,
> reporting 4572856.284s.
> 
> So, the calculations here are totally bogus.
> 
> I propose the following (build tested only) to fix this, but testing
> it is going to be very problematical - it's taken 52 days to reproduce
> the original problem.  So, the only sensible way to make sure that
> this is correct is to get many eyes on it and review it thoroughly.
> 
> There's a multitude of different time functions, I just picked here
> what looked the most sensible, so this may not be the best solution
> here - maybe Thomas can advise.
> 
> The other thing to consider is what happens if the userspace
> CLOCK_MONOTONIC value wraps - what should the driver behaviour be?
> Converting the userspace absolute timeout figure to a timespec64
> means that it'll be seen to be in the past, and this problem will
> rear its head again.  Maybe we're better to sticking to the
> normal timespec functions so that _our_ idea of monotonic time
> wraps in unison, so the requested timeout continues to work -
> although I'm not convinced timespec*() works across wraps.
> 
> Other things... it looks to me as if timespec64_to_jiffies() isn't
> doing the right thing - why, on a HZ=250 kernel, does it limit
> jiffies to (~0 / 1000)s and not (~0 / 250)s?  (I'm grateful that
> it did, because if it allowed even twice that figure, I'd be
> looking at a 104 day reproduction time for this issue.)
> 
> In any case, what this IMHO shows is that dealing with timeouts in
> terms of tv_sec/tv_nsec values is frought with problems - and if
> every driver is needing to open code some kind of absolute timeout
> to jiffies conversion to use in the kernel's standard wait functions,
> I can see many could fall into similar traps.  I think either we
> need to discourage the use of tv_sec/tv_nsec absolute timeout APIs
> or we need some way to help drivers get this calculation correct.

Note: although drm_etnaviv_timespec supports 64-bit timespec, there's
no way to set it to a valid 64-bit timespec, since clock_gettime() in
userspace for a 32-bit arch won't return 64-bit time.  So, while it
looks like it would be sane to convert etnaviv's timespec pointers to
timespec64, that solves nothing (in fact, I believe it makes it worse.)

I do think the patch below should not be using the 64-bit timespec
until the last conversion to jiffies.

> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_drv.h b/drivers/gpu/drm/etnaviv/etnaviv_drv.h
> index e41f38667c1c..8e575a8e98f6 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_drv.h
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_drv.h
> @@ -27,6 +27,7 @@
>  #include <linux/slab.h>
>  #include <linux/list.h>
>  #include <linux/iommu.h>
> +#include <linux/time64.h>
>  #include <linux/types.h>
>  #include <linux/sizes.h>
>  
> @@ -143,19 +144,27 @@ static inline bool fence_after_eq(u32 a, u32 b)
>  	return (s32)(a - b) >= 0;
>  }
>  
> +/*
> + * Etnaviv timeouts are specified wrt CLOCK_MONOTONIC, not jiffies.
> + * We need to calculate the timeout in terms of number of jiffies
> + * between the specified timeout and the current CLOCK_MONOTONIC time.
> + */
>  static inline unsigned long etnaviv_timeout_to_jiffies(
>  	const struct timespec *timeout)
>  {
> -	unsigned long timeout_jiffies = timespec_to_jiffies(timeout);
> -	unsigned long start_jiffies = jiffies;
> -	unsigned long remaining_jiffies;
> +	struct timespec64 ts, to;
> +
> +	to = timespec_to_timespec64(*timeout);
> +
> +	ktime_get_ts64(&ts);
> +
> +	/* timeouts before "now" have already expired */
> +	if (timespec64_compare(&to, &ts) <= 0)
> +		return 0;
>  
> -	if (time_after(start_jiffies, timeout_jiffies))
> -		remaining_jiffies = 0;
> -	else
> -		remaining_jiffies = timeout_jiffies - start_jiffies;
> +	ts = timespec64_sub(to, ts);
>  
> -	return remaining_jiffies;
> +	return timespec64_to_jiffies(&ts);
>  }
>  
>  #endif /* __ETNAVIV_DRV_H__ */
> 
> 
> -- 
> RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
> According to speedtest.net: 8.21Mbps down 510kbps up

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


More information about the etnaviv mailing list