etnaviv-gpu 134000.gpu: MMU fault status 0x00000002 on i.XM6 Quad Plus

Thu Aug 31 15:51:40 UTC 2017

On Thu, Aug 31, 2017 at 02:48:31PM +0100, Russell King - ARM Linux wrote:
> On Thu, Aug 31, 2017 at 01:54:55PM +0100, Russell King - ARM Linux wrote:
> > On Thu, Aug 31, 2017 at 02:49:27PM +0200, Lucas Stach wrote:
> > > Hi Russell,
> > > 
> > > Am Donnerstag, den 31.08.2017, 12:18 +0100 schrieb Russell King - ARM
> > > Linux:
> > > > I've just stumbled on a bug related to the way we handle fence
> > > > timeouts.
> > > > 
> > > > For DRM_ETNAVIV_WAIT_FENCE, we have:
> > > > 
> > > > struct drm_etnaviv_wait_fence {
> > > >         __u32 pipe;           /* in */
> > > >         __u32 fence;          /* in */
> > > >         __u32 flags;          /* in, mask of ETNA_WAIT_x */
> > > >         __u32 pad;
> > > >         struct drm_etnaviv_timespec timeout;   /* in */
> > > > };
> > > > 
> > > > where timeout is:
> > > > 
> > > > struct drm_etnaviv_timespec {
> > > >         __s64 tv_sec;          /* seconds */
> > > >         __s64 tv_nsec;         /* nanoseconds */
> > > > };
> > > > 
> > > > The timeout is with respect to the monotonic clock.  If the timeout is
> > > > specified far enough in the future, eg:
> > > > 
> > > > 9088652.2192296410 now 4793684.242296410
> > > > 
> > > > then rather than waiting, the function returns almost immediately with
> > > > ETIMEDOUT.  The requested timeout is equivalent to (uint32_t)~0
> > > > milliseconds.
> > > > 
> > > > In the kernel, we take the drm_etnaviv_timespec, and stick it into a
> > > > struct timespec via the TS() macro.  This gets passed to
> > > > etnaviv_gpu_wait_fence_interruptible(), which uses
> > > > etnaviv_timeout_to_jiffies() to convert to jiffies.  I suspect that
> > > > the conversion to jiffies in timespec_to_jiffies() results in a
> > > > jiffy value that time_after() believes to be before the current time,
> > > > resulting in ultimately a zero jiffy timeout.
> > > > 
> > > > Merely stracing the X server, or adding a fprintf() is enough to
> > > > avoid the problem.
> > > > 
> > > > If you hit this problem, you'll see "fence finish failed" in the Xorg
> > > > log.
> > > > 
> > > > I think doing the time_after() dance after converting to jiffies is
> > > > wrong: if we're going to have an API that accepts absolute time, then
> > > > we should handle times that are beyond the ability for us to schedule
> > > > the wait correctly.
> > > > 
> > > > It looks like other APIs that take a timespec or timeval (eg, ppoll(),
> > > > select(), pselect()) convert the timespec to a ktime value, which
> > > > limits to KTIME_MAX (see time*_to_ktime() and ktime_set()), which is
> > > > a much nicer behaviour than that which the etnaviv DRM driver is
> > > > currently giving us.
> > > > 
> > > Are you going to provide a patch for this, or should I take a look at
> > > fixing this?
> > 
> > I'm still trying to work out exactly what's going on here - if I add
> > a fprintf() into the DDX or strace it, the -ETIMEDOUT error goes away.
> > If I halve the delay, or take one second off the delay (that being
> > ~0 msec) the problem still happens.  That doesn't tie up with the
> > conclusions I came to.
> > 
> > So no, at the moment I'm not intending to do anything about it until
> > I have a robust diagnosis.
> 
> Hmm, even requesting a 60s timeout results in the exact same behaviour.
> When not stracing the X server, I get immediate timeouts.  When stracing
> the X server, the timeouts vanish.  For example:
> 
> timed out: to=4802137.708235896 start=4802077.708235896, end=4802077.708243792
> (EE) armada(0): [etnaviv] etnaviv_commit: fence finish failed: code=-110:errno=110
> 
> That's with the out-by-10 fixed in the nsec scaling in
> etnadrm_convert_timeout(), and as we can see from the timeout and
> 'start' values, it really is 60sec that was requested, and we can
> also see that we definitely returned from DRM_ETNAVIV_WAIT_FENCE
> before the timeout actually expired.
> 
> Adding a fprintf() at the top of viv_fence_finish() in entadrm.c also
> makes the problem go away.
> 
> This is on a HZ=250, NO_HZ_IDLE kernel.
> 
> I know that the timer subsystem has recently had a bug found in it
> triggering RCU stalls for some people, but that's to do with timers
> not firing when they should.  This feels like a timer firing way too
> early, and that's fixed by causing a few extra context switches
> (either by strace or by fprintf() over a ssh connection.)  That
> doesn't quite stack up, because there should be context switches
> between Xorg and the xts program.
> 
> If we can't rely on the kernel getting timers operating right, we
> can't rely on drivers operating correctly, and by implication we can't
> rely on the results that we're seeing on-screen - the problems that
> are being seen _could_ be the result of the fence waits apparently
> completing before they really have (although early timeout should be
> logged by the DDX.)  That would certainly screw up the DDX's tracking
> of who currently has write ownership of the memory, and would lead to
> on-screen corruption.
> 
> My gut instinct is that this is going to take some time to track down,
> because the slightest change just before the timeout being requested
> seems to fix the problem.
> 
> For reference, this is how I'm producing it:
> 
> In one ssh connection, run:
> 
>   Xorg :0 -verbose -terminate -ac -keeptty
> 
> In another ssh connection, run:
> 
>   TET_RESFILE=/tmp/log DISPLAY=:0 \
>   LD_LIBRARY_PATH=${xtssrc}/xts5/src/.libs:${xtssrc}/src/tet3/apilib/.libs: \
>   ${xtssrc}/xts5/Xlib9/XCopyArea/XCopyArea

... and reboot, and the problem goes away.  Smells like a jiffies wrap
issue.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up