etnaviv-gpu 134000.gpu: MMU fault status 0x00000002 on i.XM6 Quad Plus

Thu Aug 31 13:48:31 UTC 2017

On Thu, Aug 31, 2017 at 01:54:55PM +0100, Russell King - ARM Linux wrote:
> On Thu, Aug 31, 2017 at 02:49:27PM +0200, Lucas Stach wrote:
> > Hi Russell,
> > 
> > Am Donnerstag, den 31.08.2017, 12:18 +0100 schrieb Russell King - ARM
> > Linux:
> > > I've just stumbled on a bug related to the way we handle fence
> > > timeouts.
> > > 
> > > For DRM_ETNAVIV_WAIT_FENCE, we have:
> > > 
> > > struct drm_etnaviv_wait_fence {
> > >         __u32 pipe;           /* in */
> > >         __u32 fence;          /* in */
> > >         __u32 flags;          /* in, mask of ETNA_WAIT_x */
> > >         __u32 pad;
> > >         struct drm_etnaviv_timespec timeout;   /* in */
> > > };
> > > 
> > > where timeout is:
> > > 
> > > struct drm_etnaviv_timespec {
> > >         __s64 tv_sec;          /* seconds */
> > >         __s64 tv_nsec;         /* nanoseconds */
> > > };
> > > 
> > > The timeout is with respect to the monotonic clock.  If the timeout is
> > > specified far enough in the future, eg:
> > > 
> > > 9088652.2192296410 now 4793684.242296410
> > > 
> > > then rather than waiting, the function returns almost immediately with
> > > ETIMEDOUT.  The requested timeout is equivalent to (uint32_t)~0
> > > milliseconds.
> > > 
> > > In the kernel, we take the drm_etnaviv_timespec, and stick it into a
> > > struct timespec via the TS() macro.  This gets passed to
> > > etnaviv_gpu_wait_fence_interruptible(), which uses
> > > etnaviv_timeout_to_jiffies() to convert to jiffies.  I suspect that
> > > the conversion to jiffies in timespec_to_jiffies() results in a
> > > jiffy value that time_after() believes to be before the current time,
> > > resulting in ultimately a zero jiffy timeout.
> > > 
> > > Merely stracing the X server, or adding a fprintf() is enough to
> > > avoid the problem.
> > > 
> > > If you hit this problem, you'll see "fence finish failed" in the Xorg
> > > log.
> > > 
> > > I think doing the time_after() dance after converting to jiffies is
> > > wrong: if we're going to have an API that accepts absolute time, then
> > > we should handle times that are beyond the ability for us to schedule
> > > the wait correctly.
> > > 
> > > It looks like other APIs that take a timespec or timeval (eg, ppoll(),
> > > select(), pselect()) convert the timespec to a ktime value, which
> > > limits to KTIME_MAX (see time*_to_ktime() and ktime_set()), which is
> > > a much nicer behaviour than that which the etnaviv DRM driver is
> > > currently giving us.
> > > 
> > Are you going to provide a patch for this, or should I take a look at
> > fixing this?
> 
> I'm still trying to work out exactly what's going on here - if I add
> a fprintf() into the DDX or strace it, the -ETIMEDOUT error goes away.
> If I halve the delay, or take one second off the delay (that being
> ~0 msec) the problem still happens.  That doesn't tie up with the
> conclusions I came to.
> 
> So no, at the moment I'm not intending to do anything about it until
> I have a robust diagnosis.

Hmm, even requesting a 60s timeout results in the exact same behaviour.
When not stracing the X server, I get immediate timeouts.  When stracing
the X server, the timeouts vanish.  For example:

timed out: to=4802137.708235896 start=4802077.708235896, end=4802077.708243792
(EE) armada(0): [etnaviv] etnaviv_commit: fence finish failed: code=-110:errno=110

That's with the out-by-10 fixed in the nsec scaling in
etnadrm_convert_timeout(), and as we can see from the timeout and
'start' values, it really is 60sec that was requested, and we can
also see that we definitely returned from DRM_ETNAVIV_WAIT_FENCE
before the timeout actually expired.

Adding a fprintf() at the top of viv_fence_finish() in entadrm.c also
makes the problem go away.

This is on a HZ=250, NO_HZ_IDLE kernel.

I know that the timer subsystem has recently had a bug found in it
triggering RCU stalls for some people, but that's to do with timers
not firing when they should.  This feels like a timer firing way too
early, and that's fixed by causing a few extra context switches
(either by strace or by fprintf() over a ssh connection.)  That
doesn't quite stack up, because there should be context switches
between Xorg and the xts program.

If we can't rely on the kernel getting timers operating right, we
can't rely on drivers operating correctly, and by implication we can't
rely on the results that we're seeing on-screen - the problems that
are being seen _could_ be the result of the fence waits apparently
completing before they really have (although early timeout should be
logged by the DDX.)  That would certainly screw up the DDX's tracking
of who currently has write ownership of the memory, and would lead to
on-screen corruption.

My gut instinct is that this is going to take some time to track down,
because the slightest change just before the timeout being requested
seems to fix the problem.

For reference, this is how I'm producing it:

In one ssh connection, run:

  Xorg :0 -verbose -terminate -ac -keeptty

In another ssh connection, run:

  TET_RESFILE=/tmp/log DISPLAY=:0 \
  LD_LIBRARY_PATH=${xtssrc}/xts5/src/.libs:${xtssrc}/src/tet3/apilib/.libs: \
  ${xtssrc}/xts5/Xlib9/XCopyArea/XCopyArea

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up