[Intel-gfx] [PATCH 0/3] drm/i915: Handle hanging during nonblocking modeset correctly.

Mon Jan 30 08:17:51 UTC 2017

On Fri, Jan 27, 2017 at 03:08:45PM +0000, Chris Wilson wrote:
> On Fri, Jan 27, 2017 at 03:58:08PM +0100, Daniel Vetter wrote:
> > On Fri, Jan 27, 2017 at 02:31:55PM +0000, Chris Wilson wrote:
> > > On Fri, Jan 27, 2017 at 03:21:29PM +0100, Daniel Vetter wrote:
> > > > On Fri, Jan 27, 2017 at 09:30:50AM +0000, Chris Wilson wrote:
> > > > > On Thu, Jan 26, 2017 at 04:59:21PM +0100, Maarten Lankhorst wrote:
> > > > > > When writing some testcases for nonblocking modesets. I found out that the
> > > > > > infinite wait on the old fb was causing issues.
> > > > > 
> > > > > The crux of the issue here is the locked wait for old dependencies and
> > > > > the inability to inject the intel_prepare_reset disabling of all planes.
> > > > > There are a couple of locked waits on struct_mutex within the modeset
> > > > > locks for intel_overlay and if we happen to be using the display plane
> > > > > for the first time.
> > > > > 
> > > > > The first I suggested solving using fences to track dependencies and
> > > > > keep the order between atomic states. Cancelling the outstanding
> > > > > modesets, replacing with a disable and then on restore jumping to the
> > > > > final state look doable. It also requires avoiding the struct_mutex for
> > > > > disabling, which is quite easy. To avoid the wait under struct_mutex,
> > > > > we've talked about switching to mmio, but for starters we could move the
> > > > > wait from inside intel_overlay into the fence for the atomic operation.
> > > > > (But's that a little more surgery than we would like for intel_overlay I
> > > > > guess - dig out Ville's patches for overlay planes?) And to prevent the
> > > > > wait under struct_mutex for pin_to_display_plane, my plane is to move
> > > > > that to an async fenced operation that is then naturally waited upon by
> > > > > the atomic modeset.
> > > > 
> > > > A bit more a hack, but a different idea, and I think hack for gen234.0 is
> > > > ok:
> > > > 
> > > > We complete all the requests before we start the hw reset with fence.error
> > > > = -EIO. But we do this only when we need to get at the display locks. A
> > > > slightly more elegant solution would be to trylock modeset locks, and if
> > > > one of them fails (and only then) complete all requests with -EIO to get
> > > > the concurrent modeset to proceed before we reset the hardware. That's
> > > > essentially the logic we had before all the reworks, and it worked. But I
> > > > didn't look at how scary that all would be to make it work again ...
> > > 
> > > The modeset lock may not just be waiting on our requests (even on pnv we
> > > can expect that there are already users celebrating that pnv+nouveau
> > > finally works ;) and that the display is not the only user/observer of
> > > those requests. Using the requests to break the modeset lock just feels
> > > like the wrong approach.
> > 
> > It's a cycle, and we need to break it somewhere. Another option might be
> > to break the cycle the same way we do it for gem locks: Wake up everyone
> > and restart the modeset ioctl. Since the trouble only happens for
> > synchronous modesets where we hold the locks while waiting for fences, we
> > can also break out of that and restart. And I also don't think that would
> > leak to other drivers, after all our gem locking restart dances also don't
> > leak to other drivers - it's just our own driver's lock which are affected
> > by these special wakupe semantics.
> 
> It's a queue of nonblocking modesets that we need to worry about, afaik.
> Moving the wait for blocking modeset outside of modeset lock is easily
> achievable (and avoiding the other waits under both the modeset + 
> struct_mutex I have at least an idea for). So the challenge is how to
> inject all-planes-off for gen3 and then allow the queue to continue again
> afterwards.

Hm right, I missed the nonblocking updates which don't take locks. But
assuming we do the display reset for gpu resets as a full modeset (i.e.
going through ->atomic_commit) it should still work out correctly:

Starting state: gpu is hung, nonblocking modeset waiting for some requests
to complete.

1. hangcheck kicks in, fires off reset work.

2. We complete all requests with fence.error = -EIO and wake up any
waiters. That means no re-queueing for older platforms, but oh well.

3. We grab all the display locks. Nothing happens yet.

4. We reset the chip, display dies.

5. We run ->atomic_commit to restore things. This will also force the
nonblocking commit worker to complete before this display restore touches
anything.

The only trouble I see is that the nonblocking worker can still touch the
display block while we kill it, which isn't awesome. But we can fix that
by waiting for all pending nonblocking commits in step 3 manually (without
calling into atomic_commit), as long as we do step 2.

So completing everything with EIO unconditionally still seems like the
simplest option that actually works for pre-g4x ...
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch