[Intel-gfx] [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore

Mon Jul 3 09:30:48 UTC 2017

On Mon, Jul 03, 2017 at 09:55:48AM +0200, Daniel Vetter wrote:
> On Fri, Jun 30, 2017 at 09:46:36PM +0300, Ville Syrjälä wrote:
> > On Fri, Jun 30, 2017 at 08:23:58PM +0200, Daniel Vetter wrote:
> > > On Fri, Jun 30, 2017 at 5:44 PM, Ville Syrjälä
> > > <ville.syrjala at linux.intel.com> wrote:
> > > >> And if the GEM folks insist the old behavior can't be restored, then we
> > > >> just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere
> > > >> in i915_sw_fence. Force-completing all render requests atomic updates
> > > >> depend upon is imo the simplest solution to this, and we've had a driver
> > > >> that worked like that for years.
> > > >
> > > > And it used to break all the time. I think we've had to fix it at least
> > > > three times by now. So I tend to think it's better to fix it in a way
> > > > that won't break so easily.
> > > 
> > > Why exactly is making the atomic code massive more tricky the easy
> > > fix?
> > 
> > I don't see what this massive trickyness is. Compared to the rest of
> > atomic what I have is absolutely trivial. Just the
> > duplicate_committed_state() and the '.committed_state = foo'
> > assignments in hw_done(). That's it really.
> 
> >From a quick look and your description it seems full of races.

This "simple" version has problems, yes. The full versions has just
the one potential race between swap_state() and hw_done(). That
seems like pretty easy to handle.

> I'm not
> sure it'll still be simple once those are fixed.
> 
> > > That's the part I don't get. Yes it got broken a bunch because no
> > > one runs CI and everyone forgets that gen3/4 reset the display in gpu
> > > reset, but in the end we do have a depency loop, and either the
> > > modeset side or the render side needs to bail out and cancel it's
> > > async stuff (whether that's a request or a nonblocking flip/atomic
> > > commit doesn't matter). In my opinion, cancelling the request (even if
> > > we're clever and only cancel the requests for the modeset waiters,
> > > which isn't to hard to pull off) seems about the simplest option.
> > > Especially since we need that code anyway, even TDR can't safe
> > > everything and resubmit under all circumstances (at least the buggy
> > > batch can't be resubmitted).
> > > 
> > > Cancelling any kind of atomic commit otoh looks like a lot more
> > > complexity.
> > 
> > I'm not cancelling anything.
> 
> Well by overtaking the in-flight commit you are at least fighting with
> that. Either you need to cancel that one, or insert the gpu reset commit
> at the right point (and with the right state). Current code drops that and
> instead seems to just hope it doesn't lead to tears too much.

The simple version is pretty much "cross our fingers and hope for
the best" type of thing, yes. The full version I cooked up earlier
doesn't rely on hope.

In fact I was able to come up with a testcase that does make the simple
version explode, whereas the full version works just fine. So we should
abandon the idea of using this simple version actually.

> 
> > > Why do you think this is the easier, or at least less
> > > fragile option? This patch series is full of FIXMEs, and even the more
> > > complete set seems to have a pile of holes. Plus we need to stop using
> > > obj->state, and I don't see any easy way to test for that (since the
> > > gen3/4 gpu reset case is the only corner cases that currently needs
> > > that).
> > 
> > We need to fix that stuff anyway if we ever want to queue up multiple
> > commits for the same crtc. The stuff I have that is specific to this
> > reset stuff is actually very simple. The rest is just fixing up the
> > huge mess we've already made.
> 
> Rewriting the world for a regression fix seems a bit much is all I'm
> saying. And I'm not sure your approach works without that "rewrite the
> world" step. Defacto what your current patches seem to result in is
> - we commit the final sw state in gpu reset
> - before we resubmit the rendering

That's true. And we do appear to need the refactoring to make it
actually work. Most of the refactoring amounts to a couple of small
cocci scripts though so not a big deal really.

> 
> That's much easier to pull of by simply force-completing all
> i915_sw_fences before we take any modeset locks in the gpu reset path.
> Note that we don't need to force-complete any i915_gem_request, we can
> just force-complete the i915_sw_fences the work item is blocked on. Needs
> some care to avoid races with a new atomic commit (since we need to
> force-complete before we grab locks one might sneak in), but that's a
> standard pattern.
> 
> Plus we then need to wait for all outstanding nonblocking commits once we
> do have all modeset locks, since with atomic holding the locks only syncs
> against synchronous hw commits (i.e do a fake synchronous commit before we
> nuke the display and we're good). A variant of wait_for_dependencies that
> waits for all crtc instead of just the crtc in an atomic commit would do I
> think.
> 
> None of this requires any of the prep work we need the fancy additional
> atomic stuff you plan to do. Which I think is really good for a regression
> fix.
> 
> So again, why do we need to rewrite the world (since these patches here
> seem to just be the racy poc) to fix reset on gen3/4?
> 
> I know you want to do all this, but tangling up a regression fix in a
> rewrite isn't a good idea in my opinion. I'm not against your long-term
> plans, I just think it'd be good if we can have them as orthogonal pieces
> if feasible.

I've pretty much decided that I'll be happy if I can solve this for
future kernels. If someone else wants to try and cook up some kind
of simple regression fix I don't have any real objections.

-- 
Ville Syrjälä
Intel OTC