[Intel-gfx] [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore

Fri Jun 30 15:44:32 UTC 2017

On Fri, Jun 30, 2017 at 03:30:33PM +0200, Daniel Vetter wrote:
> On Fri, Jun 30, 2017 at 03:25:58PM +0200, Daniel Vetter wrote:
> > On Thu, Jun 29, 2017 at 04:49:48PM +0300, ville.syrjala at linux.intel.com wrote:
> > > From: Ville Syrjälä <ville.syrjala at linux.intel.com>
> > > 
> > > Introduce an rw_semaphore to protect the display commits. All normal
> > > commits use down_read() and hence can proceed in parallel, but GPU reset
> > > will use down_write() making sure no other commits are in progress when
> > > we have to pull the plug on the display engine on pre-g4x platforms.
> > > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > > itself, and we wait for all dependencies before the down_read(), and
> > > thus there is no chance of deadlocks with this scheme.
> > 
> > How does this solve the deadlock? Afaiui the deadlock is that the gpu
> > reset stopped unconditionally completing all requests before it did
> > anything else, including anything with the hw or taking modeset locks.
> > 
> > This ensured that any outstanding flips (we only had pageflips, no atomic
> > ones back then) could complete (although maybe displaying garbage). The
> > only thing we had to do was grab the locks (to avoid concurrent modesets)
> > and then we could happily nuke the hw (since the flips where lost in the
> > CS anyway), and restore it afterwards.
> > 
> > Then the TDR rewrite came around and broke that, which now makes atomic
> > time out waiting for the gpu to complete (because the gpu reset waits for
> > the modeset to complete first). Which means if you want to fix this
> > without breaking TDR, then you need to cancel the pending atomic commits.
> > That seems somewhat what you're attempting here with trying to figure out
> > what the latest hw-committed step is (but that function gets it wrong),
> > and lots more locking and stuff on top.
> > 
> > Why exactly can't we just go back to the old world of force-completing
> > dead requests on gen4 and earlier? That would be tons simpler imo instead
> > of throwing a pile of hacks (which really needs a complete rewrite of the
> > atomic code in i915) in as a work-around. We didn't have TDR on gen4 and
> > earlier for years, going back to that isn't going to hurt anyone.
> > 
> > Making working gen4 gpu reset contigent on cancellable atomic commits and
> > the full commit queue is imo nuts.
> 
> And if the GEM folks insist the old behavior can't be restored, then we
> just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere
> in i915_sw_fence. Force-completing all render requests atomic updates
> depend upon is imo the simplest solution to this, and we've had a driver
> that worked like that for years.

And it used to break all the time. I think we've had to fix it at least
three times by now. So I tend to think it's better to fix it in a way
that won't break so easily.

-- 
Ville Syrjälä
Intel OTC