[PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore

Fri Jun 30 13:30:33 UTC 2017

On Fri, Jun 30, 2017 at 03:25:58PM +0200, Daniel Vetter wrote:
> On Thu, Jun 29, 2017 at 04:49:48PM +0300, ville.syrjala at linux.intel.com wrote:
> > From: Ville Syrjälä <ville.syrjala at linux.intel.com>
> > 
> > Introduce an rw_semaphore to protect the display commits. All normal
> > commits use down_read() and hence can proceed in parallel, but GPU reset
> > will use down_write() making sure no other commits are in progress when
> > we have to pull the plug on the display engine on pre-g4x platforms.
> > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > itself, and we wait for all dependencies before the down_read(), and
> > thus there is no chance of deadlocks with this scheme.
> 
> How does this solve the deadlock? Afaiui the deadlock is that the gpu
> reset stopped unconditionally completing all requests before it did
> anything else, including anything with the hw or taking modeset locks.
> 
> This ensured that any outstanding flips (we only had pageflips, no atomic
> ones back then) could complete (although maybe displaying garbage). The
> only thing we had to do was grab the locks (to avoid concurrent modesets)
> and then we could happily nuke the hw (since the flips where lost in the
> CS anyway), and restore it afterwards.
> 
> Then the TDR rewrite came around and broke that, which now makes atomic
> time out waiting for the gpu to complete (because the gpu reset waits for
> the modeset to complete first). Which means if you want to fix this
> without breaking TDR, then you need to cancel the pending atomic commits.
> That seems somewhat what you're attempting here with trying to figure out
> what the latest hw-committed step is (but that function gets it wrong),
> and lots more locking and stuff on top.
> 
> Why exactly can't we just go back to the old world of force-completing
> dead requests on gen4 and earlier? That would be tons simpler imo instead
> of throwing a pile of hacks (which really needs a complete rewrite of the
> atomic code in i915) in as a work-around. We didn't have TDR on gen4 and
> earlier for years, going back to that isn't going to hurt anyone.
> 
> Making working gen4 gpu reset contigent on cancellable atomic commits and
> the full commit queue is imo nuts.

And if the GEM folks insist the old behavior can't be restored, then we
just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere
in i915_sw_fence. Force-completing all render requests atomic updates
depend upon is imo the simplest solution to this, and we've had a driver
that worked like that for years.

And as long as TDR resubmits the batches the render-glitch will at most be
temporary, but that's totally fine: We're killing the entire display block
in the reset anyway, there's no way the user won't notice _that_ kind of
glitch.

Either way, let's please not over-engineer something for a dead-old
platform when something much, much, much simpler worked for years, and
should keep on working for another few years.
-Daniel

PS: One issue with atomic is that there's the small matter of having to
wait for pending atomic commits to complete before we nuke the display, to
avoid upsetting the display code. We could do that with a dummy commit, or
just having a special wait_for_depencies that waits for all CRTC to
complete their pending atomic updates.
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch