[Intel-gfx] [PATCH 1/2] drm/i915: Make sure engines are idle during GPU idling in LR mode
Chris Wilson
chris at chris-wilson.co.uk
Thu Nov 3 21:14:52 UTC 2016
On Thu, Nov 03, 2016 at 10:57:23PM +0200, Imre Deak wrote:
> On Thu, 2016-11-03 at 18:59 +0000, Chris Wilson wrote:
> > On Thu, Nov 03, 2016 at 06:19:37PM +0200, Imre Deak wrote:
> > > We assume that the GPU is idle once we receive the seqno via the last
> > > request's user interrupt. In execlist mode the corresponding context
> > > completed interrupt can be delayed, though, and until this latter
> > > interrupt arrives we consider the request to be pending on the ELSP
> > > submit port. This can cause a problem during system suspend where this
> > > last request will be seen by the resume code as still pending. Such
> > > pending requests are normally replayed after a GPU reset, but during
> > > resume we reset both SW and HW tracking of the ring head/tail pointers,
> > > so replaying the pending request with its stale tail pointer will leave
> > > the ring in an inconsistent state. A subsequent request submission
> > > can then lead to the GPU executing from an uninitialized area of the
> > > ring behind the above stale tail pointer.
> > >
> > > Fix this by making sure any pending request on the ELSP port is
> > > completed before suspending. I used a polling wait since the completion
> > > time I measured was <1ms and since normally we only need to wait during
> > > system suspend. GPU idling during runtime suspend is scheduled with a
> > > delay (currently 50-100ms) after the retirement of the last request,
> > > by which point the context completed interrupt must have arrived
> > > already.
> > >
> > > The chance of hitting this bug was increased by
> > >
> > > commit 1c777c5d1dcdf8fa0223fcff35fb387b5bb9517a
> > > Author: Imre Deak <imre.deak at intel.com>
> > > Date: Wed Oct 12 17:46:37 2016 +0300
> > >
> > > drm/i915/hsw: Fix GPU hang during resume from S3-devices state
> > >
> > > but it could happen even without the explicit GPU reset, since we
> > > disable interrupts afterwards during the suspend sequence.
> > >
> > > Cc: Chris Wilson <chris at chris-wilson.co.uk>
> > > Cc: Mika Kuoppala <mika.kuoppala at intel.com>
> > > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98470
> > > Signed-off-by: Imre Deak <imre.deak at intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/i915_gem.c  |  3 +++
> > >  drivers/gpu/drm/i915/intel_lrc.c | 12 ++++++++++++
> > >  drivers/gpu/drm/i915/intel_lrc.h |  1 +
> > >  3 files changed, 16 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > > index 1f995ce..5ff02b5 100644
> > > --- a/drivers/gpu/drm/i915/i915_gem.c
> > > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > > @@ -2766,6 +2766,9 @@ i915_gem_idle_work_handler(struct work_struct *work)
> > >  	if (dev_priv->gt.active_requests)
> > >  		goto out_unlock;
> > >  
> > > +	if (i915.enable_execlists)
> > > +		intel_lr_wait_engines_idle(dev_priv);
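[The intel_lrc.c hunk adding intel_lr_wait_engines_idle() is snipped from
the quoted patch. Going by the commit message, the helper polls each
engine until no request remains pending on the ELSP submit port; a
minimal sketch along those lines, with the exact port checks and the
10ms timeout assumed rather than taken from the patch:

void intel_lr_wait_engines_idle(struct drm_i915_private *dev_priv)
{
	struct intel_engine_cs *engine;
	enum intel_engine_id id;

	for_each_engine(engine, dev_priv, id) {
		/* Poll until both ELSP ports are clear, i.e. the context
		 * completed interrupt for the last request has been
		 * processed; the measured completion time was <1ms, so
		 * 10ms is ample.
		 */
		if (wait_for(!READ_ONCE(engine->execlist_port[0]) &&
			     !READ_ONCE(engine->execlist_port[1]), 10))
			DRM_ERROR("Timeout waiting for %s to idle\n",
				  engine->name);
	}
}
]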
> >
> > Idle work handler... So runtime suspend.
> > Anyway this is not an ideal place for a stall under struct_mutex (even if
> > 16x10us, it's the principle!).
>
> During runtime suspend this won't add any overhead since the context
> done interrupt happened already (unless there is a bug somewhere else).
Where is that guaranteed? I thought we only serialised with the pm
interrupts. Remember this happens before rpm suspend, since
gem_idle_work_handler is responsible for dropping the GPU wakelock.
> > Move this to before the first READ_ONCE(dev_priv->gt.active_requests);
> > so we stall before taking the lock, and skip if any new requests arrive
> > whilst waiting.
> >
> > (Also i915.enable_execlists is forbidden. But meh)
> >
> > static struct drm_i915_gem_request *
> > execlists_active_port(struct intel_engine_cs *engine)
> > {
> > 	struct drm_i915_gem_request *request;
> >
> > 	request = READ_ONCE(engine->execlist_port[1]);
> > 	if (request)
> > 		return request;
> >
> > 	return READ_ONCE(engine->execlist_port[0]);
> > }
> >
> > /* Wait for execlists to settle, but bail if any new requests come in */
> > for_each_engine(engine, dev_priv, id) {
> > 	struct drm_i915_gem_request *request;
> >
> > 	request = execlists_active_port(engine);
> > 	if (!request)
> > 		continue;
> >
> > 	if (wait_for(execlists_active_port(engine) != request, 10))
> > 		DRM_ERROR("Timeout waiting for %s to idle\n", engine->name);
> > }
>
> Hm, but we still need to re-check and bail out if not idle with
> struct_mutex held, since gt.active_requests could go 0->1->0 before
> taking struct_mutex? I can rewrite things with that check added, using
> the above.
Hmm, apparently we don't care ;) If the context-done interrupt is
serialised with runtime suspend, then we don't need a wait here at all.
On the system suspend path there are no new requests and we are just flushing
the idle worker.
But yes, for the sake of correctness do both an unlocked wait followed
by a locked wait.
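[Concretely, the shape being asked for might look like the following
sketch; execlists_active_port() is the helper quoted above, the handler
body is abbreviated, and the placement of the waits is illustrative
rather than the final patch:

static void wait_for_execlists_idle(struct drm_i915_private *dev_priv)
{
	struct intel_engine_cs *engine;
	enum intel_engine_id id;

	/* Wait for execlists to settle, bailing out if the active port
	 * is replaced by a new request while we poll.
	 */
	for_each_engine(engine, dev_priv, id) {
		struct drm_i915_gem_request *request;

		request = execlists_active_port(engine);
		if (request &&
		    wait_for(execlists_active_port(engine) != request, 10))
			DRM_ERROR("Timeout waiting for %s to idle\n",
				  engine->name);
	}
}

static void i915_gem_idle_work_handler(struct work_struct *work)
{
	struct drm_i915_private *dev_priv =
		container_of(work, typeof(*dev_priv), gt.idle_work.work);

	/* Unlocked wait: stall before taking struct_mutex. */
	wait_for_execlists_idle(dev_priv);

	if (!mutex_trylock(&dev_priv->drm.struct_mutex))
		return;

	if (dev_priv->gt.active_requests)
		goto out_unlock;

	/* Locked re-check: requests may have come and gone while we
	 * waited without the lock.
	 */
	wait_for_execlists_idle(dev_priv);

	/* ... rest of the handler: mark engines idle, drop the GPU
	 * wakelock, etc.
	 */
out_unlock:
	mutex_unlock(&dev_priv->drm.struct_mutex);
}
]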
-Chris
--
Chris Wilson, Intel Open Source Technology Centre