[Intel-gfx] [PATCH 3/3] drm/i915: Disable PMSI sleep messages on all rings around context switches

Daniel Vetter daniel at ffwll.ch
Tue Dec 16 01:19:47 PST 2014


On Mon, Dec 15, 2014 at 07:03:43PM +0200, Ville Syrjälä wrote:
> On Mon, Dec 15, 2014 at 04:24:48PM +0000, Chris Wilson wrote:
> > On Mon, Dec 15, 2014 at 10:41:52AM +0100, Daniel Vetter wrote:
> > > On Thu, Dec 11, 2014 at 08:17:01AM +0000, Chris Wilson wrote:
> > > > There exists a current workaround to prevent a hang on context switch
> > > > should the ring go to sleep in the middle of the restore,
> > > > WaProgramMiArbOnOffAroundMiSetContext (applicable to all gen7+). In
> > > > spite of disabling arbitration (which prevents the ring from powering
> > > > down during the critical section) we were still hitting hangs that had
> > > > the hallmarks of the known erratum. That is we are still seeing hangs
> > > > "on the last instruction in the context restore". By comparing -nightly
> > > > (broken) with requests (working), we were able to deduce that it was the
> > > > semaphore LRI cross-talk that reproduced the original failure.
> > > > Explicitly disabling PMSI sleep on the RCS ring was insufficient, all
> > > > the rings had to be awake to prevent the hangs. Fortunately, we can
> > > > reduce the wakelock to the MI_SET_CONTEXT operation itself, and so
> > > > should be able to limit the extra power implications.
> > > > 
> > > > Since the MI_ARB_ON_OFF workaround is listed for all gen7 and above
> > > > products, we should apply this extra hammer for all of the same
> > > > platforms despite so far that we have only been able to reproduce the
> > > > hang on certain ivb and hsw models. The last question is whether we want
> > > > to always use the extra hammer or only when we know semaphores are in
> > > > operation. At the moment, we only use LRI on non-RCS rings for
> > > > semaphores, but that may change in the future with the possibility of
> > > > reintroducing this bug under subtle conditions.
> > > > 
> > > > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=80660
> > > > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=83677
> > > > Cc: Simon Farnsworth <simon at farnz.org.uk>
> > > > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > > > Cc: stable at vger.kernel.org
> > > 
> > > Oh dear, but awesome piece of debugging!
> > > 
> > > First question: How does this work on vlv with the split forcewake domain?
> > > Any bad side effects like longer ctx switch times?
> > 
> > It will work on vlv as well since the split power domains do not matter
> > here: the LRI is on an active ring, and the result of the LRI is to
> > temporary wake up the other domain.
> > The negative impact is that we do wake up other power domains and rings
> > during the context switch, a small but definite power increase.
> 
> I don't think there's any point in worrying about that for context
> switches as long as we still have the "semaphore LRI wakes up all
> power wells for every single batch" issue around.

Hm right, there have been patches floating around for that one ... A short
mention in the commit message about this would be useful.

> > > > ---
> > > >  drivers/gpu/drm/i915/i915_gem_context.c | 39 ++++++++++++++++++++++++---------
> > > >  drivers/gpu/drm/i915/i915_reg.h         |  2 ++
> > > >  2 files changed, 31 insertions(+), 10 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> > > > index 2acf5803cf32..724ccecea06a 100644
> > > > --- a/drivers/gpu/drm/i915/i915_gem_context.c
> > > > +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> > > > @@ -484,7 +484,9 @@ mi_set_context(struct intel_engine_cs *ring,
> > > >  	       u32 hw_flags)
> > > >  {
> > > >  	u32 flags = hw_flags | MI_MM_SPACE_GTT;
> > > > -	int ret;
> > > > +	struct intel_engine_cs *engine;
> > > > +	const int num_rings = i915_semaphore_is_enabled(ring->dev) ? hweight32(INTEL_INFO(ring->dev)->ring_mask) : 0;
> > > > +	int len, i, ret;
> > > >  
> > > >  	/* w/a: If Flush TLB Invalidation Mode is enabled, driver must do a TLB
> > > >  	 * invalidation prior to MI_SET_CONTEXT. On GEN6 we don't set the value
> > > > @@ -501,15 +503,26 @@ mi_set_context(struct intel_engine_cs *ring,
> > > >  	if (!IS_HASWELL(ring->dev) && INTEL_INFO(ring->dev)->gen < 8)
> > > >  		flags |= (MI_SAVE_EXT_STATE_EN | MI_RESTORE_EXT_STATE_EN);
> > > >  
> > > > -	ret = intel_ring_begin(ring, 6);
> > > > +
> > > > +	len = 4;
> > > > +	if (INTEL_INFO(ring->dev)->gen >= 7)
> > > > +		len += 4*num_rings + 2;
> > > > +
> > > > +	ret = intel_ring_begin(ring, len);
> > > >  	if (ret)
> > > >  		return ret;
> > > >  
> > > >  	/* WaProgramMiArbOnOffAroundMiSetContext:ivb,vlv,hsw,bdw,chv */
> > > > -	if (INTEL_INFO(ring->dev)->gen >= 7)
> > > > -		intel_ring_emit(ring, MI_ARB_ON_OFF | MI_ARB_DISABLE);
> > > > -	else
> > > > -		intel_ring_emit(ring, MI_NOOP);
> > > > +	if (INTEL_INFO(ring->dev)->gen >= 7) {
> > > > +		if (num_rings) {
> > > > +			intel_ring_emit(ring, MI_LOAD_REGISTER_IMM(num_rings));
> > > > +			for_each_ring(engine, to_i915(ring->dev), i) {
> > > > +				intel_ring_emit(ring, RING_PSMI_CTL(engine->mmio_base));
> > > > +				intel_ring_emit(ring, _MASKED_BIT_ENABLE(GEN6_PSMI_SLEEP_MSG_DISABLE));
> > > > +			}
> > > > +		} else
> > > > +			intel_ring_emit(ring, MI_ARB_ON_OFF | MI_ARB_DISABLE);
> > > 
> > > For added paranoia, shouldn't we keep this one here? Or is the only thing
> > > this does frob the idle notifications like PSMI? I kinda prefer we just
> > > keep it if there's no harmful effects ...
> > 
> > Yeah, MI_ARB_ON_OFF is just a magic command used to do the same as the
> > PSMI twiddling on the local ring.
> > 
> > We can do MI_ARB_ON_OFF + LRI(!RCS) which is slightly shorter
> > instruction wise, closer to the original w/a and looks slightly neater.

Tbh without confirmation from a hw uarch that they're equivalent I prefer
we keep the mi_arb dance for the local ring, just in case that has some
other magic sauce going on. Means a bit more testing unfortunately, but
this has been going on for a long time already anyway.

Thanks, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


More information about the Intel-gfx mailing list