[Intel-gfx] [PATCH 3/3] drm/i915: Disable PMSI sleep messages on all rings around context switches

Tue Dec 16 01:55:23 PST 2014

On Tue, Dec 16, 2014 at 10:19:47AM +0100, Daniel Vetter wrote:
> On Mon, Dec 15, 2014 at 07:03:43PM +0200, Ville Syrjälä wrote:
> > On Mon, Dec 15, 2014 at 04:24:48PM +0000, Chris Wilson wrote:
> > > On Mon, Dec 15, 2014 at 10:41:52AM +0100, Daniel Vetter wrote:
> > > > On Thu, Dec 11, 2014 at 08:17:01AM +0000, Chris Wilson wrote:
> > > > > There exists a current workaround to prevent a hang on context switch
> > > > > should the ring go to sleep in the middle of the restore,
> > > > > WaProgramMiArbOnOffAroundMiSetContext (applicable to all gen7+). In
> > > > > spite of disabling arbitration (which prevents the ring from powering
> > > > > down during the critical section) we were still hitting hangs that had
> > > > > the hallmarks of the known erratum. That is, we are still seeing hangs
> > > > > "on the last instruction in the context restore". By comparing -nightly
> > > > > (broken) with requests (working), we were able to deduce that it was the
> > > > > semaphore LRI cross-talk that reproduced the original failure.
> > > > > Explicitly disabling PMSI sleep on the RCS ring was insufficient; all
> > > > > the rings had to be awake to prevent the hangs. Fortunately, we can
> > > > > reduce the wakelock to the MI_SET_CONTEXT operation itself, and so
> > > > > should be able to limit the extra power implications.
> > > > > 
> > > > > Since the MI_ARB_ON_OFF workaround is listed for all gen7 and above
> > > > > products, we should apply this extra hammer for all of the same
> > > > > platforms, even though so far we have only been able to reproduce the
> > > > > hang on certain ivb and hsw models. The last question is whether we want
> > > > > to always use the extra hammer or only when we know semaphores are in
> > > > > operation. At the moment, we only use LRI on non-RCS rings for
> > > > > semaphores, but that may change in the future with the possibility of
> > > > > reintroducing this bug under subtle conditions.
> > > > > 
> > > > > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=80660
> > > > > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=83677
> > > > > Cc: Simon Farnsworth <simon at farnz.org.uk>
> > > > > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > > > > Cc: stable at vger.kernel.org
> > > > 
> > > > Oh dear, but awesome piece of debugging!
> > > > 
> > > > First question: How does this work on vlv with the split forcewake domain?
> > > > Any bad side effects like longer ctx switch times?
> > > 
> > > It will work on vlv as well since the split power domains do not matter
> > > here: the LRI is on an active ring, and the result of the LRI is to
> > > temporarily wake up the other domain.
> > > The negative impact is that we do wake up other power domains and rings
> > > during the context switch, a small but definite power increase.
> > 
> > I don't think there's any point in worrying about that for context
> > switches as long as we still have the "semaphore LRI wakes up all
> > power wells for every single batch" issue around.
> 
> Hm right, there have been patches floating around for that one ... A short
> mention in the commit message about this would be useful.

If you notice, it was hinted at in the commit message. requests does
deferred semaphore signalling and that was key to realising the trigger
for the bug. I expanded upon that in v2.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
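
For readers following the thread without the patch at hand, the ordering
described in the commit message can be sketched roughly as below. This is
only an illustration under stated assumptions, not the actual i915 code:
the command tokens, the register offsets and every sk_* name are
hypothetical placeholders rather than real hardware encodings or driver
functions. It only shows the bracketing: arbitration off, sleep messages
disabled on every ring via one LRI, the context switch itself, then
everything restored in reverse order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder command tokens: NOT the real hardware encodings. */
enum sk_cmd {
    SK_ARB_OFF = 1,   /* stand-in for disabling arbitration */
    SK_ARB_ON,        /* stand-in for re-enabling arbitration */
    SK_LRI,           /* followed by a count, then (reg, value) pairs */
    SK_SET_CONTEXT    /* followed by a context id */
};

#define SK_SLEEP_MSG_DISABLE 1u
#define SK_SLEEP_MSG_ENABLE  0u

/* Hypothetical per-ring offset of the sleep-message control register. */
static uint32_t sk_sleep_ctl_reg(unsigned ring)
{
    return 0x1000u * (ring + 1u) + 0x50u;
}

/* Emit one LRI that writes `value` to the sleep control of every ring. */
static size_t sk_emit_lri_all(uint32_t *out, size_t n,
                              unsigned num_rings, uint32_t value)
{
    out[n++] = SK_LRI;
    out[n++] = num_rings;
    for (unsigned r = 0; r < num_rings; r++) {
        out[n++] = sk_sleep_ctl_reg(r);
        out[n++] = value;
    }
    return n;
}

/*
 * Build the command sequence for one context switch.
 * `out` must have room for at least 8 + 4 * num_rings entries.
 * Returns the number of entries written.
 */
size_t sk_emit_context_switch(uint32_t *out, unsigned num_rings,
                              uint32_t ctx_id)
{
    size_t n = 0;

    assert(out != NULL);

    out[n++] = SK_ARB_OFF;
    /* Keep all rings awake only for the critical section. */
    n = sk_emit_lri_all(out, n, num_rings, SK_SLEEP_MSG_DISABLE);
    out[n++] = SK_SET_CONTEXT;
    out[n++] = ctx_id;
    /* Let the rings sleep again as soon as the switch is emitted. */
    n = sk_emit_lri_all(out, n, num_rings, SK_SLEEP_MSG_ENABLE);
    out[n++] = SK_ARB_ON;

    return n;
}
```

Keeping the two LRI writes immediately around the context switch is what
limits the wakelock to that one operation, which is the power trade-off
discussed above.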