[Intel-gfx] [PATCH 1/2] drm/i915: Use a heavyweight irq-seqno barrier for gen6+

Wed Feb 15 12:32:18 UTC 2017

On Wed, Feb 15, 2017 at 12:15:43PM +0000, Chris Wilson wrote:
> On Wed, Feb 15, 2017 at 01:59:59PM +0200, Ville Syrjälä wrote:
> > On Wed, Feb 15, 2017 at 11:36:08AM +0000, Chris Wilson wrote:
> > > I tried lightening the barrier in commit 9b9ed3093613 ("drm/i915: Remove
> > > forcewake dance from seqno/irq barrier on legacy gen6+") but that
> > > appears slightly on the optimistic side as we detect a very rare missed
> > > interrupt on a few machines. This patch goes super heavy with a
> > > force-waked sync-flush dance (enforces a delay as we perform a
> > > round-trip with the GPU during which it flushes it write caches - these
> > > should already be flushed just prior to the interrupt and so should be
> > > fairly light as we respond to the interrupt). The saving grace to using
> > > such a heavy barrier is that we only apply it following once an interrupt,
> > > and only if the expect seqno hasn't landed.
> > > 
> > > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99816
> > > Fixes: 9b9ed3093613 ("drm/i915: Remove forcewake dance from seqno/irq barrier on legacy gen6+")
> > > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > > Cc: Mika Kuoppala <mika.kuoppala at intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/intel_ringbuffer.c | 40 +++++++++++++++++++--------------
> > >  1 file changed, 23 insertions(+), 17 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > index 69c30a76d395..d1f408938479 100644
> > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > @@ -1492,25 +1492,26 @@ static void
> > >  gen6_seqno_barrier(struct intel_engine_cs *engine)
> > >  {
> > >  	struct drm_i915_private *dev_priv = engine->i915;
> > > -
> > > -	/* Workaround to force correct ordering between irq and seqno writes on
> > > -	 * ivb (and maybe also on snb) by reading from a CS register (like
> > > -	 * ACTHD) before reading the status page.
> > > -	 *
> > > -	 * Note that this effectively stalls the read by the time it takes to
> > > -	 * do a memory transaction, which more or less ensures that the write
> > > -	 * from the GPU has sufficient time to invalidate the CPU cacheline.
> > > -	 * Alternatively we could delay the interrupt from the CS ring to give
> > > -	 * the write time to land, but that would incur a delay after every
> > > -	 * batch i.e. much more frequent than a delay when waiting for the
> > > -	 * interrupt (with the same net latency).
> > > +	i915_reg_t reg = RING_INSTPM(engine->mmio_base);
> > > +
> > > +	/* Tell the gpu to flush its render cache (which should mostly be
> > > +	 * emptied as we flushed it before the interrupt) and wait for it
> > > +	 * to signal completion. This imposes a round trip that is at least
> > > +	 * as long as the memory access to memory from the GPU and the
> > > +	 * uncached mmio - that should be sufficient for the outstanding
> > > +	 * write of the seqno to land!
> > >  	 *
> > > -	 * Also note that to prevent whole machine hangs on gen7, we have to
> > > -	 * take the spinlock to guard against concurrent cacheline access.
> > > +	 * Unfortunately, this requires the GPU to be kicked out of rc6 as
> > > +	 * we wait (both incurring extra delay in acquiring the wakeref) and
> > > +	 * extra power.
> > >  	 */
> > > -	spin_lock_irq(&dev_priv->uncore.lock);
> > > -	POSTING_READ_FW(RING_ACTHD(engine->mmio_base));
> > > -	spin_unlock_irq(&dev_priv->uncore.lock);
> > > +
> > > +	intel_uncore_forcewake_get(dev_priv, engine->fw_domains);
> > > +	I915_WRITE_FW(reg, _MASKED_BIT_ENABLE(INSTPM_SYNC_FLUSH));
> > > +	WARN_ON_ONCE(intel_wait_for_register_fw(dev_priv,
> > > +						reg, INSTPM_SYNC_FLUSH, 0,
> > > +						10));
> > > +	intel_uncore_forcewake_put(dev_priv, engine->fw_domains);
> > 
> > Why do I have a nagging feeling that sync flush is somehow busted? Maybe
> > I'm confusing it with the operation flush thing, which we disable.
> 
> Note we are not using it as an actual flush, just a means for a round
> trip with hw. Using it as an estimate for the roundtrip of GPU with memory
> is a bonus, but afaik the MI_STORE_DATA itself is independent of the
> caches and even SyncFlush.
>  
> > Hmm. The spec does say sync flush should only be issued after stopping the
> > rings.
> 
> I was working on the basis that if you wanted to guarrantee that all
> writes have been flushed and no new ones created, you would have to stop
> the rings (and that was what the spec meant).
> 
> > If we were to use it, couldn't we also tell it to write the status into
> > the HWS and poll that instead?
> 
> Hmm, that's a reasonable idea. We can assume that the writes into the
> HWS (seqno, then the round trip) are ordered so by polling on the
> HWS for the out-of-band semaphore should give us the right timing.
> 
> Now I have to remember what writes into and what masks the HWS...

Speaking of which, our HWSTAM stuff in the irq code looks quite
fubar. We're unmasking all sorts of weird bits on pre-snb platforms,
and that happens well before we setup up the HWS IIRC. And on snb+
we aren't doing anything apparently, so we just hope that everything
starts out masked I guess. The execlist code does explicitly mask
everything.

-- 
Ville Syrjälä
Intel OTC