[Intel-gfx] [PATCH] drm/i915: Evict CS TLBs between batches

Mon Sep 8 10:15:50 CEST 2014

On Mon, Sep 08, 2014 at 10:03:51AM +0200, Daniel Vetter wrote:
> On Sun, Sep 07, 2014 at 09:08:31AM +0100, Chris Wilson wrote:
> > Running igt, I was encountering the invalid TLB bug on my 845g, despite
> > that it was using the CS workaround. Examining the w/a buffer in the
> > error state, showed that the copy from the user batch into the
> > workaround itself was suffering from the invalid TLB bug (the first
> > cacheline was broken with the first two words reversed). Time to try a
> > fresh approach. This extends the workaround to write into each page of
> > our scratch buffer in order to overflow the TLB and evict the invalid
> > entries. This could be refined to only do so after we update the GTT,
> > but for simplicity, we do it before each batch.
> > 
> > I suspect this supersedes our current workaround, but for safety keep
> > doing both.
> 
> I suspect that we might end up with just an elaborate delay
> implementation, but if it works then it's good. One nitpick below, with
> that addressed this is Reviewed-by: Daniel Vetter <daniel.vetter at ffwll.ch>

One way to test that is simply comparing 64x4096 byte writes in the same
page vs 64x4 byte writes in 64 different pages. That should be roughly
the same latency (thought with TLB fetches you never be too sure) and
demonstrate that it is either the TLB or the delay that's the factor.

In my testing, I did multiple copies into the batch w/a so that I was
reasonably sure that what the source was stable and the copy of the
source didn't match N times.

> > 
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> > Cc: stable at vger.kernel.org
> > ---
> >  drivers/gpu/drm/i915/i915_reg.h         | 12 ++++---
> >  drivers/gpu/drm/i915/intel_ringbuffer.c | 62 +++++++++++++++++++--------------
> >  2 files changed, 44 insertions(+), 30 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> > index e4d7607da2c4..f29b44c86a2f 100644
> > --- a/drivers/gpu/drm/i915/i915_reg.h
> > +++ b/drivers/gpu/drm/i915/i915_reg.h
> > @@ -334,16 +334,20 @@
> >  #define GFX_OP_DESTBUFFER_INFO	 ((0x3<<29)|(0x1d<<24)|(0x8e<<16)|1)
> >  #define GFX_OP_DRAWRECT_INFO     ((0x3<<29)|(0x1d<<24)|(0x80<<16)|(0x3))
> >  #define GFX_OP_DRAWRECT_INFO_I965  ((0x7900<<16)|0x2)
> > -#define SRC_COPY_BLT_CMD                ((2<<29)|(0x43<<22)|4)
> > +
> > +#define COLOR_BLT_CMD			(2<<29 | 0x40<<22 | (5-2))
> > +#define SRC_COPY_BLT_CMD		((2<<29)|(0x43<<22)|4)
> >  #define XY_SRC_COPY_BLT_CMD		((2<<29)|(0x53<<22)|6)
> >  #define XY_MONO_SRC_COPY_IMM_BLT	((2<<29)|(0x71<<22)|5)
> > -#define XY_SRC_COPY_BLT_WRITE_ALPHA	(1<<21)
> > -#define XY_SRC_COPY_BLT_WRITE_RGB	(1<<20)
> > +#define   BLT_WRITE_A			(2<<20)
> > +#define   BLT_WRITE_RGB			(1<<20)
> > +#define   BLT_WRITE_RGBA		(BLT_WRITE_RGB | BLT_WRITE_A)
> >  #define   BLT_DEPTH_8			(0<<24)
> >  #define   BLT_DEPTH_16_565		(1<<24)
> >  #define   BLT_DEPTH_16_1555		(2<<24)
> >  #define   BLT_DEPTH_32			(3<<24)
> > -#define   BLT_ROP_GXCOPY		(0xcc<<16)
> > +#define   BLT_ROP_SRC_COPY		(0xcc<<16)
> > +#define   BLT_ROP_COLOR_COPY		(0xf0<<16)
> >  #define XY_SRC_COPY_BLT_SRC_TILED	(1<<15) /* 965+ only */
> >  #define XY_SRC_COPY_BLT_DST_TILED	(1<<11) /* 965+ only */
> >  #define CMD_OP_DISPLAYBUFFER_INFO ((0x0<<29)|(0x14<<23)|2)
> > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > index 16371a444426..acd933b1b027 100644
> > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > @@ -1368,49 +1368,59 @@ i830_dispatch_execbuffer(struct intel_engine_cs *ring,
> >  				u64 offset, u32 len,
> >  				unsigned flags)
> >  {
> > +	u32 cs_offset = ring->scratch.gtt_offset;
> >  	int ret;
> >  
> > -	if (flags & I915_DISPATCH_PINNED) {
> > -		ret = intel_ring_begin(ring, 4);
> > -		if (ret)
> > -			return ret;
> > +	ret = intel_ring_begin(ring, 6);
> > +	if (ret)
> > +		return ret;
> >  
> > -		intel_ring_emit(ring, MI_BATCH_BUFFER);
> > -		intel_ring_emit(ring, offset | (flags & I915_DISPATCH_SECURE ? 0 : MI_BATCH_NON_SECURE));
> > -		intel_ring_emit(ring, offset + len - 8);
> > -		intel_ring_emit(ring, MI_NOOP);
> > -		intel_ring_advance(ring);
> > -	} else {
> > -		u32 cs_offset = ring->scratch.gtt_offset;
> > +	/* Evict the invalid PTE TLBs */
> > +	intel_ring_emit(ring, COLOR_BLT_CMD | BLT_WRITE_RGBA);
> > +	intel_ring_emit(ring, BLT_DEPTH_32 | BLT_ROP_COLOR_COPY | 4096);
> > +	intel_ring_emit(ring, 64 << 16 | 4); /* load each page */
> 
> s/64/(I830M_BACHT_LIMIT / 4096)/ for clarity

Hmm, I don't think so. My thinking here is that this is TLB entries,
so perhaps

#define I830M_TLB_ENTRIES 64
#define I830M_BATCH_LIMIT 256*1024 /* uAPI convention */

#define I830M_WA_BUFSIZE MAX(GTT_PAGE_SIZE * I830M_TLB_ENTRIES, I830M_BATCH_LIMIT)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre