[Intel-gfx] [PATCH 2/2] drm/i915: Reorder hw init to avoid executing with invalid context/mm state

Wed Jan 21 02:12:12 PST 2015

On Fri, Dec 05, 2014 at 07:03:42PM +0200, Ville Syrjälä wrote:
> On Fri, Dec 05, 2014 at 04:26:23PM +0000, Chris Wilson wrote:
> > On Fri, Dec 05, 2014 at 04:53:49PM +0200, Ville Syrjälä wrote:
> > > On Fri, Dec 05, 2014 at 02:38:46PM +0000, Chris Wilson wrote:
> > > > On Fri, Dec 05, 2014 at 04:31:35PM +0200, Ville Syrjälä wrote:
> > > > > On Fri, Dec 05, 2014 at 02:15:22PM +0000, Chris Wilson wrote:
> > > > > > Currently we initialise the rings, add the first context switch to the
> > > > > > ring and execute our golden state then enable (aliasing or full) ppgtt.
> > > > > > However, as we enable ppgtt using direct MMIO but load the PD using
> > > > > > MI_LRI, we end up executing the context switch and golden render state
> > > > > > with an invalid PD generating page faults. To solve this issue, first do
> > > > > > the ppgtt PD setup, then set the default context and write the commands
> > > > > > to run the render state into the ring, before we activate the ring. This
> > > > > > allows us to be sure that the register state is valid before we begin
> > > > > > execution.
> > > > > > 
> > > > > > This was spotted when writing the seqno/request conversion, but only with
> > > > > > the ERROR capture did I realise that it was a necessity now.
> > > > > > 
> > > > > > RFC: cleanup the error handling in i915_gem_init_hw.
> > > > > > 
> > > > > > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > > > > > ---
> > > > > >  drivers/gpu/drm/i915/i915_gem.c         | 20 ++++++++++----------
> > > > > >  drivers/gpu/drm/i915/intel_ringbuffer.c |  9 ++++++---
> > > > > >  2 files changed, 16 insertions(+), 13 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > > > > > index c1c11418231b..c13842d3cbc9 100644
> > > > > > --- a/drivers/gpu/drm/i915/i915_gem.c
> > > > > > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > > > > > @@ -4796,15 +4796,15 @@ i915_gem_init_hw(struct drm_device *dev)
> > > > > >  	 */
> > > > > >  	init_unused_rings(dev);
> > > > > >  
> > > > > > -	for_each_ring(ring, dev_priv, i) {
> > > > > > -		ret = ring->init_hw(ring);
> > > > > > -		if (ret)
> > > > > > -			return ret;
> > > > > > -	}
> > > > > > -
> > > > > >  	for (i = 0; i < NUM_L3_SLICES(dev); i++)
> > > > > >  		i915_gem_l3_remap(&dev_priv->ring[RCS], i);
> > > > > 
> > > > > This is going to assume ring->head/tail are already valid?
> > > > 
> > > > We write into the ring obj, not the ring itself, which should be setup
> > > > during the various intel_init_engine, i.e. the backing storage is
> > > > independent of the actual registers.
> > > 
> > > I mean the software shadows, not the registers themselves. When the GPU
> > > hangs I expect rign->head != ring->tail. So what makes those two identical
> > > again after the GPU reset?
> > 
> > Why would they be equal after reset?
> 
> They wouldn't. That's my whole point.
> 
> > At the moment, we discard all
> > outstanding requests which makes them equal.
> 
> OK, so you're saying somehwere we set them to some sane values. But
> I don't see any obvious code for that. The obvious code was in
> init_ring_common(), but you just removed said code in this patch.

When we clear the request list, we should be updating the position of
last known head, which should then be used to set reset the ring->head
(which in turn is used to reprogram the hw on reset/enable).
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre