[Intel-gfx] [PATCH 2/2] drm/i915: Capture current context on error

Chris Wilson chris at chris-wilson.co.uk
Sun Feb 24 22:40:19 CET 2013


On Sun, Feb 24, 2013 at 01:24:59PM -0800, Ben Widawsky wrote:
> On Sun, Feb 24, 2013 at 09:34:30AM +0000, Chris Wilson wrote:
> > On Sat, Feb 23, 2013 at 05:30:10PM -0800, Ben Widawsky wrote:
> > > On error, this represents the state of the currently running context at
> > > the time it was loaded.
> > > 
> > > Unfortunately, since we're hung and can't switch out the context this
> > > may not tell us too much about the most current state of the context,
> > > but does give clues about what has happened since loading.
> > > 
> > > Thanks to recent doc updates, we have a little more confidence regarding
> > > what is actually in this memory, and perhaps it will help us gain more
> > > insight into certain bugs. AFAICT, the most interesting info is in the
> > > first page. To save space, we only capture the first page. In the
> > > future, we might want to dump more.
> > > 
> > > Sample of the relevant part of error state:
> > > --- HW Context = 0x01b20000
> > > 00000000 :  00000000 1100105f 00002028 ffff0880
> > > 00000010 :  0000209c feff4040 000020c0 efdf0080
> > > 00000020 :  00002178 00000001 0000217c 00145855
> > > 00000030 :  00002310 00000000 00002314 00000000
> > > 00000040 :  00002318 00000000 0000231c 00000000
> > 
> > Presentation looks reasonable, except it will confuse
> > intel_error_decode as it will match "%x : %x". How about
> > "[%03x] %08x %08x %08x %08x"?
> 
> Will %4x be okay? I suspect 1 page may not be sufficient at some point.
> Also, it's probably worth modifying intel_error_decode since I suspect
> over time we'll want to dump more objects.

%04x is fine. I was simply trimming a few zeroes for brevity. Using []
was the key to making it distinct enough for intel_error_decode was
still being comprehensible for us.
 
> > > References: https://bugs.freedesktop.org/show_bug.cgi?id=55845
> > > Cc: Chris Wilson <chris at chris-wilson.co.uk>
> > > Signed-off-by: Ben Widawsky <ben at bwidawsk.net>
> > > ---
> > 
> > > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> > > index e95337c..ab88620 100644
> > > --- a/drivers/gpu/drm/i915/i915_drv.h
> > > +++ b/drivers/gpu/drm/i915/i915_drv.h
> > > @@ -209,6 +209,7 @@ struct drm_i915_error_state {
> > >  	u32 pgtbl_er;
> > >  	u32 ier;
> > >  	u32 ccid;
> > > +	struct drm_i915_error_object *ctx_obj;
> > 
> > Put it next to the other pointers; lest we want to start digging holes.
> 
> On that request, I am going to make 1 per ring then. We don't have any
> non-ring error objects currently. I've also always been in favor of
> having the code more ring independent but that was removed because of
> Daniel.

Yes. I also think of contexts as being a far more generic constructs than
just the current automagic hw save/restore.

> > >  	u32 derrmr;
> > >  	u32 forcewake;
> > >  	bool waiting[I915_NUM_RINGS];
> > > diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> > > index ebaf558..7f7d241 100644
> > > --- a/drivers/gpu/drm/i915/i915_irq.c
> > > +++ b/drivers/gpu/drm/i915/i915_irq.c
> > > @@ -1321,6 +1321,14 @@ static void i915_capture_error_state(struct drm_device *dev)
> > >  	error->pgtbl_er = I915_READ(PGTBL_ER);
> > >  	error->ccid = I915_READ(CCID);
> > >  
> > > +	if (error->ccid && !dev_priv->hw_contexts_disabled) {
> > > +		list_for_each_entry(obj, &dev_priv->mm.active_list, mm_list)
> > 
> > I am doubtful that the active list will hold the object in all cases, as
> > we only put the context obj onto the active list when switching away.
> > I'd check the gtt_list to be on the safe side.
> 
> Hmm, that shouldn't have been how the code worked when I wrote it, and
> I'm too lazy to look if it's changed. The context should be made active
> on the switch to it. If we use rc6, and don't do that, that's really bad
> because rc6 will automatically save/restore the context (though as long
> as it remained pinned we're safe).  In any case, gtt_list is fine with
> me, though I still believe if it's not on the active list, it's an
> error.

Hmm, the code we ended up with only did move_to_active() when switching
away in order for it to remain pinned until after the MI_SET_CONTEXT had
been processed. Before then it is kept alive through explicit pinning
and referencing from the ring.
 
> > And ignore what we think of hw_context_disabled - if the CCID randomly
> > points to one of our objects, lets attach it.
> 
> I can switch to HAS_HW_CONTEXTS for this. I'm really hoping we never see
> CCID != 0 when not using contexts though.

Complete agree. I also dread the day we see random CCID values. But
stanger things have happened.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre



More information about the Intel-gfx mailing list