[Intel-gfx] [PATCH 2/2] drm/i915: Capture current context on error

Ben Widawsky ben at bwidawsk.net
Sun Feb 24 22:24:59 CET 2013


On Sun, Feb 24, 2013 at 09:34:30AM +0000, Chris Wilson wrote:
> On Sat, Feb 23, 2013 at 05:30:10PM -0800, Ben Widawsky wrote:
> > On error, this represents the state of the currently running context at
> > the time it was loaded.
> > 
> > Unfortunately, since we're hung and can't switch out the context this
> > may not tell us too much about the most current state of the context,
> > but does give clues about what has happened since loading.
> > 
> > Thanks to recent doc updates, we have a little more confidence regarding
> > what is actually in this memory, and perhaps it will help us gain more
> > insight into certain bugs. AFAICT, the most interesting info is in the
> > first page. To save space, we only capture the first page. In the
> > future, we might want to dump more.
> > 
> > Sample of the relevant part of error state:
> > --- HW Context = 0x01b20000
> > 00000000 :  00000000 1100105f 00002028 ffff0880
> > 00000010 :  0000209c feff4040 000020c0 efdf0080
> > 00000020 :  00002178 00000001 0000217c 00145855
> > 00000030 :  00002310 00000000 00002314 00000000
> > 00000040 :  00002318 00000000 0000231c 00000000
> 
> Presentation looks reasonable, except it will confuse
> intel_error_decode as it will match "%x : %x". How about
> "[%03x] %08x %08x %08x %08x"?

Will %4x be okay? I suspect 1 page may not be sufficient at some point.
Also, it's probably worth modifying intel_error_decode since I suspect
over time we'll want to dump more objects.

> 
> > 
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=55845
> > Cc: Chris Wilson <chris at chris-wilson.co.uk>
> > Signed-off-by: Ben Widawsky <ben at bwidawsk.net>
> > ---
> 
> > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> > index e95337c..ab88620 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.h
> > +++ b/drivers/gpu/drm/i915/i915_drv.h
> > @@ -209,6 +209,7 @@ struct drm_i915_error_state {
> >  	u32 pgtbl_er;
> >  	u32 ier;
> >  	u32 ccid;
> > +	struct drm_i915_error_object *ctx_obj;
> 
> Put it next to the other pointers; lest we want to start digging holes.

On that request, I am going to make 1 per ring then. We don't have any
non-ring error objects currently. I've also always been in favor of
having the code more ring independent but that was removed because of
Daniel.

> 
> >  	u32 derrmr;
> >  	u32 forcewake;
> >  	bool waiting[I915_NUM_RINGS];
> > diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> > index ebaf558..7f7d241 100644
> > --- a/drivers/gpu/drm/i915/i915_irq.c
> > +++ b/drivers/gpu/drm/i915/i915_irq.c
> > @@ -1321,6 +1321,14 @@ static void i915_capture_error_state(struct drm_device *dev)
> >  	error->pgtbl_er = I915_READ(PGTBL_ER);
> >  	error->ccid = I915_READ(CCID);
> >  
> > +	if (error->ccid && !dev_priv->hw_contexts_disabled) {
> > +		list_for_each_entry(obj, &dev_priv->mm.active_list, mm_list)
> 
> I am doubtful that the active list will hold the object in all cases, as
> we only put the context obj onto the active list when switching away.
> I'd check the gtt_list to be on the safe side.

Hmm, that shouldn't have been how the code worked when I wrote it, and
I'm too lazy to look if it's changed. The context should be made active
on the switch to it. If we use rc6, and don't do that, that's really bad
because rc6 will automatically save/restore the context (though as long
as it remained pinned we're safe).  In any case, gtt_list is fine with
me, though I still believe if it's not on the active list, it's an
error.

> And ignore what we think of hw_context_disabled - if the CCID randomly
> points to one of our objects, lets attach it.

I can switch to HAS_HW_CONTEXTS for this. I'm really hoping we never see
CCID != 0 when not using contexts though.

> -Chris

> -- 
> Chris Wilson, Intel Open Source Technology Centre

-- 
Ben Widawsky, Intel Open Source Technology Center



More information about the Intel-gfx mailing list