[Intel-gfx] [PATCH v2] drm/i915/lrc: Scrub the GPU state of the guilty hanging request

Chris Wilson chris at chris-wilson.co.uk
Fri Apr 27 20:35:06 UTC 2018


Quoting Michel Thierry (2018-04-27 21:27:46)
> On 4/27/2018 1:24 PM, Chris Wilson wrote:
> > Previously, we just reset the ring register in the context image such
> > that we could skip over the broken batch and emit the closing
> > breadcrumb. However, on resume the context image and GPU state would be
> > reloaded, which may have been left in an inconsistent state by the
> > reset. The presumption was that at worst it would just cause another
> > reset and skip again until it recovered, however it seems just as likely
> > to cause an unrecoverable hang. Instead of risking loading an incomplete
> > context image, restore it back to the default state.
> > 
> > v2: Fix up off-by-one from including the ppHWSP in with the register
> > state.
> > 
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> > Cc: Michał Winiarski <michal.winiarski at intel.com>
> > Cc: Michel Thierry <michel.thierry at intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> 
> Reviewed-by: Michel Thierry <michel.thierry at intel.com>
> 
> Does it need a 'Fixes:' tag or have a bugzilla reference?

I suspect it's rare enough that the unrecoverable hang might not be
recognisable in bugzilla. I was just looking at

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4108/fi-bsw-n3050/dmesg0.log

trying to think of ways the reset might appear to work but the
recovery still fail, with

<7>[  521.765114] missed_breadcrumb vecs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x5a/0x80 [i915]
<7>[  521.765176] missed_breadcrumb 	current seqno e4e, last e4f, hangcheck e4e [2048 ms], inflight 1
<7>[  521.765191] missed_breadcrumb 	Reset count: 0 (global 0)
<7>[  521.765206] missed_breadcrumb 	Requests:
<7>[  521.765223] missed_breadcrumb 		first  e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[  521.765239] missed_breadcrumb 		last   e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[  521.765256] missed_breadcrumb 		active e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[  521.765274] missed_breadcrumb 		[head 3900, postfix 3930, tail 3948, batch 0x00000000_00042000]
<7>[  521.765289] missed_breadcrumb 		ring->start:  0x008ef000
<7>[  521.765301] missed_breadcrumb 		ring->head:   0x000038f8
<7>[  521.765313] missed_breadcrumb 		ring->tail:   0x00003948
<7>[  521.765325] missed_breadcrumb 		ring->emit:   0x00003950
<7>[  521.765337] missed_breadcrumb 		ring->space:  0x00002618
<7>[  521.765372] missed_breadcrumb 	RING_START: 0x008ef000
<7>[  521.765389] missed_breadcrumb 	RING_HEAD:  0x000038f8
<7>[  521.765404] missed_breadcrumb 	RING_TAIL:  0x00003948
<7>[  521.765422] missed_breadcrumb 	RING_CTL:   0x00003001
<7>[  521.765438] missed_breadcrumb 	RING_MODE:  0x00000000
<7>[  521.765453] missed_breadcrumb 	RING_IMR: fffffefe
<7>[  521.765473] missed_breadcrumb 	ACTHD:  0x00000000_022039b8
<7>[  521.765492] missed_breadcrumb 	BBADDR: 0x00000000_00042004
<7>[  521.765511] missed_breadcrumb 	DMA_FADDR: 0x00000000_008f28f8
<7>[  521.765537] missed_breadcrumb 	IPEIR: 0x00000000
<7>[  521.765552] missed_breadcrumb 	IPEHR: 0x11000011
<7>[  521.765570] missed_breadcrumb 	Execlist status: 0x00044032 00000002
<7>[  521.765586] missed_breadcrumb 	Execlist CSB read 1 [1 cached], write 2 [2 from hws], interrupt posted? no, tasklet queued? no (enabled)
<7>[  521.765604] missed_breadcrumb 	Execlist CSB[2]: 0x00000001 [0x00000001 in hwsp], context: 0 [0 in hwsp]
<7>[  521.765619] missed_breadcrumb 		ELSP[0] count=1, rq: e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
<7>[  521.765632] missed_breadcrumb 		ELSP[1] idle
<7>[  521.765645] missed_breadcrumb 		HW active? 0x1
<7>[  521.765660] missed_breadcrumb 		E e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
<7>[  521.765670] missed_breadcrumb 		Queue priority: -2147483648
<7>[  521.765684] missed_breadcrumb 	gem_sync [3112] waiting for e4f
<7>[  521.765697] missed_breadcrumb IRQ? 0x1 (breadcrumbs? yes) (execlists? no)
<7>[  521.765707] missed_breadcrumb HWSP:
<7>[  521.765723] missed_breadcrumb 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765733] missed_breadcrumb *
<7>[  521.765747] missed_breadcrumb 00000040 00000001 00000000 00000018 00000002 00000001 00000000 00000018 00000002
<7>[  521.765760] missed_breadcrumb 00000060 00000001 00000000 00000018 00000002 00000000 00000000 00000000 00000002
<7>[  521.765774] missed_breadcrumb 00000080 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765784] missed_breadcrumb *
<7>[  521.765809] missed_breadcrumb 000000c0 00000e4e 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765823] missed_breadcrumb 000000e0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765833] missed_breadcrumb *
<7>[  521.765845] missed_breadcrumb Idle? no

Of particular note: IPEHR is an MI_LRI, the ring is idle (it hasn't
moved on from the earlier reset), and the fetch address is unconnected
to the rings, so naturally I assume it died loading the context image
on resume.
-Chris
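
The reading of the dump above can be checked by hand. What follows is a
minimal sketch rather than anything from the patch: it assumes the usual
MI command header layout (client in bits 31:29, opcode in bits 28:23,
dword length in bits 7:0), that MI_LOAD_REGISTER_IMM is opcode 0x22 as in
i915's MI_INSTR(0x22, 2*n - 1), that RING_CTL holds (size - 4096) in bits
20:12, and that "fetch address" refers to ACTHD. The helper names
(decode_mi, ring_size, in_ring) are made up for illustration.

```python
# Register values copied from the dmesg dump above.
IPEHR = 0x11000011
RING_START = 0x008EF000
RING_CTL = 0x00003001
ACTHD = 0x00000000_022039B8

MI_LRI_OPCODE = 0x22  # assumption: i915's MI_INSTR(0x22, 2*n - 1)


def decode_mi(header):
    """Split a command header into (client, opcode, dword length)."""
    client = header >> 29           # 0 means an MI command
    opcode = (header >> 23) & 0x3F  # 6-bit MI opcode
    length = header & 0xFF          # total dwords minus 2
    return client, opcode, length


def ring_size(ctl):
    """Ring size in bytes; assumes bits 20:12 hold (size - 4096)."""
    return (ctl & 0x001FF000) + 0x1000


def in_ring(addr):
    """True if addr lies inside [RING_START, RING_START + size)."""
    return RING_START <= addr < RING_START + ring_size(RING_CTL)


client, opcode, length = decode_mi(IPEHR)
is_lri = client == 0 and opcode == MI_LRI_OPCODE
# An LRI of n registers has a length field of 2*n - 1.
lri_registers = (length + 1) // 2

print("IPEHR is MI_LRI:", is_lri, "registers:", lri_registers)
print("ring size: %#x" % ring_size(RING_CTL))
print("ACTHD inside ring:", in_ring(ACTHD))
```

Under those assumptions IPEHR decodes as an MI_LRI of 9 registers, the
ring is 16 KiB starting at RING_START, and ACTHD falls outside it.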

