[Intel-gfx] [PATCH v2] drm/i915/lrc: Scrub the GPU state of the guilty hanging request

Michel Thierry michel.thierry at intel.com
Fri Apr 27 22:30:06 UTC 2018


On 4/27/2018 1:35 PM, Chris Wilson wrote:
> Quoting Michel Thierry (2018-04-27 21:27:46)
>> On 4/27/2018 1:24 PM, Chris Wilson wrote:
>>> Previously, we just reset the ring register in the context image such
>>> that we could skip over the broken batch and emit the closing
>>> breadcrumb. However, on resume the context image and GPU state would be
>>> reloaded, which may have been left in an inconsistent state by the
>>> reset. The presumption was that at worst it would just cause another
>>> reset and skip again until it recovered; however, it seems just as likely
>>> to cause an unrecoverable hang. Instead of risking loading an incomplete
>>> context image, restore it back to the default state.
>>>
>>> v2: Fix up off-by-one from including the ppHWSP in with the register
>>> state.
>>>
>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
>>> Cc: Michał Winiarski <michal.winiarski at intel.com>
>>> Cc: Michel Thierry <michel.thierry at intel.com>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>
>> Reviewed-by: Michel Thierry <michel.thierry at intel.com>
>>
>> Does it need a 'Fixes:' tag, or have a bugzilla reference?
> 
> I suspect it's rare enough that the unrecoverable hang might not be
> recognisable in bugzilla. I was just looking at
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4108/fi-bsw-n3050/dmesg0.log
> 
> trying to think of ways the reset might appear to work but the
> recovery fail, with
> 
> <7>[  521.765114] missed_breadcrumb vecs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x5a/0x80 [i915]
> <7>[  521.765176] missed_breadcrumb 	current seqno e4e, last e4f, hangcheck e4e [2048 ms], inflight 1
> <7>[  521.765191] missed_breadcrumb 	Reset count: 0 (global 0)
> <7>[  521.765206] missed_breadcrumb 	Requests:
> <7>[  521.765223] missed_breadcrumb 		first  e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
> <7>[  521.765239] missed_breadcrumb 		last   e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
> <7>[  521.765256] missed_breadcrumb 		active e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
> <7>[  521.765274] missed_breadcrumb 		[head 3900, postfix 3930, tail 3948, batch 0x00000000_00042000]
> <7>[  521.765289] missed_breadcrumb 		ring->start:  0x008ef000
> <7>[  521.765301] missed_breadcrumb 		ring->head:   0x000038f8
> <7>[  521.765313] missed_breadcrumb 		ring->tail:   0x00003948
> <7>[  521.765325] missed_breadcrumb 		ring->emit:   0x00003950
> <7>[  521.765337] missed_breadcrumb 		ring->space:  0x00002618
> <7>[  521.765372] missed_breadcrumb 	RING_START: 0x008ef000
> <7>[  521.765389] missed_breadcrumb 	RING_HEAD:  0x000038f8
> <7>[  521.765404] missed_breadcrumb 	RING_TAIL:  0x00003948
> <7>[  521.765422] missed_breadcrumb 	RING_CTL:   0x00003001
> <7>[  521.765438] missed_breadcrumb 	RING_MODE:  0x00000000
> <7>[  521.765453] missed_breadcrumb 	RING_IMR: fffffefe
> <7>[  521.765473] missed_breadcrumb 	ACTHD:  0x00000000_022039b8
> <7>[  521.765492] missed_breadcrumb 	BBADDR: 0x00000000_00042004
> <7>[  521.765511] missed_breadcrumb 	DMA_FADDR: 0x00000000_008f28f8
> <7>[  521.765537] missed_breadcrumb 	IPEIR: 0x00000000
> <7>[  521.765552] missed_breadcrumb 	IPEHR: 0x11000011
> <7>[  521.765570] missed_breadcrumb 	Execlist status: 0x00044032 00000002
> <7>[  521.765586] missed_breadcrumb 	Execlist CSB read 1 [1 cached], write 2 [2 from hws], interrupt posted? no, tasklet queued? no (enabled)
> <7>[  521.765604] missed_breadcrumb 	Execlist CSB[2]: 0x00000001 [0x00000001 in hwsp], context: 0 [0 in hwsp]
> <7>[  521.765619] missed_breadcrumb 		ELSP[0] count=1, rq: e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
> <7>[  521.765632] missed_breadcrumb 		ELSP[1] idle
> <7>[  521.765645] missed_breadcrumb 		HW active? 0x1
> <7>[  521.765660] missed_breadcrumb 		E e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
> <7>[  521.765670] missed_breadcrumb 		Queue priority: -2147483648
> <7>[  521.765684] missed_breadcrumb 	gem_sync [3112] waiting for e4f
> <7>[  521.765697] missed_breadcrumb IRQ? 0x1 (breadcrumbs? yes) (execlists? no)
> <7>[  521.765707] missed_breadcrumb HWSP:
> <7>[  521.765723] missed_breadcrumb 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> <7>[  521.765733] missed_breadcrumb *
> <7>[  521.765747] missed_breadcrumb 00000040 00000001 00000000 00000018 00000002 00000001 00000000 00000018 00000002
> <7>[  521.765760] missed_breadcrumb 00000060 00000001 00000000 00000018 00000002 00000000 00000000 00000000 00000002
> <7>[  521.765774] missed_breadcrumb 00000080 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> <7>[  521.765784] missed_breadcrumb *
> <7>[  521.765809] missed_breadcrumb 000000c0 00000e4e 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> <7>[  521.765823] missed_breadcrumb 000000e0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> <7>[  521.765833] missed_breadcrumb *
> <7>[  521.765845] missed_breadcrumb Idle? no
> 
> Of particular note: the IPEHR is MI_LRI, the ring is idle (it
> hasn't moved on from the earlier reset) and the fetch address is
> unconnected to the rings, so naturally I assume it died loading the
> context image on resume.
Plus it is a bsw...
Agreed, this looks like an issue during the ctx restore.
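
For anyone reading along, the "IPEHR being MI_LRI" observation can be checked by hand. The sketch below is only an illustration, not part of the patch: it assumes the header layout implied by i915's MI_INSTR(opcode, flags) macro (opcode in bits 28:23, MI client 0 in bits 31:29) and the 2*n - 1 length encoding of MI_LOAD_REGISTER_IMM(n). The helper names are hypothetical.

```python
# Sketch: decode an MI command header such as the IPEHR value in the dump above.
# Assumed layout, taken from i915's MI_INSTR(opcode, flags) = (opcode << 23) | flags.

MI_LOAD_REGISTER_IMM_OPCODE = 0x22  # opcode used by MI_LOAD_REGISTER_IMM (MI_LRI)


def decode_mi_header(dword):
    """Split a command header into client, MI opcode and the low length byte."""
    return {
        "client": (dword >> 29) & 0x7,   # 0 means an MI_* command
        "opcode": (dword >> 23) & 0x3F,  # MI opcode field
        "length": dword & 0xFF,          # length field (2*n - 1 for MI_LRI)
    }


def is_mi_lri(dword):
    """True if the header looks like MI_LOAD_REGISTER_IMM."""
    fields = decode_mi_header(dword)
    return fields["client"] == 0 and fields["opcode"] == MI_LOAD_REGISTER_IMM_OPCODE


def lri_register_count(dword):
    """Number of (register, value) pairs, given length = 2*n - 1."""
    return (decode_mi_header(dword)["length"] + 1) // 2


ipehr = 0x11000011  # IPEHR from the missed_breadcrumb dump
print(is_mi_lri(ipehr), lri_register_count(ipehr))
```

Under those assumptions, 0x11000011 decodes as an MI_LRI header covering nine register writes, which matches the reading that the engine stopped partway through a register load.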

> -Chris
> 

