[Intel-gfx] [PATCH v2] drm/i915/lrc: Scrub the GPU state of the guilty hanging request
Chris Wilson
chris at chris-wilson.co.uk
Fri Apr 27 20:35:06 UTC 2018
Quoting Michel Thierry (2018-04-27 21:27:46)
> On 4/27/2018 1:24 PM, Chris Wilson wrote:
> > Previously, we just reset the ring register in the context image such
> > that we could skip over the broken batch and emit the closing
> > breadcrumb. However, on resume the context image and GPU state would be
> > reloaded, which may have been left in an inconsistent state by the
> > reset. The presumption was that at worst it would just cause another
> > reset and skip again until it recovered, however it seems just as likely
> > to cause an unrecoverable hang. Instead of risking loading an incomplete
> > context image, restore it back to the default state.
> >
> > v2: Fix up off-by-one from including the ppHWSP in with the register
> > state.
> >
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> > Cc: Michał Winiarski <michal.winiarski at intel.com>
> > Cc: Michel Thierry <michel.thierry at intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>
> Reviewed-by: Michel Thierry <michel.thierry at intel.com>
>
> Does it need a 'Fixes:' tag or has a bugzilla reference?
I suspect it's rare enough that the unrecoverable hang might not be
recognisable in bugzilla. I was just looking at
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4108/fi-bsw-n3050/dmesg0.log
trying to think of ways in which the reset might appear to work but the
recovery still fail, as in:
<7>[ 521.765114] missed_breadcrumb vecs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x5a/0x80 [i915]
<7>[ 521.765176] missed_breadcrumb current seqno e4e, last e4f, hangcheck e4e [2048 ms], inflight 1
<7>[ 521.765191] missed_breadcrumb Reset count: 0 (global 0)
<7>[ 521.765206] missed_breadcrumb Requests:
<7>[ 521.765223] missed_breadcrumb first e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[ 521.765239] missed_breadcrumb last e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[ 521.765256] missed_breadcrumb active e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[ 521.765274] missed_breadcrumb [head 3900, postfix 3930, tail 3948, batch 0x00000000_00042000]
<7>[ 521.765289] missed_breadcrumb ring->start: 0x008ef000
<7>[ 521.765301] missed_breadcrumb ring->head: 0x000038f8
<7>[ 521.765313] missed_breadcrumb ring->tail: 0x00003948
<7>[ 521.765325] missed_breadcrumb ring->emit: 0x00003950
<7>[ 521.765337] missed_breadcrumb ring->space: 0x00002618
<7>[ 521.765372] missed_breadcrumb RING_START: 0x008ef000
<7>[ 521.765389] missed_breadcrumb RING_HEAD: 0x000038f8
<7>[ 521.765404] missed_breadcrumb RING_TAIL: 0x00003948
<7>[ 521.765422] missed_breadcrumb RING_CTL: 0x00003001
<7>[ 521.765438] missed_breadcrumb RING_MODE: 0x00000000
<7>[ 521.765453] missed_breadcrumb RING_IMR: fffffefe
<7>[ 521.765473] missed_breadcrumb ACTHD: 0x00000000_022039b8
<7>[ 521.765492] missed_breadcrumb BBADDR: 0x00000000_00042004
<7>[ 521.765511] missed_breadcrumb DMA_FADDR: 0x00000000_008f28f8
<7>[ 521.765537] missed_breadcrumb IPEIR: 0x00000000
<7>[ 521.765552] missed_breadcrumb IPEHR: 0x11000011
<7>[ 521.765570] missed_breadcrumb Execlist status: 0x00044032 00000002
<7>[ 521.765586] missed_breadcrumb Execlist CSB read 1 [1 cached], write 2 [2 from hws], interrupt posted? no, tasklet queued? no (enabled)
<7>[ 521.765604] missed_breadcrumb Execlist CSB[2]: 0x00000001 [0x00000001 in hwsp], context: 0 [0 in hwsp]
<7>[ 521.765619] missed_breadcrumb ELSP[0] count=1, rq: e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
<7>[ 521.765632] missed_breadcrumb ELSP[1] idle
<7>[ 521.765645] missed_breadcrumb HW active? 0x1
<7>[ 521.765660] missed_breadcrumb E e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
<7>[ 521.765670] missed_breadcrumb Queue priority: -2147483648
<7>[ 521.765684] missed_breadcrumb gem_sync [3112] waiting for e4f
<7>[ 521.765697] missed_breadcrumb IRQ? 0x1 (breadcrumbs? yes) (execlists? no)
<7>[ 521.765707] missed_breadcrumb HWSP:
<7>[ 521.765723] missed_breadcrumb 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[ 521.765733] missed_breadcrumb *
<7>[ 521.765747] missed_breadcrumb 00000040 00000001 00000000 00000018 00000002 00000001 00000000 00000018 00000002
<7>[ 521.765760] missed_breadcrumb 00000060 00000001 00000000 00000018 00000002 00000000 00000000 00000000 00000002
<7>[ 521.765774] missed_breadcrumb 00000080 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[ 521.765784] missed_breadcrumb *
<7>[ 521.765809] missed_breadcrumb 000000c0 00000e4e 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[ 521.765823] missed_breadcrumb 000000e0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[ 521.765833] missed_breadcrumb *
<7>[ 521.765845] missed_breadcrumb Idle? no
Of particular note are the IPEHR being MI_LRI, the ring being idle (it
hasn't moved on from the earlier reset) and the fetch address being
unconnected to the rings, so naturally I assume it died loading the
context image on resume.
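
For anyone wanting to repeat that reading of the dump, here is a minimal
sketch of the two checks: decoding an IPEHR value as an MI command header,
and testing whether an address falls inside the ring described by
RING_START/RING_CTL. The helper names are hypothetical and are not i915
functions; the bit layouts (MI opcode in bits 28:23, MI_LRI opcode 0x22 with
2*n - 1 in the low byte, RING_CTL length in bits 20:12 as pages minus one)
are assumptions based on how the i915 register macros are usually written,
so check them against the headers before relying on this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed MI command header layout: type 0 in bits 31:29, opcode in 28:23. */
#define MI_OPCODE_SHIFT             23
#define MI_OPCODE_MASK              0x3fu
#define MI_LOAD_REGISTER_IMM_OPCODE 0x22u

/* Assumed RING_CTL layout: bits 20:12 hold (pages - 1), bit 0 is "valid". */
#define RING_NR_PAGES 0x001ff000u
#define GTT_PAGE_SIZE 4096u

/* Hypothetical helper: does this header dword look like MI_LOAD_REGISTER_IMM? */
static bool is_mi_lri(uint32_t header)
{
	if ((header >> 29) != 0) /* not an MI command */
		return false;
	return ((header >> MI_OPCODE_SHIFT) & MI_OPCODE_MASK) ==
	       MI_LOAD_REGISTER_IMM_OPCODE;
}

/* Hypothetical helper: register count n, assuming the low byte is 2*n - 1. */
static uint32_t mi_lri_num_regs(uint32_t header)
{
	return ((header & 0xffu) + 1) / 2;
}

/* Hypothetical helper: ring size in bytes as encoded in RING_CTL. */
static uint32_t ring_size_bytes(uint32_t ring_ctl)
{
	return (ring_ctl & RING_NR_PAGES) + GTT_PAGE_SIZE;
}

/* Hypothetical helper: is addr within [ring_start, ring_start + size)? */
static bool addr_in_ring(uint32_t addr, uint32_t ring_start, uint32_t ring_ctl)
{
	return addr >= ring_start &&
	       addr - ring_start < ring_size_bytes(ring_ctl);
}
```

Fed the values from the dump above, is_mi_lri(0x11000011) holds,
ring_size_bytes(0x00003001) gives 0x4000, and the lower dword of ACTHD
(0x022039b8) lies outside the ring starting at 0x008ef000.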
-Chris