[Intel-gfx] [PATCH] drm/i915: Disable per-engine reset for Broxton
Michel Thierry
michel.thierry at intel.com
Thu Jul 6 17:02:13 UTC 2017
On Thu, Jul 6, 2017 at 12:11 AM, Chris Wilson
<chris at chris-wilson.co.uk> wrote:
> Quoting Michel Thierry (2017-07-06 02:24:26)
>> On 04/07/17 09:09, Chris Wilson wrote:
>> > Triggering a GPU reset for one engine affects another, notably
>> > corrupting the context status buffer (CSB) effectively losing
>> track of
>> > inflight requests.
>> >
>> > Adding a few printks:
>> > diff --git a/drivers/gpu/drm/i915/i915_drv.c
>> b/drivers/gpu/drm/i915/i915_drv.c
>> > index ad41836fa5e5..a969456bc0fa 100644
>> > --- a/drivers/gpu/drm/i915/i915_drv.c
>> > +++ b/drivers/gpu/drm/i915/i915_drv.c
>> > @@ -1953,6 +1953,7 @@ int i915_reset_engine(struct
>> intel_engine_cs *engine)
>> > goto out;
>> > }
>> >
>> > + pr_err("Resetting %s\n", engine->name);
>> > ret = intel_gpu_reset(engine->i915,
>> intel_engine_flag(engine));
>> > if (ret) {
>> > /* If we fail here, we expect to fallback to a
>> global reset */
>> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c
>> b/drivers/gpu/drm/i915/intel_lrc.c
>> > index 716e5c9ea222..a72bc35d0870 100644
>> > --- a/drivers/gpu/drm/i915/intel_lrc.c
>> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> > @@ -355,6 +355,7 @@ static void execlists_submit_ports(struct
>> intel_engine_cs *engine)
>> >
>> execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
>> > port_set(&port[n], port_pack(rq, count));
>> > desc = execlists_update_context(rq);
>> > + pr_err("%s: in (rq=%x) ctx=%d\n",
>> engine->name, rq->global_seqno, upper_32_bits(desc));
>> > GEM_DEBUG_EXEC(port[n].context_id =
>> upper_32_bits(desc));
>> > } else {
>> > GEM_BUG_ON(!n);
>> > @@ -594,9 +595,23 @@ static void intel_lrc_irq_handler(unsigned
>> long data)
>> > if (!(status &
>> GEN8_CTX_STATUS_COMPLETED_MASK))
>> > continue;
>> >
>> > + pr_err("%s: out CSB (%x head=%d,
>> tail=%d), ctx=%d, rq=%d\n",
>> > + engine->name,
>> > + readl(csb_mmio),
>> > + head, tail,
>> > + readl(buf+2*head+1),
>> > + port->context_id);
>> > +
>> > /* Check the context/desc id for this
>> event matches */
>> > - GEM_DEBUG_BUG_ON(readl(buf + 2 * head +
>> 1) !=
>> > - port->context_id);
>> > + if (readl(buf + 2 * head + 1) !=
>> port->context_id) {
>> > + pr_err("%s: BUG CSB (%x head=%d,
>> tail=%d), ctx=%d, rq=%d\n",
>> > + engine->name,
>> > + readl(csb_mmio),
>> > + head, tail,
>> > +
>> readl(buf+2*head+1),
>> > + port->context_id);
>> > + BUG();
>> > + }
>> >
>> > rq = port_unpack(port, &count);
>> > GEM_BUG_ON(count == 0);
>> >
>> > Results in:
>> >
>> > [ 6423.006602] Resetting rcs0
>> > [ 6423.009080] rcs0: in (rq=fffffe70) ctx=1
>> > [ 6423.009216] rcs0: in (rq=fffffe6f) ctx=3
>> > [ 6423.009542] rcs0: out CSB (2 head=1, tail=2), ctx=3, rq=3
>> > [ 6423.009619] Resetting bcs0
>> > [ 6423.009980] rcs0: BUG CSB (0 head=1, tail=2), ctx=0, rq=3
>> >
>>
>> It took me a while, but I was able to replicate this (with your
>> igt_reset_active_engines) a couple of times. I also captured the
>> value
>> of the CSB events at that point and it looked like this.
>
> Ah, that's a separate issue that definitely isn't limited to bxt. In
> the
> bug on my machine the CSB is distinctly zeroed.
Oh, if you saw them being zeroed, then there's not much I can argue (do
you remember if the CSB write pointer was reset to '7' too?).
Hopefully is only limited to bxt, so sadly
Reviewed-by: Michel Thierry <michel.thierry at intel.com>
>
>
>> [ 55.134393] Resetting rcs0
>> [ 55.134747] bcs0: BUG CSB (3 head=1, tail=2), ctx=10, rq=4
>> [ 55.134755] bcs0: HWSP[16-17] Execlist CSB[0]: 0x00000018 _
>> 0x0000000a
>> [ 55.134759] bcs0: HWSP[18-19] Execlist CSB[1]: 0x00000012 _
>> 0x0000000a
>> [ 55.134762] bcs0: HWSP[20-21] Execlist CSB[2]: 0x00008002 _
>> 0x00000004
>> [ 55.134765] bcs0: HWSP[22-23] Execlist CSB[3]: 0x00000014 _
>> 0x00000004
>> [ 55.134767] bcs0: HWSP[24-25] Execlist CSB[4]: 0x00000018 _
>> 0x0000000a
>> [ 55.134770] bcs0: HWSP[26-27] Execlist CSB[5]: 0x00000001 _
>> 0x00000000
>>
>> The problem is ctx 10 finished in CSB[0] (ctx_complete &
>> active_to_idle), but then somehow CSB[1] has the same ctx 10 with
>> 'preempted' & ctx_complete.
>>
>> To make things worse, in CSB[2], the hw claims to have
>> lite-restored ctx 4.
>>
>> So it seems, we could ignore events like CSB[1], i.e. preempted &&
>> ctx_complete && !lite_restore.
>
> I think this bug is fixed in the series I sent last week, this is just
> the race in reset vs the tasklet.
I'll try with those patches.
Thanks,
>
> -Chris
More information about the Intel-gfx
mailing list