[Intel-gfx] [PATCH] drm/i915/execlists: Double check breadcrumb before crying foul

Tue Mar 31 08:52:21 UTC 2020

Chris Wilson <chris at chris-wilson.co.uk> writes:

> Quoting Mika Kuoppala (2020-03-31 08:50:54)
>> Chris Wilson <chris at chris-wilson.co.uk> writes:
>> 
>> >   process_csb: 0000:00:02.0 bcs0: cs-irq head=4, tail=5
>> >   process_csb: 0000:00:02.0 bcs0: csb[5]: status=0x00008002:0x60000020
>> >   trace_ports: 0000:00:02.0 bcs0: preempted { ff84:45154! prio 2 }
>> >   trace_ports: 0000:00:02.0 bcs0: promote { ff84:45155* prio 2 }
>> >   trace_ports: 0000:00:02.0 bcs0: submit { ff84:45156 prio 2 }
>> >
>> >   process_csb: 0000:00:02.0 bcs0: cs-irq head=5, tail=6
>> >   process_csb: 0000:00:02.0 bcs0: csb[6]: status=0x00000018:0x60000020
>> >   trace_ports: 0000:00:02.0 bcs0: completed { ff84:45155* prio 2 }
>> >   process_csb: 0000:00:02.0 bcs0: ring:{start:0x00178000, head:0928, tail:0928, ctl:00000000, mode:00000200}
>> >   process_csb: 0000:00:02.0 bcs0: rq:{start:00178000, head:08b0, tail:08f0, seqno:ff84:45155, hwsp:45156},
>> >   process_csb: 0000:00:02.0 bcs0: ctx:{start:00178000, head:e000928, tail:0928},
>> >   process_csb: GEM_BUG_ON("context completed before request")
>> >
>> > In this sequence, we can see that although we have submitted the next
>> > request [ff84:45156] to HW (via ELSP[]) it has not yet reported the
>> > lite-restore. Instead, we see the completion event of the currently
>> > active request [ff85:45155] but at the time of processing that event,
>> > the breadcrumb has not yet been written. Though by the time we do print
>> > out the debug info, the seqno write of ff85:45156 has landed!
>> 
>> I see it.
>> 
>> But I have now a more generic confusion about the sequence:
>> why would the same context preempt itself?
>
> lite-restore continuations.
>
>> 
>> >   trace_ports: 0000:00:02.0 bcs0: preempted { ff84:45154! prio 2 }
>> >   trace_ports: 0000:00:02.0 bcs0: promote { ff84:45155* prio 2 }
>> 
>> >
>> > Therefore there is a serialisation problem between the seqno writes and
>> > CS events, not just between the CS buffer and its head/tail pointers as
>> > previously observed on Icelake.
>> >
>> > This is not a huge problem, as we don't strictly rely on the breadcrumb
>> > to determine HW activity, but it may indicate that interrupt delivery is
>> > before the seqno write, aka bringing back the plague of missed
>> > interrupts from yesteryear. However, there is no indication of this
>> > wider problem, so let's just flush the seqno read before reporting an
>> > error. If it persists after the fresh read we can worry again.
>> 
>> Well you are preempting part of my concerns.
>> 
>> Still you will have a serialision point now only with debug builds.
>
> We only check in debug.
>
>> So I assume the rest of the request completion flow is ambivalent of
>> intr seqno coherency and we can cope with this is only for debug...
>
> Right.
>  
>> Being in generic read hwsp path would be expensive and thus overkill?
>
> Very expensive. There's no evidence that the GPU is that broken that it
> returns garbage in a *cached* CPU read.
>  
>> Could we lift the flushing to the non show_debug path to be always
>> part of process_csb cycle?
>
> No. What would be the point? The breadcrumb is not part of the CS flow.
> The debug is our sanitycheck that the two events are ordered, but that
> is not a strict requirement of the code. If it fails this test, the
> presumption is that either the GPU is very dead (e.g. the GO problem,
> RING_HEAD/TAIL issues, a variety of other means of killing the gpu) and
> the problem will remain, or that we have a programming error, which
> again will not resolve itself as we print the debug.

Agreed that there is no point if it is for forcing a fresh
read only for debug. Concern was that it affects the
validity for other reads.

But you amended that on the commit message, preemptively.
So,
Reviewed-by: Mika Kuoppala <mika.kuoppala at linux.intel.com>