[Intel-gfx] [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
Chris Wilson
chris at chris-wilson.co.uk
Fri May 25 12:54:15 UTC 2018
Quoting Tvrtko Ursulin (2018-05-25 09:36:18)
>
> On 24/05/2018 14:40, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-05-24 14:34:41)
> >>
> >> On 19/05/2018 10:04, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2018-05-18 15:42:00)
> >>>>
> >>>> On 18/05/2018 15:13, Chris Wilson wrote:
> >>>>> Quoting Tvrtko Ursulin (2018-05-18 13:36:52)
> >>>>>>
> >>>>>> On 18/05/2018 13:28, Chris Wilson wrote:
> >>>>>>> Quoting Tvrtko Ursulin (2018-05-18 12:50:41)
> >>>>>>>>
> >>>>>>>> On 18/05/2018 12:10, Chris Wilson wrote:
> >>>>>>>>> Quoting Tvrtko Ursulin (2018-05-18 12:05:17)
> >>>>>>>>>>
> >>>>>>>>>> On 18/05/2018 11:09, Chris Wilson wrote:
> >>>>>>>>>>> Inside the live_hangcheck (reset) selftests, we occasionally see
> >>>>>>>>>>> failures like
> >>>>>>>>>>>
> >>>>>>>>>>> <7>[ 239.094840] i915_gem_set_wedged rcs0
> >>>>>>>>>>> <7>[ 239.094843] i915_gem_set_wedged current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
> >>>>>>>>>>> <7>[ 239.094846] i915_gem_set_wedged Reset count: 6239 (global 1)
> >>>>>>>>>>> <7>[ 239.094848] i915_gem_set_wedged Requests:
> >>>>>>>>>>> <7>[ 239.095052] i915_gem_set_wedged first 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> >>>>>>>>>>> <7>[ 239.095056] i915_gem_set_wedged last 19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
> >>>>>>>>>>> <7>[ 239.095059] i915_gem_set_wedged active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> >>>>>>>>>>> <7>[ 239.095062] i915_gem_set_wedged [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
> >>>>>>>>>>> <7>[ 239.100050] i915_gem_set_wedged ring->start: 0x00283000
> >>>>>>>>>>> <7>[ 239.100053] i915_gem_set_wedged ring->head: 0x000001f8
> >>>>>>>>>>> <7>[ 239.100055] i915_gem_set_wedged ring->tail: 0x000002a8
> >>>>>>>>>>> <7>[ 239.100057] i915_gem_set_wedged ring->emit: 0x000002a8
> >>>>>>>>>>> <7>[ 239.100059] i915_gem_set_wedged ring->space: 0x00000f10
> >>>>>>>>>>> <7>[ 239.100085] i915_gem_set_wedged RING_START: 0x00283000
> >>>>>>>>>>> <7>[ 239.100088] i915_gem_set_wedged RING_HEAD: 0x00000260
> >>>>>>>>>>> <7>[ 239.100091] i915_gem_set_wedged RING_TAIL: 0x000002a8
> >>>>>>>>>>> <7>[ 239.100094] i915_gem_set_wedged RING_CTL: 0x00000001
> >>>>>>>>>>> <7>[ 239.100097] i915_gem_set_wedged RING_MODE: 0x00000300 [idle]
> >>>>>>>>>>> <7>[ 239.100100] i915_gem_set_wedged RING_IMR: fffffefe
> >>>>>>>>>>> <7>[ 239.100104] i915_gem_set_wedged ACTHD: 0x00000000_0000609c
> >>>>>>>>>>> <7>[ 239.100108] i915_gem_set_wedged BBADDR: 0x00000000_0000609d
> >>>>>>>>>>> <7>[ 239.100111] i915_gem_set_wedged DMA_FADDR: 0x00000000_00283260
> >>>>>>>>>>> <7>[ 239.100114] i915_gem_set_wedged IPEIR: 0x00000000
> >>>>>>>>>>> <7>[ 239.100117] i915_gem_set_wedged IPEHR: 0x02800000
> >>>>>>>>>>> <7>[ 239.100120] i915_gem_set_wedged Execlist status: 0x00044052 00000002
> >>>>>>>>>>> <7>[ 239.100124] i915_gem_set_wedged Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
> >>>>>>>>>>> <7>[ 239.100128] i915_gem_set_wedged ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> >>>>>>>>>>> <7>[ 239.100132] i915_gem_set_wedged ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> >>>>>>>>>>> <7>[ 239.100135] i915_gem_set_wedged HW active? 0x5
> >>>>>>>>>>> <7>[ 239.100250] i915_gem_set_wedged E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> >>>>>>>>>>> <7>[ 239.100338] i915_gem_set_wedged E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> >>>>>>>>>>> <7>[ 239.100340] i915_gem_set_wedged Queue priority: 139
> >>>>>>>>>>> <7>[ 239.100343] i915_gem_set_wedged Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
> >>>>>>>>>>> <7>[ 239.100346] i915_gem_set_wedged Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
> >>>>>>>>>>> <7>[ 239.100349] i915_gem_set_wedged Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
> >>>>>>>>>>> <7>[ 239.100352] i915_gem_set_wedged Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
> >>>>>>>>>>> <7>[ 239.100356] i915_gem_set_wedged Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
> >>>>>>>>>>> <7>[ 239.100362] i915_gem_set_wedged drv_selftest [5894] waiting for 19a99
> >>>>>>>>>>>
> >>>>>>>>>>> where the GPU saw an arbitration point and idled; AND HAS NOT BEEN RESET!
> >>>>>>>>>>> The RING_MODE register indicates that it is idle with the STOP_RING bit
> >>>>>>>>>>> set, so try clearing it.
> >>>>>>>>>>>
> >>>>>>>>>>> v2: Only clear the bit on restarting the ring, as we want to be sure the
> >>>>>>>>>>> STOP_RING bit is kept if reset fails on wedging.
> >>>>>>>>>>> v3: Spot when the ring state doesn't make sense when re-initialising the
> >>>>>>>>>>> engine and dump it to the logs so that we don't have to wait for an
> >>>>>>>>>>> error later and try to guess what happened earlier.
> >>>>>>>>>>> v4: Prepare to print all the unexpected state, not just the first.
> >>>>>>>>>>>
> >>>>>>>>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >>>>>>>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com>
> >>>>>>>>>>> ---
> >>>>>>>>>>> drivers/gpu/drm/i915/intel_lrc.c | 22 ++++++++++++++++++++++
> >>>>>>>>>>> 1 file changed, 22 insertions(+)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> >>>>>>>>>>> index 3744f5750624..ba8411ba4abf 100644
> >>>>>>>>>>> --- a/drivers/gpu/drm/i915/intel_lrc.c
> >>>>>>>>>>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> >>>>>>>>>>> @@ -1781,6 +1781,9 @@ static void enable_execlists(struct intel_engine_cs *engine)
> >>>>>>>>>>> I915_WRITE(RING_MODE_GEN7(engine),
> >>>>>>>>>>> _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
> >>>>>>>>>>>
> >>>>>>>>>>> + I915_WRITE(RING_MI_MODE(engine->mmio_base),
> >>>>>>>>>>> + _MASKED_BIT_DISABLE(STOP_RING));
> >>>>>>>>>>
> >>>>>>>>>> It worries me a bit to clear it unconditionally, since the documentation
> >>>>>>>>>> says nothing (that I can find) about this scenario.
> >>>>>>>>>>
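
For reference, a conditional clear along the lines of that worry might look
something like this (untested sketch; the register and macro names are the
ones used in the patch above, with dev_priv assumed in scope as it is in
enable_execlists()):

	/* Hypothetical alternative: only clear STOP_RING when it is
	 * actually set, so that unexpected state is noticed rather
	 * than silently papered over.
	 */
	if (I915_READ(RING_MI_MODE(engine->mmio_base)) & STOP_RING) {
		DRM_DEBUG_DRIVER("%s: clearing stale STOP_RING\n",
				 engine->name);
		I915_WRITE(RING_MI_MODE(engine->mmio_base),
			   _MASKED_BIT_DISABLE(STOP_RING));
	}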
> >>>>>>>>>>> +
> >>>>>>>>>>> I915_WRITE(RING_HWS_PGA(engine->mmio_base),
> >>>>>>>>>>> engine->status_page.ggtt_offset);
> >>>>>>>>>>> POSTING_READ(RING_HWS_PGA(engine->mmio_base));
> >>>>>>>>>>> @@ -1789,6 +1792,19 @@ static void enable_execlists(struct intel_engine_cs *engine)
> >>>>>>>>>>> engine->execlists.csb_head = -1;
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> +static bool unexpected_starting_state(struct intel_engine_cs *engine)
> >>>>>>>>>>> +{
> >>>>>>>>>>> + struct drm_i915_private *dev_priv = engine->i915;
> >>>>>>>>>>> + bool unexpected = false;
> >>>>>>>>>>> +
> >>>>>>>>>>> + if (I915_READ(RING_MI_MODE(engine->mmio_base)) & STOP_RING) {
> >>>>>>>>>>> + DRM_DEBUG_DRIVER("STOP_RING still set in RING_MI_MODE\n");
> >>>>>>>>>>
> >>>>>>>>>> It also worries me to bury this as a debug message, unless we can find
> >>>>>>>>>> something in the documentation saying this is known to happen occasionally.
> >>>>>>>>>>
> >>>>>>>>>> If we had the check and logging immediately after reset, with a more
> >>>>>>>>>> visible log level, we could build a larger data set of reported failures
> >>>>>>>>>> and then go back to the hw people and ask them about it.
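
A sketch of that suggestion might look like the following (hypothetical:
the exact placement in the reset path and the choice of DRM_INFO for
visibility are assumptions, not part of the patch):

	/* Hypothetical: check MI_MODE immediately after the engine
	 * reset completes and log at an always-visible level, so that
	 * field reports build up a data set to take to the hw people.
	 */
	if (I915_READ(RING_MI_MODE(engine->mmio_base)) & STOP_RING)
		DRM_INFO("%s: STOP_RING still set in RING_MI_MODE after reset\n",
			 engine->name);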
> >>>>>>>>>
> >>>>>>>>> The problem with the reset side is trying to avoid mixing in the case
> >>>>>>>>> where we are faking the reset and expect the bit to still be set. On the
> >>>>>>>>> ringbuffer path, we handle STOP_RING as being part of the init sequence.
> >>>>>>>>> The log already says we did a reset and that STOP_RING is set (since we
> >>>>>>>>> have tracing through the reset); I just need to find the right spot to
> >>>>>>>>> handle it.
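
For comparison, the legacy ringbuffer path treats STOP_RING roughly like
this during init (a paraphrased sketch, not the verbatim driver code):

	/* Paraphrased: the ring is stopped explicitly before being
	 * reprogrammed, and STOP_RING is cleared again as part of
	 * bringing the engine back up.
	 */
	I915_WRITE(RING_MI_MODE(engine->mmio_base),
		   _MASKED_BIT_ENABLE(STOP_RING));	/* stop the ring */
	/* ... reprogram RING_START / RING_HEAD / RING_TAIL / RING_CTL ... */
	I915_WRITE(RING_MI_MODE(engine->mmio_base),
		   _MASKED_BIT_DISABLE(STOP_RING));	/* restart */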
> >>>>>>>>
> >>>>>>>> The "distance" in code paths between reset and init_hw is probably the
> >>>>>>>> smallest of my concerns. If we could make this message not debug only,
> >>>>>>>> and clear STOP_RING only conditionally I think I would be happy with
> >>>>>>>> giving this a go and see if it collects any data.
> >>>>>>>
> >>>>>>> Why? Who else but us is going to read it, and why bother if it is not
> >>>>>>> actionable? It is logged (it was logged already, and is now highlighted),
> >>>>>>> and for the user none of it is relevant; the only thing that matters to
> >>>>>>> them is that they don't even notice the system suffered a GPU hang.
> >>>>>>
> >>>>>> My reasoning is this - something has happened which is completely
> >>>>>> unexpected and may have other negative effects which are currently
> >>>>>> unknown to us. By logging it at a higher level we collect this now very
> >>>>>> specific signature on a wider range of platforms and can then act on the
> >>>>>> collected data set. Also, a very specific log message makes bug triage
> >>>>>> easier and lowers the chance of it being bunched up with other hang
> >>>>>> reports.
> >>>>>
> >>>>> You do have a very specific debug message, but a flip-flopping test that
> >>>>> has no impact on system stability (as evidenced by the lack of errors;
> >>>>> if there were a subsequent error, it is imperative that the post-mortem
> >>>>> capture does its best, since everything else is lost) does not help us.
> >>>>
> >>>> What if we deploy this as DRM_NOTICE/INFO, get reports from the field
> >>>> and not just the selftest, and then realize the issue is actually
> >>>> interesting and could be fed back to HW design?
> >>>
> >>> But we already have enough to point out that the HW is not behaving as
> >>> we^W I would expect.
> >>>
> >>> For field analysis, I strongly believe that unless you can capture
> >>> everything you need without user intervention, you will never get the
> >>> debug information required. The instructions we have are "oops,
> >>> something went wrong, please file a new bug report here and attach this
> >>> file". That's already too complicated wants to be automated, e.g. abrtd.
> >>> But for whatever reason we don't get the reports from abrtd.
> >>>
> >>> Elsewhere we use the method of light periodic checks which, if they spot
> >>> an issue, switch on a heavy debug mode. For something like this, where we
> >>> want a trace of operations as they happen, I think we need to be able to
> >>> turn on GEM tracing on the fly and to capture the compressed buffers.
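
As a purely illustrative sketch of that light-check/heavy-debug pattern
(every function name below is hypothetical):

	/* Hypothetical: a cheap periodic sanity check that escalates
	 * to expensive tracing only once something looks wrong.
	 */
	static void periodic_sanity_check(struct intel_engine_cs *engine)
	{
		if (likely(engine_state_ok(engine)))	/* light check */
			return;

		enable_gem_tracing(engine->i915);	/* heavy debug mode */
	}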
> >>
> >> Not sure that we need a trace of pre-events for this. It seems pretty
> >> straightforward to me: the register is not in the state it should be in
> >> immediately after reset.
> >
> > But you don't always have a reset, which makes throwing a tantrum here
> > difficult.
> >
> >> It will help sort the hang reports into a new bucket.
> >
> > The information is already in the hang report. What isn't there, but would
> > be useful, is the trace of events that precede certain hangs (the ones I
> > suspect are driver bugs as opposed to userspace GPU programming issues).
>
> It is there but takes more skill to triage.
It is largely pattern matching. I'm sure an AI could do a far better job
at finding discrepancies worth investigating. Add another item to the
wishlist.
> Okay, given that the check is not only after reset, and given how you do
> the lion's share of bug triaging anyway:
>
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
Sorry for holding out on trying not to parcel too much into this; thanks
for the review. Pushed, hoping that it truly does stop the flip-flops.
-Chris