[Intel-gfx] [PATCH] drm/i915: Restore the wait for idle engine after flushing interrupts

Fri Nov 10 12:06:59 UTC 2017

Quoting Mika Kuoppala (2017-11-10 12:00:38)
> Chris Wilson <chris at chris-wilson.co.uk> writes:
> 
> > So it appears that commit 5427f207852d ("drm/i915: Bump wait-times for
> > the final CS interrupt before parking") was a little overoptimistic in
> > its belief that it had successfully waited for all residual activity on
> > the engines before parking. Numerous sightings in CI since then of
> >
> > <7>[   52.542886] [IGT] core_auth: executing
> > <3>[   52.561013] [drm:intel_engines_park [i915]] *ERROR* vcs0 is not idle before parking
> > <7>[   52.561215] intel_engines_park vcs0
> > <7>[   52.561229] intel_engines_park  current seqno 98, last 98, hangcheck 0 [-247449 ms], inflight 0
> > <7>[   52.561238] intel_engines_park  Reset count: 0
> > <7>[   52.561266] intel_engines_park  Requests:
> > <7>[   52.561363] intel_engines_park  RING_START: 0x00000000 [0x00000000]
> > <7>[   52.561377] intel_engines_park  RING_HEAD:  0x00000000 [0x00000000]
> > <7>[   52.561390] intel_engines_park  RING_TAIL:  0x00000000 [0x00000000]
> > <7>[   52.561406] intel_engines_park  RING_CTL:   0x00000000
> > <7>[   52.561422] intel_engines_park  RING_MODE:  0x00000200 [idle]
> > <7>[   52.561442] intel_engines_park  ACTHD:  0x00000000_00000000
> > <7>[   52.561459] intel_engines_park  BBADDR: 0x00000000_00000000
> > <7>[   52.561474] intel_engines_park  Execlist status: 0x00000301 00000000
> > <7>[   52.561489] intel_engines_park  Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no
> > <7>[   52.561500] intel_engines_park          ELSP[0] idle
> > <7>[   52.561510] intel_engines_park          ELSP[1] idle
> > <7>[   52.561519] intel_engines_park          HW active? 0x0
> > <7>[   52.561608] intel_engines_park Idle? yes
> > <7>[   52.561617] intel_engines_park
> >
> > on Braswell, which indicates that the engine just needs that little bit
> > longer after flushing the tasklet to settle. So give it a few more
> > milliseconds before declaring an emergency and applying the emergency
> > brake.
> >
> 
> Because the print above indicates that it did go idle straight
> afterwards?

Indeed.

> Just pondering here what the key non-idleness was that
> led to this. What raced?

Still pondering myself. This is basically to shut CI up so the random
failures stop causing confusion with false positives.

My guess is that it is a residual ELSP tasklet and the ring registers
taking a moment to idle. But I am not sure: on the way here we gave it
long enough to settle with no new work coming in, so it should not need
another 10ms on top of the 200ms it already had!
-Chris
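
For readers following the thread, the idea under discussion (poll the
engine for a bounded extra period before declaring it stuck) can be
sketched in userspace C. This is only an illustration under stated
assumptions, not the i915 code: struct fake_engine, fake_engine_is_idle()
and wait_for_idle() are hypothetical names invented for this sketch, and
the "engine" is just a counter that reports busy for a few polls.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for an engine; not the i915 structure. */
struct fake_engine {
	int polls_until_idle;	/* number of polls that still report busy */
};

/* Hypothetical idle check: reports busy until the counter runs out. */
static bool fake_engine_is_idle(struct fake_engine *e)
{
	if (e->polls_until_idle > 0) {
		e->polls_until_idle--;
		return false;
	}
	return true;
}

/* Milliseconds elapsed on the monotonic clock since *start. */
static long elapsed_ms(const struct timespec *start)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return (now.tv_sec - start->tv_sec) * 1000L +
	       (now.tv_nsec - start->tv_nsec) / 1000000L;
}

/*
 * Poll until the engine reports idle or timeout_ms has elapsed.
 * The idle check always runs before the timeout check, so the engine
 * gets one last look after the final nap. Returns true if it settled.
 */
static bool wait_for_idle(struct fake_engine *e, long timeout_ms)
{
	const struct timespec nap = { 0, 1000000L };	/* 1ms between polls */
	struct timespec start;

	assert(timeout_ms >= 0);
	clock_gettime(CLOCK_MONOTONIC, &start);

	for (;;) {
		if (fake_engine_is_idle(e))
			return true;
		if (elapsed_ms(&start) >= timeout_ms)
			return false;
		nanosleep(&nap, NULL);
	}
}
```

A caller would treat a false return as the "emergency" case and only
then report the engine as not idle; an engine that merely needs a few
more polls to settle returns true and no error is logged.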