[Intel-gfx] [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+

Tue Mar 19 17:52:51 UTC 2019

On Tue, 2019-03-19 at 12:46 +0000, Tvrtko Ursulin wrote:
> On 19/03/2019 12:39, Tvrtko Ursulin wrote:
> > 
> > On 18/03/2019 00:15, Carlos Santa wrote:
> > > On Mon, 2019-03-11 at 10:39 +0000, Tvrtko Ursulin wrote:
> > > > On 08/03/2019 03:16, Carlos Santa wrote:
> > > > > On Fri, 2019-03-01 at 09:36 +0000, Chris Wilson wrote:
> > > > > > > 
> > > > > > 
> > > > > > Quoting Carlos Santa (2019-02-21 02:58:16)
> > > > > > > +#define GEN8_WATCHDOG_1000US(dev_priv)
> > > > > > > watchdog_to_clock_counts(dev_priv, 1000)
> > > > > > > +static void gen8_watchdog_irq_handler(unsigned long
> > > > > > > data)
> > > > > > > +{
> > > > > > > +       struct intel_engine_cs *engine = (struct
> > > > > > > intel_engine_cs
> > > > > > > *)data;
> > > > > > > +       struct drm_i915_private *dev_priv = engine->i915;
> > > > > > > +       unsigned int hung = 0;
> > > > > > > +       u32 current_seqno=0;
> > > > > > > +       char msg[80];
> > > > > > > +       unsigned int tmp;
> > > > > > > +       int len;
> > > > > > > +
> > > > > > > +       /* Stop the counter to prevent further timeout
> > > > > > > interrupts
> > > > > > > */
> > > > > > > +       I915_WRITE_FW(RING_CNTR(engine->mmio_base),
> > > > > > > get_watchdog_disable(engine));
> > > > > > > +
> > > > > > > +       /* Read the heartbeat seqno once again to check
> > > > > > > if we
> > > > > > > are
> > > > > > > stuck? */
> > > > > > > +       current_seqno =
> > > > > > > intel_engine_get_hangcheck_seqno(engine);
> > > > > > 
> > > > > > I have said this before, but this doesn't exist either,
> > > > > > it's just
> > > > > > a
> > > > > > temporary glitch in the matrix.
> > > > > > 
> > > > > 
> > > > > Chris, Tvrtko, I need some guidance on how to find the quilty
> > > > > seqno
> > > > > during a hang, can you please advice here what to do?
> > > > 
> > > > When an interrupt fires you need to ascertain whether the same
> > > > request
> > > > which enabled the watchdog is running, correct?
> > > > 
> > > > So I think you would need this, with a disclaimer that I
> > > > haven't
> > > > thought
> > > > about the details really:
> > > > 
> > > > 1. Take a reference to timeline hwsp when setting up the
> > > > watchdog for
> > > > a
> > > > request.
> > > > 
> > > > 2. Store the initial seqno associated with this request.
> > > > 
> > > > 3. Force enable user interrupts.
> > > > 
> > > > 4. When timeout fires, inspect the HWSP seqno to see if the
> > > > request
> > > > completed or not.
> > > > 
> > > > 5. Reset the engine if not completed.
> > > > 
> > > > 6. Put the timeline/hwsp reference.
> > > 
> > > 
> > > static int gen8_emit_bb_start(struct i915_request *rq,
> > >                             u64 offset, u32
> > > len,
> > >                             const unsigned
> > > int flags)
> > > {
> > >     struct i915_timeline *tl;
> > >     u32 seqno;
> > > 
> > >     if (enable_watchdog) {
> > >         /* Start watchdog timer */
> > >         cs = gen8_emit_start_watchdog(rq, cs);
> > >         tl = ce->ring->timeline;
> > >         i915_timeline_get_seqno(tl, rq, &seqno);
> > >         /*Store initial hwsp seqno associated with this request
> > >         engine->watchdog_hwsp_seqno = tl->hwsp_seqno;
> > 
> > You should not need to allocate a new seqno and also having
> > something 
> > stored per engine does not make clear how will you solve out of
> > order.

Understood, I missed that there's a convenience pointer available to us
per request (i.e.,  *hwsp_seqno). On step #1 above you have said to
take a reference to the timeline so I was trying to make a link between
the timeline and the seqno but if the request comes already with a
convenience pointer then we may not need the timeline after all...

However, on v4 of the series I was using
intel_engine_get_hangcheck_seqno(engine) for this purpose, and even
though Chris was against it, I saw that it landed recently on the
tree...

> > 
> > Maybe you just set up the timer, then lets see below..

I think you're suggesting simply not to bother checking for the guilty
seqno in the tasklet and simply reset...

> > 
> > Also, are you not trying to do the software implementation to start
> > with?

Trying to keep it simple with just the h/w timers for now... adding
front/back end to accomodate the s/w timers will just muddy the waters?
Will get to it once we agree on what to do here...

> > 
> > >     }
> > > 
> > > }
> > > 
> > > static void gen8_watchdog_tasklet(unsigned long data)
> > > {
> > >         struct i915_request *rq;
> > > 
> > >         rq = intel_engine_find_active_request(engine);
> > > 
> > >         /* Inspect the watchdog seqno once again for
> > > completion? */
> > >         if (!i915_seqno_passed(engine->watchdog_hwsp_seqno, rq-
> > > > fence.seqno)) {
> > > 
> > >             //Reset Engine
> > >         }
> > > }
> > 
> > What happens if you simply reset without checking anything? You
> > know hw 
> > timer wouldn't have fired if the context wasn't running, correct?

Need to verify this by running some tests then...

> > 
> > (Ignoring the race condition between interrupt raised -> hw
> > interrupt 
> > delivered -> serviced -> tasklet scheduled -> tasklet running.
> > Which may 
> > mean request has completed in the meantime and you reset the engine
> > for 
> > nothing. But this is probably not 100% solvable.)
> 
> Good idea would be to write some tests to exercise some normal and
> more 
> edge case scenarios like coalesced requests, preemption etc.
> Checking 
> which request got reset etc.

Ok, need to try some test cases then.

Regards,
Carlos

> 
> Regards,
> 
> Tvrtko
> 
> > Regards,
> > 
> > Tvrtko
> > 
> > > Tvrtko, is the above acceptable to inspect whether the seqno has
> > > completed?
> > > 
> > > I noticed there's a helper function i915_request_completed(struct
> > > i915_request *rq) but it will require me to modify it in order to
> > > pass
> > > 2 different seqnos.
> > > 
> > > Regards,
> > > Carlos
> > > 
> > > > 
> > > > If the user interrupt fires with the request completed cancel
> > > > the
> > > > above
> > > > operations.
> > > > 
> > > > There could be an inherent race between inspecting the seqno
> > > > and
> > > > deciding to reset. Not sure at the moment what to do. Maybe
> > > > just call
> > > > it
> > > > bad luck?
> > > > 
> > > > I also think for the software implementation you need to force
> > > > no
> > > > request coalescing for contexts with timeout set. Because you
> > > > want
> > > > to
> > > > have 100% defined borders for request in and out - since the
> > > > timeout
> > > > is
> > > > defined per request.
> > > > 
> > > > In this case you don't need the user interrupt for the trailing
> > > > edge
> > > > signal but can use context complete. Maybe putting hooks into
> > > > context_in/out in intel_lrc.c would work under these
> > > > circumstances.
> > > > 
> > > > Also if preempted you need to cancel the timer setup and store
> > > > elapsed
> > > > execution time.
> > > > 
> > > > Or it may make sense to just disable preemption for these
> > > > contexts.
> > > > Otherwise there is no point in trying to mandate the timeout?
> > > > 
> > > > But it is also kind of bad since non-privileged contexts can
> > > > make
> > > > themselves non-preemptable by setting the watchdog timeout.
> > > > 
> > > > Maybe as a compromise we need to automatically apply an
> > > > elevated
> > > > priority level, but not as high to be completely non-
> > > > preemptable.
> > > > Sounds
> > > > like a hard question.
> > > > 
> > > > Regards,
> > > > 
> > > > Tvrtko
> > > 
> > >