[Intel-gfx] [PATCH] drm/i915: Always run hangcheck while the GPU is busy

Tue Jan 30 15:52:15 UTC 2018

Quoting Mika Kuoppala (2018-01-30 13:00:50)
> Chris Wilson <chris at chris-wilson.co.uk> writes:
> 
> > Quoting Mika Kuoppala (2018-01-30 12:18:17)
> >> Chris Wilson <chris at chris-wilson.co.uk> writes:
> >> 
> >> > Previously, we relied on only running the hangcheck while somebody was
> >> > waiting on the GPU, in order to minimise the amount of time hangcheck
> >> > had to run. (If nobody was watching the GPU, nobody would notice if the
> >> > GPU wasn't responding -- eventually somebody would care and so kick
> >> > hangcheck into action.) However, this falls apart from around commit
> >> > 4680816be336 ("drm/i915: Wait first for submission, before waiting for
> >> > request completion"), as not all waiters declare themselves to hangcheck
> >> > and so we could switch off hangcheck and miss GPU hangs even when
> >> > waiting under the struct_mutex.
> >> >
> >> > If we enable hangcheck from the first request submission, and let it run
> >> > until the GPU is idle again, we forgo all the complexity involved with
> >> > only enabling around waiters. Instead we have to be careful that we do
> >> > not declare a GPU hang when idly waiting for the next request to be come
> >> > ready.
> >> 
> >> For the complexity part I agree that this is simple and elegant. But
> >> I think I have not understood it fully as I don't connect the part where
> >> we need to be careful in idly waiting for next request.
> >> Could you elaborate and point it the relevant portion in the patch for it?
> >
> > It's not in this patch, it's just relating to the experiences we've had
> > previously in compensating for an engine with requests scheduled waiting
> > for a signal, making sure we treated those engines as idle rather than
> > stuck.
> 
> Ok. Perhaps the last sentence can be omitted then.

Just felt it was an important reminder that as we don't switch off as
often, hangcheck needs to be wary of frequently finding itself in
situations where the driver is busy but the engine is idle. We've fallen
foul of that in the past, a hint to [future] reviewers that we need to
double check all will be well.

> I tried to look if we somehow could miss an idle engine check
> and declare a false hang if we somehow end up doing a check on
> a just idled hardware.
> 
> Could not find a clear way that would happen but as
> the gt.awake is now a master, should it be first thing we
> check in intel_engine_is_idle() to limit how far
> we look into the rabbit hole?

The other argument being that the user parameter trumps gt.awake.
I think the order is reasonably clear atm. What I'm not completely sold
on is the behaviour around wedging -- the requeue should be from unwedge
in that case and not inside i915_reset(). But I didn't look at all the
implications of moving that i915_queue_hangcheck() and so left it alone
for now.
-Chris