[Intel-gfx] [PATCH] drm/i915: Always run hangcheck while the GPU is busy
Mika Kuoppala
mika.kuoppala at linux.intel.com
Tue Jan 30 13:00:50 UTC 2018
Chris Wilson <chris at chris-wilson.co.uk> writes:
> Quoting Mika Kuoppala (2018-01-30 12:18:17)
>> Chris Wilson <chris at chris-wilson.co.uk> writes:
>>
>> > Previously, we relied on only running the hangcheck while somebody was
>> > waiting on the GPU, in order to minimise the amount of time hangcheck
>> > had to run. (If nobody was watching the GPU, nobody would notice if the
>> > GPU wasn't responding -- eventually somebody would care and so kick
>> > hangcheck into action.) However, this falls apart from around commit
>> > 4680816be336 ("drm/i915: Wait first for submission, before waiting for
>> > request completion"), as not all waiters declare themselves to hangcheck
>> > and so we could switch off hangcheck and miss GPU hangs even when
>> > waiting under the struct_mutex.
>> >
>> > If we enable hangcheck from the first request submission, and let it run
>> > until the GPU is idle again, we forgo all the complexity involved with
>> > only enabling around waiters. Instead we have to be careful that we do
>> > not declare a GPU hang when idly waiting for the next request to become
>> > ready.
>>
>> For the complexity part I agree that this is simple and elegant. But
>> I think I have not understood it fully as I don't connect the part where
>> we need to be careful in idly waiting for next request.
>> Could you elaborate and point to the relevant portion of the patch?
>
> It's not in this patch, it's just relating to the experiences we've had
> previously in compensating for an engine with requests scheduled waiting
> for a signal, making sure we treated those engines as idle rather than
> stuck.
Ok. Perhaps the last sentence can be omitted then.
I tried to see whether we could miss an idle engine check and
declare a false hang by running a check on hardware that has
only just idled.
I could not find a clear way for that to happen, but since
gt.awake is now the master, should it be the first thing we
check in intel_engine_is_idle(), to limit how far
we look into the rabbit hole?
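
As a rough sketch of that ordering (a toy model only, not the i915 code:
struct gt_model, struct engine_model, the seqno fields and the deep_checks
counter are all hypothetical stand-ins, and the assumption is that
gt.awake is true exactly while the GT is unparked):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the device-wide state; not the real i915 struct. */
struct gt_model {
	bool awake;	/* models gt.awake; assumed true while the GT is unparked */
};

/* Hypothetical stand-in for an engine; the fields are illustrative only. */
struct engine_model {
	const struct gt_model *gt;
	unsigned int last_submitted_seqno;
	unsigned int last_completed_seqno;
	int deep_checks;	/* counts how often the per-engine checks ran */
};

/*
 * Idle check that consults the master flag first. If the GT is not awake,
 * the engine is reported idle without looking at any per-engine state.
 */
static bool engine_is_idle(struct engine_model *engine)
{
	assert(engine != NULL && engine->gt != NULL);

	/* Early exit: a parked GT is treated as having no busy engine. */
	if (!engine->gt->awake)
		return true;

	/* Only now look further "down the rabbit hole". */
	engine->deep_checks++;
	return engine->last_completed_seqno == engine->last_submitted_seqno;
}
```

With the flag clear, the function returns true and deep_checks stays at
zero; the per-engine comparison only runs once the flag is set.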
-Mika
> -Chris