[Intel-gfx] [PATCH v6 16/20] drm/i915: Watchdog timeout: IRQ handler for gen8+

Wed Apr 19 18:13:15 UTC 2017

On 19/04/17 10:51, Chris Wilson wrote:
> On Wed, Apr 19, 2017 at 10:11:37AM -0700, Michel Thierry wrote:
>>
>>
>> On 19/04/17 03:20, Chris Wilson wrote:
>>> On Tue, Apr 18, 2017 at 01:23:31PM -0700, Michel Thierry wrote:
>>>> *** General ***
>>>>
>>>> Watchdog timeout (or "media engine reset") is a feature that allows
>>>> userland applications to enable hang detection on individual batch buffers.
>>>> The detection mechanism itself is mostly bound to the hardware and the only
>>>> thing that the driver needs to do to support this form of hang detection
>>>> is to implement the interrupt handling support as well as watchdog command
>>>> emission before and after the emitted batch buffer start instruction in the
>>>> ring buffer.
>>>>
>>>> The principle of the hang detection mechanism is as follows:
>>>>
>>>> 1. Once the decision has been made to enable watchdog timeout for a
>>>> particular batch buffer and the driver is in the process of emitting the
>>>> batch buffer start instruction into the ring buffer it also emits a
>>>> watchdog timer start instruction before and a watchdog timer cancellation
>>>> instruction after the batch buffer start instruction in the ring buffer.
>>>>
>>>> 2. Once the GPU execution reaches the watchdog timer start instruction
>>>> the hardware watchdog counter is started by the hardware. The counter
>>>> keeps counting until either reaching a previously configured threshold
>>>> value or the timer cancellation instruction is executed.
>>>>
>>>> 2a. If the counter reaches the threshold value the hardware fires a
>>>> watchdog interrupt that is picked up by the watchdog interrupt handler.
>>>> This means that a hang has been detected and the driver needs to deal with
>>>> it the same way it would deal with an engine hang detected by the periodic
>>>> hang checker. The only difference between the two is that we already blamed
>>>> the active request (to ensure an engine reset).
>>>>
>>>> 2b. If the batch buffer completes and the execution reaches the watchdog
>>>> cancellation instruction before the watchdog counter reaches its
>>>> threshold value the watchdog is cancelled and nothing more comes of it.
>>>> No hang is detected.
>>>>
>>>> Note about future interaction with preemption: Preemption could happen
>>>> in a command sequence prior to watchdog counter getting disabled,
>>>> resulting in watchdog being triggered following preemption. The driver will
>>>> need to explicitly disable the watchdog counter as part of the
>>>> preemption sequence.
>>>
>>> Does MI_ARB_ON_OFF do the trick? Shouldn't we basically be only turning
>>> preemption on for the user buffers as it just causes hassle if we allow
>>> preemption in our preamble + breadcrumb. (And there's little point in
>>> preempting in the flushes.)
>>>
>>
>> Mid-batch?
>> The watchdog counter is not aware of MI_ARB_ON_OFF (or any other
>> cmd) and would keep running / expire. We could call
>> emit_stop_watchdog unconditionally to prevent this.
>
> No, I was thinking of the opposite where we had preemption after the
> batch. Completely missed the point of the watchdog being enabled for the
> low priority batch then being inherited by the high priority batch - and
> vice versa that the watchdog counter would not be restored on the
> context switch back. Does suggest that the watchdog should really be
> part of the context image...

RING_CNTR (0x2178) & RING_THRESH (0x217c) are part of the context image, 
but there's still the issue of the ctx restore being slower (or maybe 
it's a lite-restore).

And the 'counter' isn't part of the image; when the pre-empted batch 
resumes, the counter will re-start from 0.
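
To make the consequence concrete, here is a rough toy model of that behaviour in plain C. It is not driver code: every name in it (wd_engine, wd_ctx_image, wd_ctx_switch, and so on) is hypothetical, and it simply assumes that the control/threshold state travels with the context image while the live count does not. Under that assumption, a batch that is preempted part-way gets a fresh budget when it resumes, so its total run time can exceed the threshold without the watchdog firing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model only; none of these names exist in i915. */

/* State assumed to be saved/restored with the context image. */
struct wd_ctx_image {
	bool     enabled;    /* stands in for RING_CNTR (0x2178) */
	uint32_t threshold;  /* stands in for RING_THRESH (0x217c) */
};

struct wd_engine {
	struct wd_ctx_image regs;
	uint64_t counter;    /* live count, assumed NOT part of the image */
};

/* Watchdog start command emitted before the batch buffer start. */
static void wd_start(struct wd_engine *e, uint32_t threshold)
{
	assert(threshold > 0);
	e->regs.enabled = true;
	e->regs.threshold = threshold;
	e->counter = 0;
}

/* Watchdog cancel command emitted after the batch buffer start. */
static void wd_cancel(struct wd_engine *e)
{
	e->regs.enabled = false;
	e->counter = 0;
}

/* Advance time; returns true if the watchdog interrupt would fire. */
static bool wd_tick(struct wd_engine *e, uint32_t ticks)
{
	if (!e->regs.enabled)
		return false;
	e->counter += ticks;
	return e->counter >= e->regs.threshold;
}

/* Context switch: save outgoing regs, load incoming regs, count restarts. */
static void wd_ctx_switch(struct wd_engine *e, struct wd_ctx_image *out,
			  const struct wd_ctx_image *in)
{
	*out = e->regs;
	e->regs = *in;
	e->counter = 0;
}

/*
 * Run a watched batch for 'before' ticks, preempt it with a context that
 * has no watchdog armed, then resume it for 'after' more ticks.
 * Returns true if the watchdog would have expired at any point.
 */
static bool wd_preempted_batch_expires(uint32_t threshold, uint32_t before,
				       uint32_t after)
{
	struct wd_engine e = { { false, 0 }, 0 };
	struct wd_ctx_image low;
	struct wd_ctx_image high = { false, 0 };
	bool expired;

	wd_start(&e, threshold);
	if (wd_tick(&e, before))
		return true;

	wd_ctx_switch(&e, &low, &high);   /* preempted by high priority */
	(void)wd_tick(&e, 1000);          /* high-priority work, unwatched */
	wd_ctx_switch(&e, &high, &low);   /* resume: count starts from 0 */

	expired = wd_tick(&e, after);
	if (!expired)
		wd_cancel(&e);            /* batch completed in time */
	return expired;
}
```

With a threshold of 100, wd_preempted_batch_expires(100, 60, 60) reports no expiry even though the batch ran for 120 ticks in total, whereas wd_preempted_batch_expires(100, 60, 100) does expire because the post-resume slice alone reaches the threshold.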