[Intel-gfx] [PATCH v6 16/20] drm/i915: Watchdog timeout: IRQ handler for gen8+

Michel Thierry michel.thierry at intel.com
Wed Apr 19 17:11:37 UTC 2017



On 19/04/17 03:20, Chris Wilson wrote:
> On Tue, Apr 18, 2017 at 01:23:31PM -0700, Michel Thierry wrote:
>> *** General ***
>>
>> Watchdog timeout (or "media engine reset") is a feature that allows
>> userland applications to enable hang detection on individual batch buffers.
>> The detection mechanism itself is mostly bound to the hardware and the only
>> thing that the driver needs to do to support this form of hang detection
>> is to implement the interrupt handling support as well as watchdog command
>> emission before and after the emitted batch buffer start instruction in the
>> ring buffer.
>>
>> The principle of the hang detection mechanism is as follows:
>>
>> 1. Once the decision has been made to enable watchdog timeout for a
>> particular batch buffer and the driver is in the process of emitting the
>> batch buffer start instruction into the ring buffer it also emits a
>> watchdog timer start instruction before and a watchdog timer cancellation
>> instruction after the batch buffer start instruction in the ring buffer.
>>
>> 2. Once the GPU execution reaches the watchdog timer start instruction
>> the hardware watchdog counter is started by the hardware. The counter
>> keeps counting until either reaching a previously configured threshold
>> value or the timer cancellation instruction is executed.
>>
>> 2a. If the counter reaches the threshold value the hardware fires a
>> watchdog interrupt that is picked up by the watchdog interrupt handler.
>> This means that a hang has been detected and the driver needs to deal with
>> it the same way it would deal with an engine hang detected by the periodic
>> hang checker. The only difference between the two is that we already blamed
>> the active request (to ensure an engine reset).
>>
>> 2b. If the batch buffer completes and the execution reaches the watchdog
>> cancellation instruction before the watchdog counter reaches its
>> threshold value the watchdog is cancelled and nothing more comes of it.
>> No hang is detected.
>>
>> Note about future interaction with preemption: Preemption could happen
>> in a command sequence before the watchdog counter is disabled,
>> resulting in the watchdog being triggered following preemption. The driver will
>> need to explicitly disable the watchdog counter as part of the
>> preemption sequence.
>
> Does MI_ARB_ON_OFF do the trick? Shouldn't we basically be only turning
> preemption on for the user buffers as it just causes hassle if we allow
> preemption in our preamble + breadcrumb? (And there's little point in
> preempting in the flushes.)
>

Mid-batch?
The watchdog counter is not aware of MI_ARB_ON_OFF (or any other cmd) 
and would keep running / expire. We could call emit_stop_watchdog 
unconditionally to prevent this.
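
To illustrate the point, here is a toy model (not driver code) of a command stream in which the watchdog counter keeps advancing regardless of arbitration commands, and only an explicit stop command disarms it. The opcode names, the one-tick-per-command counting and the function `toy_watchdog_expires` are all assumptions made up for this sketch; none of them are real i915 encodings or helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for a toy command stream; not real i915 encodings. */
enum toy_cmd {
	TOY_WATCHDOG_START,
	TOY_WATCHDOG_STOP,
	TOY_ARB_OFF,	/* stands in for MI_ARB_ON_OFF disabling arbitration */
	TOY_ARB_ON,
	TOY_WORK,	/* one tick of batch execution */
};

/*
 * Walk the stream, advancing the counter by one for every command executed
 * while the watchdog is armed. Returns true if the counter reaches the
 * threshold, i.e. a watchdog interrupt would have fired.
 */
static bool toy_watchdog_expires(const enum toy_cmd *cmds, size_t n,
				 uint32_t threshold)
{
	bool armed = false;
	uint32_t counter = 0;
	size_t i;

	assert(threshold > 0);

	for (i = 0; i < n; i++) {
		switch (cmds[i]) {
		case TOY_WATCHDOG_START:
			armed = true;
			counter = 0;
			break;
		case TOY_WATCHDOG_STOP:
			armed = false;
			break;
		default:
			/* Arbitration on/off and work never touch the counter. */
			break;
		}

		if (armed && ++counter >= threshold)
			return true;
	}

	return false;
}
```

In this model a stream of START, WORK, ARB_OFF followed by more work still expires, while the same stream with a STOP emitted before ARB_OFF does not, which is the argument for emitting the stop unconditionally.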

>> *** This patch introduces: ***
>>
>> 1. IRQ handler code for watchdog timeout allowing direct hang recovery
>> based on hardware-driven hang detection, which then integrates directly
>> with the hang recovery path. This is independent of having per-engine reset
>> or just full gpu reset.
>>
>> 2. Watchdog specific register information.
>>
>> Currently the render engine and all available media engines support
>> watchdog timeout (VECS is only supported in GEN9). The specifications allude
>> to the BCS engine being supported, but this commit does not support it.
>>
>> Note that the value to stop the counter is different between render and
>> non-render engines in GEN8; GEN9 onwards it's the same.
>
> Should mention the choice to piggyback the current hangcheck + capture
> scheme.
>
>> +	if (iir & (GT_GEN8_WATCHDOG_INTERRUPT << test_shift)) {
>> +		tasklet_schedule(&engine->watchdog_tasklet);
>> +	}
>
> Kill unwanted braces.
>
>> +#define GEN8_WATCHDOG_1000US 0x2ee0 //XXX: Temp, replace with helper function
>> +static void gen8_watchdog_irq_handler(unsigned long data)
>> +{
>> +	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
>> +	struct drm_i915_private *dev_priv = engine->i915;
>> +	u32 current_seqno;
>> +
>> +	intel_uncore_forcewake_get(dev_priv, engine->fw_domains);
>> +
>> +	/* Stop the counter to prevent further timeout interrupts */
>> +	I915_WRITE_FW(RING_CNTR(engine->mmio_base), get_watchdog_disable(engine));
>> +
>> +	current_seqno = intel_engine_get_seqno(engine);
>> +
>> +	/* did the request complete after the timer expired? */
>> +	if (intel_engine_last_submit(engine) == current_seqno)
>> +		goto fw_put;
>> +
>> +	if (engine->hangcheck.watchdog == current_seqno) {
>> +		/* Make sure the active request will be marked as guilty */
>> +		engine->hangcheck.stalled = true;
>> +		engine->hangcheck.seqno = intel_engine_get_seqno(engine);
>
> Use current_seqno again. intel_engine_get_seqno() may have just changed.
> -Chris
>
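
A minimal sketch of the single-snapshot pattern suggested above, using a toy engine whose seqno advances after every read to mimic a request completing between two reads. `struct toy_engine`, `toy_get_seqno` and `toy_watchdog_handler` are hypothetical stand-ins invented for this example; they are not the i915 structures or the final handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an engine whose seqno moves behind our back. */
struct toy_engine {
	uint32_t hw_seqno;	 /* last seqno written by the "hardware" */
	uint32_t last_submit;	 /* last seqno submitted to the engine */
	uint32_t watchdog_seqno; /* seqno the watchdog was armed for */
	uint32_t blamed_seqno;	 /* seqno the handler records as hung */
	bool stalled;
};

/* Returns the current seqno, then advances it to simulate progress. */
static uint32_t toy_get_seqno(struct toy_engine *e)
{
	return e->hw_seqno++;
}

static void toy_watchdog_handler(struct toy_engine *e)
{
	uint32_t current_seqno;

	assert(e);

	/* Read the seqno exactly once and use that snapshot throughout. */
	current_seqno = toy_get_seqno(e);

	/* Did the request complete after the timer expired? */
	if (e->last_submit == current_seqno)
		return;

	if (e->watchdog_seqno == current_seqno) {
		e->stalled = true;
		/* Reuse the snapshot; a second read could return a newer value. */
		e->blamed_seqno = current_seqno;
	}
}
```

With a second call to `toy_get_seqno()` in place of the snapshot, the recorded seqno would differ from the one that was just compared against, so the request that gets blamed might not be the one the watchdog fired for.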

