[Intel-gfx] a potential dead loop in intel_lrc_irq_handler

Chris Wilson chris at chris-wilson.co.uk
Mon Aug 7 09:56:21 UTC 2017


Quoting Dong, Chuanxiao (2017-08-07 10:41:29)
> Hello,
> 
> I found what might be a corner case where intel_lrc_irq_handler() ends up in a dead loop, and want to understand whether it can really happen.
> 
> The scenario is like:

> 1. Write wedged to trigger a GPU reset;

This is dangerous, full stop, but even with a hangcheck the scenario is
still plausible.

> 2. meanwhile, there is one ongoing request in port[0], and its context switch interrupt is generated from HW;
> 3. as interrupt line is disabled during GPU reset, it is possible that this interrupt is not handled by intel_lrc_irq_handler();
> 4. during GPU reset, the CSB tail is reset to 0x7 which is a default value;

In theory, yes. This prevents the delayed context switch interrupt from
having any meaning.

> 5. i915 tries to replay this request during GPU reset;

If the context switch occurred (but is still pending in IIR), the request
is complete and will not be replayed.

> 6. GPU reset completed;
> 7. handling the pending interrupt of the step#2.
> 
> Normally, since the driver wrote the ELSP and replayed a request in step#5, the CSB tail should have been updated to 0 by step#7. But if the CSB tail update is not that quick, then in step#7, when handling the last pending interrupt, the CSB tail is still 0x7 and intel_lrc_irq_handler() will be stuck in a dead loop.
> 
> If the CSB tail update is not synchronized with the ELSP write, then my understanding is that this corner case is possible. If so, shall we clear the pending interrupts in IIR during i915_reset? Please correct me if anything is wrong.

The CSB buf+tail is synchronized to the interrupt. Our goal is to make
sure that the GPU is truly reset before we reset our state tracking so
that we don't have pending events on replay.

However, the CSB itself is a little bit of a black box as it is
squirreled away in a power context on reset, and it is only with a bit
of handwaving that we can say it returns to a default empty value.

CSB interrupt -> pending
GPU reset -> clears CSB head/tail
post-reset, re-enable interrupts, raise CSB interrupt
-> intel_lrc_irq_handler()
	if (CSB_head == CSB_tail)
		break;

Should be no problem. Similarly for a delayed tasklet, we haven't posted
the CSB interrupt and so we don't even read the CSB_head/tail as they are
still undefined (prior to the first CSB interrupt).
-Chris
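
The flow above can be sketched as a small standalone C model. This is only
an illustration under stated assumptions, not the i915 code: struct
csb_model, csb_model_reset(), csb_model_irq_handler(), the ring size of 6
and the shared post-reset pointer value of 0 are all hypothetical names and
values chosen for the sketch. It models the two cases discussed: a stale
interrupt handled after a reset that left head == tail, and a delayed
tasklet that runs without a posted interrupt.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for this sketch; none of these are i915 names. */
#define CSB_MODEL_ENTRIES   6u /* assumed ring size */
#define CSB_MODEL_RESET_PTR 0u /* assumed shared post-reset pointer value */

struct csb_model {
	unsigned int head; /* last entry the driver consumed */
	unsigned int tail; /* last entry the hardware wrote */
	bool irq_posted;   /* CSB interrupt pending for the handler */
};

/*
 * Model of "GPU reset -> clears CSB head/tail": both pointers end up equal.
 * A pending interrupt is deliberately left untouched, as in the scenario.
 */
static void csb_model_reset(struct csb_model *csb)
{
	csb->head = CSB_MODEL_RESET_PTR;
	csb->tail = CSB_MODEL_RESET_PTR;
}

/* Returns how many CSB entries were consumed by this run. */
static unsigned int csb_model_irq_handler(struct csb_model *csb)
{
	unsigned int handled = 0;

	/* Delayed tasklet without a posted interrupt: do not read head/tail. */
	if (!csb->irq_posted)
		return 0;
	csb->irq_posted = false;

	/* In-range pointers guarantee that the walk below terminates. */
	assert(csb->head < CSB_MODEL_ENTRIES && csb->tail < CSB_MODEL_ENTRIES);

	while (csb->head != csb->tail) {
		csb->head = (csb->head + 1) % CSB_MODEL_ENTRIES;
		handled++;
	}
	return handled;
}
```

In this model, calling csb_model_reset() while irq_posted is still set and
then running csb_model_irq_handler() consumes nothing, because head == tail
makes the loop exit immediately. The sketch does not model a case where the
hardware leaves head and tail at different or out-of-range values after
reset; that is the part the discussion above treats as a black box.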

