[Intel-gfx] drm/i915: Watchdog timeout: IRQ handler for gen8+

Antonio Argenziano antonio.argenziano at intel.com
Fri Jan 11 00:47:17 UTC 2019



On 07/01/19 08:58, Tvrtko Ursulin wrote:
> 
> On 07/01/2019 13:57, Chris Wilson wrote:
>> Quoting Tvrtko Ursulin (2019-01-07 13:43:29)
>>>
>>> On 07/01/2019 11:58, Tvrtko Ursulin wrote:
>>>
>>> [snip]
>>>
>>>>> Note about future interaction with preemption: Preemption could happen
>>>>> in a command sequence prior to watchdog counter getting disabled,
>>>>> resulting in watchdog being triggered following preemption (e.g. when
>>>>> watchdog had been enabled in the low priority batch). The driver will
>>>>> need to explicitly disable the watchdog counter as part of the
>>>>> preemption sequence.
>>>>
>>>> Does the series take care of preemption?
>>>
>>> I did not find that it does.
>>
>> Oh. I hoped that the watchdog was saved as part of the context... Then
>> despite preemption, the timeout would resume from where we left off as
>> soon as it was back on the gpu.
>>
>> If the timeout remaining was context saved it would be much simpler (at
>> least on first glance), please say it is.
> 
> I made my comments going only by the text of the commit message and
> the absence of any special handling of preemption.
> 
> Having read the spec, the situation seems like this:
> 
>   * Watchdog control and threshold registers are context saved and restored.
> 
>   * On a context switch the watchdog counter is reset to zero and
> automatically disabled until enabled by a context restore or explicitly.
> 
> So it sounds like the commit message could be wrong that special
> handling is needed from this direction. But read to the end for the
> restriction listed below.
> 
>   * Watchdog counter is reset to zero and is not accumulated across
> multiple submissions of the same context (due to preemption).
> 
> I read this as: after preemption a context gets a new full timeout
> allocation. Or in other words, if a context is preempted N times, its
> cumulative watchdog timeout will be N * the set value.
> 
> This could be theoretically exploitable to bypass the timeout. If a
> client sets up two contexts with prio -1 and -2, and keeps submitting
> periodic no-op batches against the prio -1 context, while the prio -2
> context is its own hog, then the prio -2 context defeats the watchdog
> timer. I think... I would appreciate it if someone challenged this
> conclusion.

I think you are right that this is a possibility but, is it a problem?
The client could just as well not set the threshold at all to bypass the
timeout. Also, because the hanging batch has to remain preemptible for
this to work, it cannot disrupt any work from another, higher priority
client. This is pretty much the same behavior as hangcheck IIRC, so
something we already accept.
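
Just to make the scenario concrete, a rough userspace sketch of the
two-context setup Tvrtko describes could look like the below. This is
purely illustrative and not code from the series: the context create and
setparam ioctls are existing uAPI, but the helper name is made up, and
the batch submission plus the watchdog enabling itself (which is exactly
the new uAPI under review here) are left out.

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* Create a context and drop its priority; both ioctls are existing uAPI. */
static uint32_t create_ctx_with_prio(int fd, int prio)
{
	struct drm_i915_gem_context_create create = { 0 };
	struct drm_i915_gem_context_param param = { 0 };

	ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &create);

	param.ctx_id = create.ctx_id;
	param.param = I915_CONTEXT_PARAM_PRIORITY;
	param.value = prio;
	ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &param);

	return create.ctx_id;
}

/*
 * ctx_hog    = create_ctx_with_prio(fd, -2);  <- spinning batch, watchdog on
 * ctx_pinger = create_ctx_with_prio(fd, -1);  <- periodic no-op batches
 *
 * Each pinger submission preempts the hog, which (per the bspec reading
 * above) zeroes the hog's watchdog counter, so the hog never accumulates
 * enough time on the GPU to fire the timeout.
 */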

> 
> And finally there is one programming restriction which says:
> 
>   * SW must not preempt a workload which has the watchdog enabled. It
> must either:
> 
> a) disable preemption for that workload completely, or
> b) disable the watchdog via mmio write before any write to ELSP
> 
> This seems in contradiction with the statement that the counter gets
> disabled on a context switch and stays disabled.
> 
> I did not spot anything like this in the series. So it would seem the 
> commit message is correct after all.
> 
> It would be good if someone could re-read the bspec text on register 
> 0x2178 to double check what I wrote.

The way I read it is that the restriction applies only to some platforms 
where the 'normal' description doesn't apply.
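
For reference, if option (b) from the restriction above ever does become
necessary on one of those platforms, I would imagine something roughly
like the sketch below in the submission path, before the ELSP write.
This is only my sketch, not code from the series: the register macro,
the mmio_base + 0x178 offset (0x2178 being the render ring instance
discussed above) and the disable value are my assumptions.

/* Hypothetical sketch only -- not code from this series. */
#define HYPOTHETICAL_RING_WATCHDOG_CTL(base)	_MMIO((base) + 0x178)
#define HYPOTHETICAL_WATCHDOG_DISABLE		0xffffffff /* assumed encoding */

static void watchdog_disable_before_elsp(struct intel_engine_cs *engine)
{
	struct drm_i915_private *dev_priv = engine->i915;

	/*
	 * Per the restriction quoted above, this mmio write must land
	 * before the ELSP write that triggers the preemption.
	 */
	I915_WRITE(HYPOTHETICAL_RING_WATCHDOG_CTL(engine->mmio_base),
		   HYPOTHETICAL_WATCHDOG_DISABLE);
}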

Antonio

> 
> Regards,
> 
> Tvrtko

