[Intel-gfx] drm/i915: Watchdog timeout: IRQ handler for gen8+

Fri Jan 11 21:28:07 UTC 2019

On 1/11/2019 09:31, Antonio Argenziano wrote:
>
> On 11/01/19 00:22, Tvrtko Ursulin wrote:
>>
>> On 11/01/2019 00:47, Antonio Argenziano wrote:
>>> On 07/01/19 08:58, Tvrtko Ursulin wrote:
>>>> On 07/01/2019 13:57, Chris Wilson wrote:
>>>>> Quoting Tvrtko Ursulin (2019-01-07 13:43:29)
>>>>>>
>>>>>> On 07/01/2019 11:58, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> [snip]
>>>>>>
>>>>>>>> Note about future interaction with preemption: Preemption could
>>>>>>>> happen in a command sequence prior to watchdog counter getting
>>>>>>>> disabled, resulting in watchdog being triggered following
>>>>>>>> preemption (e.g. when watchdog had been enabled in the low
>>>>>>>> priority batch). The driver will need to explicitly disable the
>>>>>>>> watchdog counter as part of the preemption sequence.
>>>>>>>
>>>>>>> Does the series take care of preemption?
>>>>>>
>>>>>> I did not find that it does.
>>>>>
>>>>> Oh. I hoped that the watchdog was saved as part of the context...
>>>>> Then despite preemption, the timeout would resume from where we
>>>>> left off as soon as it was back on the gpu.
>>>>>
>>>>> If the timeout remaining was context saved it would be much simpler
>>>>> (at least on first glance), please say it is.
>>>>
>>>> I made my comments going only by the text from the commit message 
>>>> and the absence of any preemption special handling.
>>>>
>>>> Having read the spec, the situation seems like this:
>>>>
>>>>   * Watchdog control and threshold register are context saved and 
>>>> restored.
>>>>
>>>>   * On a context switch watchdog counter is reset to zero and 
>>>> automatically disabled until enabled by a context restore or 
>>>> explicitly.
>>>>
>>>> So it sounds like the commit message could be wrong that special 
>>>> handling is needed from this direction. But read till the end on 
>>>> the restriction listed.
>>>>
>>>>   * Watchdog counter is reset to zero and is not accumulated across 
>>>> multiple submissions of the same context (due to preemption).
>>>>
>>>> I read this as - after preemption a context gets a new full timeout 
>>>> allocation. Or in other words, if a context is preempted N times, 
>>>> its cumulative watchdog timeout will be N * set value.
>>>>
>>>> This could be theoretically exploitable to bypass the timeout. If a 
>>>> client sets up two contexts with prio -1 and -2, and keeps 
>>>> submitting periodic no-op batches against the prio -1 context, while 
>>>> prio -2 is its own hog, then the prio -2 context defeats the watchdog 
>>>> timer. I think.. would appreciate if someone challenged this 
>>>> conclusion.
>>>
>>> I think you are right that it is a possibility, but is that a problem? 
>>> The client can just not set the threshold to bypass the timeout. 
>>> Also because you need the hanging batch to be simply preemptible, 
>>> you cannot disrupt any work from another client that is higher 
>>> priority. This is 
>>
>> But I think higher priority client can have the same effect on the 
>> lower priority purely by accident, no?
>>
>> As a real world example, user kicks off a background transcoding 
>> job, which happens to use prio -2, and uses the watchdog timer.
>>
>> At the same time user watches a video from a player of normal 
>> priority. This causes periodic, say 24Hz, preemption events, due to 
>> frame decoding activity on the same engine as the transcoding client.
>>
>> The question is whether this defeats the watchdog timer for the 
>> former. Then there are the questions of whether we can do something 
>> about it, and whether it really isn't a problem.
>
> I guess it depends on whether you consider that timeout as the maximum 
> lifespan a workload can have or the maximum contiguous active time.

I believe the intended purpose of the watchdog is to prevent broken 
bitstreams hanging the transcoder/player. That is, it is a form of error 
detection used by the media driver to handle bad user input. So if there 
is a way for the watchdog to be extended indefinitely under normal 
situations, that would be a problem. It means the transcoder will not 
detect the broken input data in a timely manner and effectively hang 
rather than skip over to the next packet. And note that broken input 
data can be caused by something as innocent as a dropped packet due to 
high network contention. No need for any malicious activity at all.
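
To make the concern concrete, here is a toy model of the behaviour as it was 
read from the spec earlier in this thread (counter reset to zero on every 
context switch, not accumulated across submissions). It is only a sketch and 
has not been checked against hardware. The function name, the 60 ms threshold 
and the assumption that the preempting work takes negligible time are all 
hypothetical; only the 24Hz preemption rate comes from the example above.

```python
from typing import Optional


def time_until_watchdog_fires(threshold_ms: float,
                              slice_ms: Optional[float],
                              give_up_after_ms: float,
                              accumulate: bool = False) -> Optional[float]:
    """Toy model of a hung batch running under a watchdog (hypothetical helper).

    threshold_ms     -- watchdog threshold for the context
    slice_ms         -- active time between preemptions, None = never preempted
    give_up_after_ms -- stop simulating after this much active time
    accumulate       -- if True, the counter survives preemption (the
                        "context saved" behaviour hoped for in the thread);
                        if False, it is reset to zero on every context switch

    Returns the active time consumed when the watchdog fires, or None if it
    never fires within give_up_after_ms.
    """
    counter = 0.0  # modelled watchdog counter
    active = 0.0   # total active time consumed by the hung batch

    while active < give_up_after_ms:
        # How long the batch runs before the next preemption event.
        run = slice_ms if slice_ms is not None else give_up_after_ms - active
        remaining = threshold_ms - counter
        if run >= remaining:
            # The counter reaches the threshold within this slice.
            return active + remaining
        active += run
        # Preemption: keep or discard the progress made so far.
        counter = counter + run if accumulate else 0.0

    return None


if __name__ == "__main__":
    frame_period_ms = 1000.0 / 24  # 24Hz preemption from the video player
    print(time_until_watchdog_fires(60.0, None, 10_000.0))
    print(time_until_watchdog_fires(60.0, frame_period_ms, 10_000.0))
    print(time_until_watchdog_fires(60.0, frame_period_ms, 10_000.0,
                                    accumulate=True))
```

Under these assumptions, an unpreempted hang is caught after 60 ms, while the 
same hang preempted at 24Hz is never caught within the simulated 10 seconds, 
because each slice of roughly 41.7 ms is shorter than the threshold. With an 
accumulating counter the hang would be caught after about 60 ms of active 
time regardless of preemption.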

John.