[Intel-gfx] drm/i915: Watchdog timeout: IRQ handler for gen8+
John Harrison
John.C.Harrison at Intel.com
Fri Jan 11 21:28:07 UTC 2019
On 1/11/2019 09:31, Antonio Argenziano wrote:
>
> On 11/01/19 00:22, Tvrtko Ursulin wrote:
>>
>> On 11/01/2019 00:47, Antonio Argenziano wrote:
>>> On 07/01/19 08:58, Tvrtko Ursulin wrote:
>>>> On 07/01/2019 13:57, Chris Wilson wrote:
>>>>> Quoting Tvrtko Ursulin (2019-01-07 13:43:29)
>>>>>>
>>>>>> On 07/01/2019 11:58, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> [snip]
>>>>>>
>>>>>>>> Note about future interaction with preemption: Preemption could
>>>>>>>> happen in a command sequence prior to watchdog counter getting
>>>>>>>> disabled, resulting in watchdog being triggered following
>>>>>>>> preemption (e.g. when watchdog had been enabled in the low
>>>>>>>> priority batch). The driver will need to explicitly disable the
>>>>>>>> watchdog counter as part of the preemption sequence.
>>>>>>>
>>>>>>> Does the series take care of preemption?
>>>>>>
>>>>>> I did not find that it does.
>>>>>
>>>>> Oh. I hoped that the watchdog was saved as part of the context...
>>>>> Then despite preemption, the timeout would resume from where we
>>>>> left off as soon as it was back on the gpu.
>>>>>
>>>>> If the timeout remaining was context saved it would be much
>>>>> simpler (at least on first glance), please say it is.
>>>>
>>>> I made my comments going only by the text from the commit message
>>>> and the absence of any preemption special handling.
>>>>
>>>> Having read the spec, the situation seems like this:
>>>>
>>>> * Watchdog control and threshold register are context saved and
>>>> restored.
>>>>
>>>> * On a context switch watchdog counter is reset to zero and
>>>> automatically disabled until enabled by a context restore or
>>>> explicitly.
>>>>
>>>> So it sounds like the commit message could be wrong that special
>>>> handling is needed from this direction. But read till the end on
>>>> the restriction listed.
>>>>
>>>> * Watchdog counter is reset to zero and is not accumulated across
>>>> multiple submissions of the same context (due to preemption).
>>>>
>>>> I read this as: after preemption a context gets a new full timeout
>>>> allocation. Or in other words, if a context is preempted N times,
>>>> its cumulative watchdog timeout will be N * set value.
>>>>
>>>> This could be theoretically exploitable to bypass the timeout. If a
>>>> client sets up two contexts with prio -1 and -2, and keeps
>>>> submitting periodic no-op batches against the prio -1 context, while
>>>> prio -2 is its own hog, then the prio -2 context defeats the watchdog
>>>> timer. I think. I would appreciate it if someone challenged this
>>>> conclusion.
>>>
>>> I think you are right that it is a possibility, but is that a problem?
>>> The client can just not set the threshold to bypass the timeout.
>>> Also because you need the hanging batch to be simply preemptible,
>>> you cannot disrupt any work from another client that is higher
>>> priority. This is
>>
>> But I think higher priority client can have the same effect on the
>> lower priority purely by accident, no?
>>
>> As a real world example, the user kicks off a background transcoding
>> job, which happens to use prio -2, and uses the watchdog timer.
>>
>> At the same time the user watches a video from a player of normal
>> priority. This causes periodic, say 24Hz, preemption events, due to
>> frame decoding activity on the same engine as the transcoding client.
>>
>> The question is whether this defeats the watchdog timer for the
>> former. Then there are the questions of whether we can do something
>> about it and whether it really isn't a problem.
>
> I guess it depends if you consider that timeout as the maximum
> lifespan a workload can have or max contiguous active time.

I believe the intended purpose of the watchdog is to prevent broken
bitstreams hanging the transcoder/player. That is, it is a form of error
detection used by the media driver to handle bad user input. So if there
is a way for the watchdog to be extended indefinitely under normal
situations, that would be a problem. It means the transcoder will not
detect the broken input data in a timely manner and will effectively hang
rather than skip over to the next packet. And note that broken input
data can be caused by something as innocent as a dropped packet due to
high network contention. No need for any malicious activity at all.
John.
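
To make the concern in this thread concrete, here is a minimal sketch of the
behaviour under discussion. It is a toy model, not driver code: it assumes the
watchdog counter is reset to zero every time the context is preempted (as
Tvrtko reads the spec) and that preemption arrives at a fixed period. The
function name, its parameters, and the example numbers (a 100 ms threshold and
24Hz preemption, roughly 41667 us per slice) are all hypothetical.

```python
def watchdog_fires_at(threshold_us, preempt_period_us, total_runtime_us):
    """Return the active time (us) at which the watchdog fires, or None.

    Toy model only: the counter restarts from zero after every preemption,
    so it can only fire if one uninterrupted slice reaches the threshold.
    preempt_period_us=None means the context is never preempted.
    """
    elapsed_us = 0
    while elapsed_us < total_runtime_us:
        remaining_us = total_runtime_us - elapsed_us
        if preempt_period_us is None:
            slice_us = remaining_us
        else:
            slice_us = min(preempt_period_us, remaining_us)
        if slice_us >= threshold_us:
            # The counter reaches the threshold inside this slice.
            return elapsed_us + threshold_us
        # Preempted (or finished) first: the counter goes back to zero.
        elapsed_us += slice_us
    return None


# A hung 10 s workload with a 100 ms watchdog threshold.
print(watchdog_fires_at(100_000, None, 10_000_000))    # never preempted
print(watchdog_fires_at(100_000, 41_667, 10_000_000))  # preempted at ~24Hz
```

In this model the watchdog catches the hang only when no preemption occurs
within the threshold. With periodic preemption shorter than the threshold the
function returns None, which is the "extended indefinitely under normal
situations" case described above.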