[Intel-gfx] drm/i915: Watchdog timeout: IRQ handler for gen8+
Antonio Argenziano
antonio.argenziano at intel.com
Fri Jan 11 17:31:51 UTC 2019
On 11/01/19 00:22, Tvrtko Ursulin wrote:
>
> On 11/01/2019 00:47, Antonio Argenziano wrote:
>> On 07/01/19 08:58, Tvrtko Ursulin wrote:
>>> On 07/01/2019 13:57, Chris Wilson wrote:
>>>> Quoting Tvrtko Ursulin (2019-01-07 13:43:29)
>>>>>
>>>>> On 07/01/2019 11:58, Tvrtko Ursulin wrote:
>>>>>
>>>>> [snip]
>>>>>
>>>>>>> Note about future interaction with preemption: Preemption could
>>>>>>> happen
>>>>>>> in a command sequence prior to watchdog counter getting disabled,
>>>>>>> resulting in watchdog being triggered following preemption (e.g.
>>>>>>> when
>>>>>>> watchdog had been enabled in the low priority batch). The driver
>>>>>>> will
>>>>>>> need to explicitly disable the watchdog counter as part of the
>>>>>>> preemption sequence.
>>>>>>
>>>>>> Does the series take care of preemption?
>>>>>
>>>>> I did not find that it does.
>>>>
>>>> Oh. I hoped that the watchdog was saved as part of the context... Then
>>>> despite preemption, the timeout would resume from where we left off as
>>>> soon as it was back on the gpu.
>>>>
>>>> If the timeout remaining was context saved it would be much simpler (at
>>>> least on first glance), please say it is.
>>>
>>> I made my comments going only by the text from the commit message and
>>> the absence of any preemption special handling.
>>>
>>> Having read the spec, the situation seems like this:
>>>
>>> * Watchdog control and threshold register are context saved and
>>> restored.
>>>
>>> * On a context switch watchdog counter is reset to zero and
>>> automatically disabled until enabled by a context restore or explicitly.
>>>
>>> So it sounds the commit message could be wrong that special handling
>>> is needed from this direction. But read till the end on the
>>> restriction listed.
>>>
>>> * Watchdog counter is reset to zero and is not accumulated across
>>> multiple submission of the same context (due preemption).
>>>
>>> I read this as - after preemption contexts gets a new full timeout
>>> allocation. Or in other words, if a context is preempted N times,
>>> it's cumulative watchdog timeout will be N * set value.
>>>
>>> This could be theoretically exploitable to bypass the timeout. If a
>>> client sets up two contexts with prio -1 and -2, and keeps submitting
>>> periodical no-op batches against prio -1 context, while prio -2 is
>>> it's own hog, then prio -2 context defeats the watchdog timer. I
>>> think.. would appreciate is someone challenged this conclusion.
>>
>> I think you are right that is a possibility but, is that a problem?
>> The client can just not set the threshold to bypass the timeout. Also
>> because you need the hanging batch to be simply preemptible, you
>> cannot disrupt any work from another client that is higher priority.
>> This is
>
> But I think higher priority client can have the same effect on the lower
> priority purely by accident, no?
>
> As a real world example, user kicks off an background transcoding job,
> which happens to use prio -2, and uses the watchdog timer.
>
> At the same time user watches a video from a player of normal priority.
> This causes periodic, say 24Hz, preemption events, due frame decoding
> activity on the same engine as the transcoding client.
>
> Does this defeat the watchdog timer for the former is the question? Then
> the questions of can we do something about it and whether it really
> isn't a problem?
I guess it depends if you consider that timeout as the maximum lifespan
a workload can have or max contiguous active time.
>
> Maybe it is not disrupting higher priority clients but it is causing an
> time unbound power drain.
>
>> pretty much the same behavior of hangcheck IIRC so something we
>> already accept.
>
> You mean today hangcheck wouldn't notice a hanging batch in the same
> scenario as above? If so it sounds like a huge gap we need to try and fix.
My understanding of it is that we only keep a record of what was running
the last time hangcheck was run so it is possible to trick it into
resetting when a preemption occurs but I could be missing something.
>
>>>
>>> And finally there is one programming restriction which says:
>>>
>>> * SW must not preempt the workload which has watchdog enabled.
>>> Either it must:
>>>
>>> a) disable preemption for that workload completely, or
>>> b) disable the watchdog via mmio write before any write to ELSP
>>>
>>> This seems it contradiction with the statement that the counter gets
>>> disabled on context switch and stays disabled.
>>>
>>> I did not spot anything like this in the series. So it would seem the
>>> commit message is correct after all.
>>>
>>> It would be good if someone could re-read the bspec text on register
>>> 0x2178 to double check what I wrote.
>>
>> The way I read it is that the restriction applies only to some
>> platforms where the 'normal' description doesn't apply.
>
> You are right. Are the listed parts in the field so the series would
> have to handle this or we can ignore it?
I think there is something we need to handle e.g. BXT.
Antonio
>
> Regards,
>
> Tvrtko
More information about the Intel-gfx
mailing list