[Intel-gfx] drm/i915: Watchdog timeout: IRQ handler for gen8+

Mon Jan 7 12:58:39 UTC 2019

On 07/01/2019 12:16, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2019-01-07 11:58:13)
>>
>> Hi,
>>
>> This series has not been recognized by Patchwork as such, nor are the
>> patches numbered. Have you used git format-patch -<N> --cover-letter and
>> git send-email to send it out?
>>
>> Rest inline.
>>
>> On 05/01/2019 02:39, Carlos Santa wrote:
>>> +static void gen8_watchdog_irq_handler(unsigned long data)
>>> +{
>>> +     struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
>>> +     struct drm_i915_private *dev_priv = engine->i915;
>>> +     enum forcewake_domains fw_domains;
>>> +     u32 current_seqno;
>>> +
>>> +     switch (engine->id) {
>>> +     default:
>>> +             MISSING_CASE(engine->id);
>>> +             /* fall through */
>>> +     case RCS:
>>> +             fw_domains = FORCEWAKE_RENDER;
>>> +             break;
>>> +     case VCS:
>>> +     case VCS2:
>>> +     case VECS:
>>> +             fw_domains = FORCEWAKE_MEDIA;
>>> +             break;
>>> +     }
>>> +
>>> +     intel_uncore_forcewake_get(dev_priv, fw_domains);
>>
>> I'd be tempted to drop this and just use I915_WRITE. It doesn't feel
>> like there is any performance to be gained with it and it embeds too
>> much knowledge here.
> 
> No, no, no. Let's not reintroduce a fw inside irq context on a frequent
> timer again.

Tasklet and hopefully watchdog timeouts are not frequent. :)

> Rule of thumb for fw_get:
> gen6+: 10us to 50ms.
> gen8+: 10us to 500us.
> 
> And then we don't release fw for 1ms after the fw_put. So we basically
> prevent GT powersaving while the watchdog is active. That strikes me as
> hopefully an unintended consequence.
> 
> The fw_get will be required if we actually hang, but for the timer
> check, we should be able to do without.

That would be nice, but it is needed by the watchdog disable/re-enable 
logic. Which I commented looks suspect to me so maybe something can be 
done about that.

But in general, I didn't quite get if you are opposed to my suggestion 
not to open code knowledge of fw domains here in favour of simple 
I915_WRITE, or just the whole concept of taking a fw by any means here.

> And while on the topic of the timer irq, it should be forcibly cleared
> along intel_engine_park, so that we ensure it is not raised while the
> device/driver is supposed to be asleep. Or something to that effect.

I have raised the issue of syncing the potentially delayed tasklet, but 
yeah, could be that more is needed.

>>> +     current_seqno = intel_engine_get_seqno(engine);
>>> +
>>> +     /* did the request complete after the timer expired? */
>>> +     if (intel_engine_last_submit(engine) == current_seqno)
>>> +             goto fw_put;
>>> +
>>> +     if (engine->hangcheck.watchdog == current_seqno) {
>>> +             /* Make sure the active request will be marked as guilty */
>>> +             engine->hangcheck.stalled = true;
>>> +             engine->hangcheck.acthd = intel_engine_get_active_head(engine);
>>> +             engine->hangcheck.seqno = current_seqno;
>>> +
>>> +             /* And try to run the hangcheck_work as soon as possible */
>>> +             set_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags);
>>> +             queue_delayed_work(system_long_wq,
>>> +                                &dev_priv->gpu_error.hangcheck_work,
>>> +                                round_jiffies_up_relative(HZ));
>>> +     } else {
>>> +             engine->hangcheck.watchdog = current_seqno;
>>
>> The logic above potentially handles my previous question? Could be if
>> batch 2 hangs. But..
> 
> Also, DO NOT USE HANGCHECK for this. The whole design was to be able to
> do the engine reset right away. (Now guc can't but that's known broken.)
> 
> Aside, we have to rewrite this entire logic anyway as the engine seqno
> and global_seqno are obsolete.

Btw one thing I forgot to say - I did not focus on the hangcheck 
interactions - I'll leave that for people more in the know of that.

Regards,

Tvrtko