[Intel-gfx] drm/i915: Watchdog timeout: IRQ handler for gen8+
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Mon Jan 7 12:58:39 UTC 2019
On 07/01/2019 12:16, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2019-01-07 11:58:13)
>>
>> Hi,
>>
>> This series has not been recognized by Patchwork as such, nor are the
>> patches numbered. Have you used git format-patch -<N> --cover-letter and
>> git send-email to send it out?
>>
>> Rest inline.
>>
>> On 05/01/2019 02:39, Carlos Santa wrote:
>>> +static void gen8_watchdog_irq_handler(unsigned long data)
>>> +{
>>> + struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
>>> + struct drm_i915_private *dev_priv = engine->i915;
>>> + enum forcewake_domains fw_domains;
>>> + u32 current_seqno;
>>> +
>>> + switch (engine->id) {
>>> + default:
>>> + MISSING_CASE(engine->id);
>>> + /* fall through */
>>> + case RCS:
>>> + fw_domains = FORCEWAKE_RENDER;
>>> + break;
>>> + case VCS:
>>> + case VCS2:
>>> + case VECS:
>>> + fw_domains = FORCEWAKE_MEDIA;
>>> + break;
>>> + }
>>> +
>>> + intel_uncore_forcewake_get(dev_priv, fw_domains);
>>
>> I'd be tempted to drop this and just use I915_WRITE. It doesn't feel
>> like there is any performance to be gained with it and it embeds too
>> much knowledge here.
>
> No, no, no. Let's not reintroduce a fw inside irq context on a frequent
> timer again.
Tasklet and hopefully watchdog timeouts are not frequent. :)
> Rule of thumb for fw_get:
> gen6+: 10us to 50ms.
> gen8+: 10us to 500us.
>
> And then we don't release fw for 1ms after the fw_put. So we basically
> prevent GT powersaving while the watchdog is active. That strikes me as
> hopefully an unintended consequence.
>
> The fw_get will be required if we actually hang, but for the timer
> check, we should be able to do without.
That would be nice, but it is needed by the watchdog disable/re-enable
logic. Which I commented looks suspect to me so maybe something can be
done about that.
But in general, I didn't quite get if you are opposed to my suggestion
not to open code knowledge of fw domains here in favour of simple
I915_WRITE, or just the whole concept of taking a fw by any means here.
> And while on the topic of the timer irq, it should be forcibly cleared
> along intel_engine_park, so that we ensure it is not raised while the
> device/driver is supposed to be asleep. Or something to that effect.
I have raised the issue of syncing the potentially delayed tasklet, but
yeah, could be that more is needed.
>>> + current_seqno = intel_engine_get_seqno(engine);
>>> +
>>> + /* did the request complete after the timer expired? */
>>> + if (intel_engine_last_submit(engine) == current_seqno)
>>> + goto fw_put;
>>> +
>>> + if (engine->hangcheck.watchdog == current_seqno) {
>>> + /* Make sure the active request will be marked as guilty */
>>> + engine->hangcheck.stalled = true;
>>> + engine->hangcheck.acthd = intel_engine_get_active_head(engine);
>>> + engine->hangcheck.seqno = current_seqno;
>>> +
>>> + /* And try to run the hangcheck_work as soon as possible */
>>> + set_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags);
>>> + queue_delayed_work(system_long_wq,
>>> + &dev_priv->gpu_error.hangcheck_work,
>>> + round_jiffies_up_relative(HZ));
>>> + } else {
>>> + engine->hangcheck.watchdog = current_seqno;
>>
>> The logic above potentially handles my previous question? Could be if
>> batch 2 hangs. But..
>
> Also, DO NOT USE HANGCHECK for this. The whole design was to be able to
> do the engine reset right away. (Now guc can't but that's known broken.)
>
> Aside, we have to rewrite this entire logic anyway as the engine seqno
> and global_seqno are obsolete.
Btw one thing I forgot to say - I did not focus on the hangcheck
interactions - I'll leave that for people more in the know of that.
Regards,
Tvrtko
More information about the Intel-gfx
mailing list