[Intel-gfx] [PATCH] drm/i915: fix exiting context timeout calculation

Thu Dec 1 10:28:39 UTC 2022

On 01/12/2022 00:22, John Harrison wrote:
> On 11/29/2022 00:43, Tvrtko Ursulin wrote:
>> On 28/11/2022 16:52, Andrzej Hajda wrote:
>>> When a context is exiting, preempt_timeout_ms is used as the timeout,
>>> but since the introduction of DRM_I915_PREEMPT_TIMEOUT_COMPUTE it has
>>> increased to 7.5 seconds. The heartbeat occurs earlier, but it is still 2.5s.
>>>
>>> Fixes: d7a8680ec9fb21 ("drm/i915: Improve long running compute w/a for GuC submission")
>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2410
>>> Signed-off-by: Andrzej Hajda <andrzej.hajda at intel.com>
>>> ---
>>> Hi all,
>>>
>>> I am not sure what the expected solution is here, or whether my patch
>>> actually reverts the intentions of patch d7a8680ec9fb21. Feel free to
>>> propose something better.
>>> Another alternative would be to increase the timeout in the IGT tests,
>>> but I am not sure if this is a good direction.
>>
>> Is it the hack with the FIXME marker from 47daf84a8bfb ("drm/i915: 
>> Make the heartbeat play nice with long pre-emption timeouts") that 
>> actually breaks things? (If IGT modifies the preempt timeout the 
>> heartbeat extension will not work as intended.)
>>
>> If so, I think we agreed during review that was a weakness which needs 
>> to be addressed, but I would need to re-read the old threads to 
>> remember what the plan was. Regardless of what it was, it may now be 
>> time to continue with those improvements.
>>
> What is the actual issue? Just that closing contexts are taking forever 
> to actually close? That would be the whole point of the 
> 'context_is_exiting' patch. Which I still totally disagree with.
> 
> If the context is being closed 'gracefully' and it is intended that it 
> should be allowed time to pre-empt without being killed via an engine 
> reset then the 7.5s delay is required. That is the officially agreed 
> upon timeout to allow compute capable contexts to reach a pre-emption 
> point before they should be killed. If an IGT is failing because it 
> enforces a shorter timeout then the IGT needs to be updated to account 
> for the fact that i915 has to support slow compute workloads.
> 
> If the context is being closed 'forcefully' and should be killed 
> immediately then you should be using the 'BANNED_PREEMPT_TIMEOUT' value 
> not the sysfs/config value.
> 
> Regarding heartbeats...
> 
> The heartbeat period is 2.5s. But there are up to five heartbeat periods 
> between the heartbeat starting and it declaring a hang. The patch you 
> mention also introduced a check on the pre-emption timeout when the last 
> period starts. If the pre-emption timeout is longer than the heartbeat 
> period then the last period is extended to guarantee that a full 
> pre-emption time is granted before declaring the hang.
> 
> Are you saying that a heartbeat timeout is occurring and killing the 
> system? Or are you just worried that something doesn't align correctly?

I'll leave this to Andrzej since I am not the one debugging this. I just glanced over the IGT and saw that it contains code which sets both the preempt timeout and the heartbeat interval to non-default values. And then I remembered this:

next_heartbeat():
...
         /*
          * FIXME: The final period extension is disabled if the period has been
          * modified from the default. This is to prevent issues with certain
          * selftests which override the value and expect specific behaviour.
          * Once the selftests have been updated to either cope with variable
          * heartbeat periods (or to override the pre-emption timeout as well,
          * or just to add a selftest specific override of the extension), the
          * generic override can be removed.
          */
         if (rq && rq->sched.attr.priority >= I915_PRIORITY_BARRIER &&
             delay == engine->defaults.heartbeat_interval_ms) {

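With a non-default heartbeat interval, this check skips the extension of the last heartbeat period, so it would not do the right thing if the IGT relies on that extension. I don't know whether it does; I am just pointing it out so someone can check whether this FIXME needs to be prioritised.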
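
To make the concern concrete, here is a minimal standalone sketch of the gating described by that FIXME. It is not the i915 code: struct hb_model, hb_next_delay_ms() and hb_worst_case_ms() are hypothetical names, and the rule "extend the final period to the pre-emption timeout" is an assumption based only on the description quoted above, as is taking the "up to five periods" figure at face value.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the engine properties (not the i915 structs). */
struct hb_model {
	long default_interval_ms;  /* built-in default heartbeat interval */
	long interval_ms;          /* current, possibly overridden, interval */
	long preempt_timeout_ms;   /* current pre-emption timeout */
};

/*
 * Delay before the next heartbeat check.
 * Assumption: the final period is stretched to the pre-emption timeout,
 * but only while the interval is still at its default (the FIXME gate).
 */
static long hb_next_delay_ms(const struct hb_model *m, bool final_period)
{
	long delay = m->interval_ms;

	assert(delay > 0);

	if (final_period && delay == m->default_interval_ms &&
	    m->preempt_timeout_ms > delay)
		delay = m->preempt_timeout_ms;

	return delay;
}

/*
 * Rough worst-case time from the first heartbeat to a hang being declared,
 * assuming 'periods' heartbeat periods of which only the last may be extended.
 */
static long hb_worst_case_ms(const struct hb_model *m, int periods)
{
	assert(periods > 0);

	return (periods - 1) * hb_next_delay_ms(m, false) +
	       hb_next_delay_ms(m, true);
}
```

In this model, the defaults (2500 ms interval, 7500 ms pre-emption timeout) stretch the final period to 7500 ms. Once a test overrides the interval to, say, 1000 ms, the final period stays at 1000 ms regardless of the pre-emption timeout, which is the case worth checking against what the IGT expects.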

Regards,

Tvrtko