[PATCH i-g-t] Bump aborting on network failure deadline to 40 seconds

Peter Senna Tschudin peter.senna at linux.intel.com
Tue Feb 11 13:38:27 UTC 2025


Hi Kasia,

On 11.02.2025 12:59, Piecielska, Katarzyna wrote:
> Hello Peter
>
> Hi Kamil,
> 
> Thank you for your review. Please see below.
> 
> On 11.02.2025 10:21, Kamil Konieczny wrote:
>> Hi Peter,
>> On 2025-02-06 at 16:21:47 +0100, Peter Senna Tschudin wrote:
>>> Commit ddfde25f16ba ("runner: Add support for aborting on network
>>> failure") introduced a 20 second deadline for the DUT’s network to 
>>> recover after a suspend/resume cycle. If the network isn’t back up 
>>> within that time, igt_runner aborts the test run to save logs and 
>>> prevent potential log loss from an imminent power cycle.
>>>
>>> This deadline was set to accommodate our internal CI system, which 
>>> checks for DUT network connectivity every 5 seconds and retries up to 
>>> 3 times at 20 second intervals. If it fails 3 consecutive checks,
>>
>> This is a little confusing: in the first paragraph you wrote about
>> a 20 second deadline, and here it looks like 60 seconds (3*20).
> 
> The first paragraph is explicitly about the deadline introduced by a previous commit. The second paragraph is explicitly about the internal CI mechanism that inspired the deadline. Can you explain what is confusing?
> 
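
To make the two numbers concrete, here is a rough sketch of the
arithmetic. It assumes the CI power cycles the DUT after 3 failed checks
spaced 20 seconds apart, so about 60 seconds in total, which is the 3*20
reading above. The function names are made up for illustration and none
of this is igt_runner code.

```python
# Hypothetical timing model. The constants come from the commit message
# quoted above; how they combine is an assumption, not CI documentation.
CI_RETRIES = 3              # failed checks before CI power cycles the DUT
CI_RETRY_INTERVAL_S = 20    # seconds between those checks


def ci_power_cycle_after_s(retries=CI_RETRIES, interval_s=CI_RETRY_INTERVAL_S):
    """Assumed worst-case outage, in seconds, before CI power cycles the DUT."""
    return retries * interval_s


def abort_margin_s(runner_deadline_s):
    """Seconds between the runner abort and the assumed CI power cycle."""
    return ci_power_cycle_after_s() - runner_deadline_s


if __name__ == "__main__":
    for deadline in (20, 40):
        print(f"deadline {deadline}s -> margin {abort_margin_s(deadline)}s")
```

Under that assumption both the old 20 second deadline and the new 40
second deadline expire before the power cycle, so the runner still has
time to abort and save logs.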
>>
>>> it triggers a power cycle on the DUT.
>>>
>>> Although our internal CI system can be configured with a longer
>> -------------- ^^^^^^^^
>> Remove this.
> 
> No, why?
> 
> 
> Kasia -> This is an upstream review, no need to add the word 'internal'.

This entire discussion is about safety mechanisms from our internal CI.
Even the commit I am referring to, ddfde25f16ba ("runner: Add support for
aborting on network failure"), was created to work well with our internal
CI.

What is the problem?

> 
>>
>>> wait time, extending it further would unnecessarily prolong tests in 
>>> cases of DUT hangs.
>>>
>>> Bumping the deadline to 40 seconds keeps the abort mechanism safely
>>
>> imho this should be an option for igt-runner. I would prefer not to
>> adjust it later and to let the CI team tune it. The option could be
>> either a time or a retry counter, or both.
> 
> I disagree. We can add value now by potentially preventing premature aborts with very little risk of creating any issue. The people in CC seem to agree with that.
> 
> This patch is as simple as it can get. I don't buy the benefit of the extra complexity of adding yet another command line option. It is way more effective to send one of these every 6 years or so.
> 
> Am I correct that this is a nack from you?
> 
> It looks like we need a good discussion with the CI team. @Knop, Ryszard @Grabski, Mateusz can you comment on this approach?

What is not clear? I am not sure there is any open issue here.

Thanks,

Peter

