[PATCH i-g-t] Bump aborting on network failure deadline to 40 seconds
Peter Senna Tschudin
peter.senna at linux.intel.com
Tue Feb 11 13:38:27 UTC 2025
Hi Kasia,
On 11.02.2025 12:59, Piecielska, Katarzyna wrote:
> Hello Peter
>
> Hi Kamil,
>
> Thank you for your review. Please see below.
>
> On 11.02.2025 10:21, Kamil Konieczny wrote:
>> Hi Peter,
>> On 2025-02-06 at 16:21:47 +0100, Peter Senna Tschudin wrote:
>>> Commit ddfde25f16ba ("runner: Add support for aborting on network
>>> failure") introduced a 20 second deadline for the DUT’s network to
>>> recover after a suspend/resume cycle. If the network isn’t back up
>>> within that time, igt_runner aborts the test run to save logs and
>>> prevent potential log loss from an imminent power cycle.
>>>
>>> This deadline was set to accommodate our internal CI system, which
>>> checks for DUT network connectivity every 5 seconds and retries up to
>>> 3 times at 20 second intervals. If it fails 3 consecutive checks,
>>
>> This is a little confusing, you wrote in first paragraph about
>> 20 second deadline and here it looks like 60 seconds (3*20).
>
> The first paragraph is explicitly about the deadline introduced by a previous commit. The second paragraph is explicitly about the internal CI mechanism that inspired the deadline. Can you explain what is confusing?
>
>>
>>> it triggers a power cycle on the DUT.
>>>
>>> Although our internal CI system can be configured with a longer
>> -------------- ^^^^^^^^
>> Remove this.
>
> No, why?
>
>
> Kasia -> This is an upstream review, no need to add the word 'internal'.
This entire discussion is about safety mechanisms from our internal CI.
Even the commit I am referring to, ddfde25f16ba ("runner: Add support for
aborting on network failure"), was created to work well with our internal
CI. What is the problem?
>
>>
>>> wait time, extending it further would unnecessarily prolong tests in
>>> cases of DUT hangs.
>>>
>>> Bumping the deadline to 40 seconds keeps the abort mechanism safely
>>
>> imho this should be an option for igt-runner. I would prefer not to
>> adjust it later; let the CI team tune it. The option could be either a
>> time, a retry counter, or both.
>
> I disagree. We can add value now by potentially preventing premature aborts with very little risk of creating any issues. The people in CC seem to agree with that.
>
> This patch is as simple as it can get. I don't buy the benefit of the extra complexity of adding yet another command line option. It is way more effective to send one of these every 6 years or so.
>
> Am I correct that this is a nack from you?
>
> This looks like we need a good discussion with CI team. @Knop, Ryszard @Grabski, Mateusz can you comment on this approach?
What is not clear? I am not sure there is any open issue here.
Thanks,
Peter