[PATCH i-g-t] Bump aborting on network failure deadline to 40 seconds

Wed Feb 12 15:38:34 UTC 2025

On 12.02.2025 15:59, Kamil Konieczny wrote:
> Hi Peter,
> On 2025-02-11 at 10:55:37 +0100, Peter Senna Tschudin wrote:
>> Hi Kamil,
>>
>> Thank you for your review. Please see below.
>>
>> On 11.02.2025 10:21, Kamil Konieczny wrote:
>>> Hi Peter,
>>> On 2025-02-06 at 16:21:47 +0100, Peter Senna Tschudin wrote:
>>>> Commit ddfde25f16ba ("runner: Add support for aborting on network
>>>> failure") introduced a 20 second deadline for the DUT’s network
>>>> to recover after a suspend/resume cycle. If the network isn’t
>>>> back up within that time, igt_runner aborts the test run to save logs
>>>> and prevent potential log loss from an imminent power cycle.
>>>>
>>>> This deadline was set to accommodate our internal CI system, which
>>>> checks for DUT network connectivity every 5 seconds and retries up
>>>> to 3 times at 20 second intervals. If it fails 3 consecutive checks,
>>>
>>> This is a little confusing, you wrote in first paragraph about
>>> 20 second deadline and here it looks like 60 seconds (3*20).
>>
>> The first paragraph is explicitly about the deadline introduced by a
>> previous commit. The second paragraph is explicitly about the internal
>> CI mechanism that inspired the deadline. Can you explain what is
>> confusing?
>>
> 
> Using same value "20 seconds" confused me, sorry. imho
> it is better to write something like:
> 
> External monitoring machine checks DUTs health and if it
> cannot reach it over network for 60 seconds it will reboot DUT.
> 
> So no need to exactly describe how it is done, only what is
> happening and why runner needs to adapt to that.
> 
>>>
>>>> it triggers a power cycle on the DUT.
>>>>
>>>> Although our internal CI system can be configured with a longer
>>> -------------- ^^^^^^^^
>>> Remove this.
>>
>> No, why?
>>
> 
> Looks like it happened when Ci tested upstream commit:
> https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/10030
> 
> and it is one year old, do we have anything new?
> 
>>>
>>>> wait time, extending it further would unnecessarily prolong tests
>>>> in cases of DUT hangs.
>>>>
>>>> Bumping the deadline to 40 seconds keeps the abort mechanism safely
>>>
>>> imho this should be option for igt-runner, I would prefer to not
>>> adjust it later, let CI team tune it. Option could be either time
>>> or retry counter or both.
>>
>> I disagree. We can add value now by potentially preventing premature
> 
> I would prefer actual case solved by this change, not 'potential' one.

Unfortunately we are dealing with an issue that is not deterministic. There
is no way to know how long it will take for the network to go up after a
suspend and resume cycle.

Our main assumption is that because 20 seconds is enough for the network to
come up after most suspend/resume cycles, doubling it should considerably
reduce premature aborts, solving the problem.

Unfortunately for cases where the network takes more than 40 seconds to go
up, this patch makes no difference. But then it is ok to abort.

> 
>> aborts with very little risk of creating any issue. The people in CC
>> seems to agree with that.
>>
>> This patch is as simple as it can get. I don't buy the benefit of the
>> extra complexity of adding yet another command line option. It is way
>> more effective to send one of these every 6 years or so.
> 
> Sorry, I didn't look into code, after reading part of it setting this
> is done not by runner options but by igtrc file or enviroment, see
> my reply to Ryszard. imho it is better to put these into configs/options
> then hard-coding into runner, we could keep old 20s as default.
> 
>>
>> Am I correct that this is a nack from you?
>>
> 
> No, I am just not convinced it will solve anything, do you have
> numbers? What are times needed for hibernate and resume? Because
> it will be time-in-hibernate + time-for-resume + time-for-ping-deadline
> and if this exceeds 60sec DUT will be rebooted.

Unfortunately CI does not track how long the network takes to go up after
a suspend and resume cycle.

Statistically speaking, we expect that bumping the deadline to 40 seconds
will reduce premature aborts meaningfully, solving the problem. We have an
asymmetric situation, which is good: having this patch has potential to
make a positive impact, while the patch risk is very low.

Sorry but nothing is exact here, still this patch makes sense.

> 
> Regards,
> Kamil
> 
>>>
>>>> within our internal CI system retry window while improving chances
>>>> of preventing a premature abort. For upstream testing on Jenkins,
>>>> the deadlines vary from 16 and 25 minutes, and this change has
>>>> no impact.
>>>>
>>>> CC: juha-pekka.heikkila at intel.com
>>>> CC: katarzyna.piecielska at intel.com
>>>> CC: ryszard.knop at intel.com
>>>> CC: ewelina.musial at intel.com
>>>> CC: adrinael at adrinael.net
>>>> CC: mateusz.grabski at intel.com
>>>> CC: konrad.b.brodzik at intel.com
>>>
>>> imho better here 'Cc:'
>>
>> Thank you!