[PATCH i-g-t] Bump aborting on network failure deadline to 40 seconds

Wed Feb 12 14:31:16 UTC 2025

On Wed, 2025-02-12 at 14:52 +0100, Kamil Konieczny wrote:
> Hi Knop,,
> On 2025-02-12 at 10:06:47 +0000, Knop, Ryszard wrote:
> > On Tue, 2025-02-11 at 11:59 +0000, Piecielska, Katarzyna wrote:
> > > Hello Peter
> > > 
> > > 
> > > 
> > > Hi Kamil,
> > > 
> > > Thank you for your review. Please see below.
> > > 
> > > On 11.02.2025 10:21, Kamil Konieczny wrote:
> > > > Hi Peter,
> > > > On 2025-02-06 at 16:21:47 +0100, Peter Senna Tschudin wrote:
> > > > > Commit ddfde25f16ba ("runner: Add support for aborting on network
> > > > > failure") introduced a 20 second deadline for the DUT’s network to 
> > > > > recover after a suspend/resume cycle. If the network isn’t back up 
> > > > > within that time, igt_runner aborts the test run to save logs and 
> > > > > prevent potential log loss from an imminent power cycle.
> > > > > 
> [...]
> 
> > > > > wait time, extending it further would unnecessarily prolong tests in 
> > > > > cases of DUT hangs.
> > > > > 
> > > > > Bumping the deadline to 40 seconds keeps the abort mechanism safely
> 
> [...]
> 
> > > > 
> > > > imho this should be option for igt-runner, I would prefer to not 
> > > > adjust it later, let CI team tune it. Option could be either time or 
> > > > retry counter or both.
> > > 
> > > I disagree. We can add value now by potentially preventing premature aborts with very little risk of creating any issue. The people in CC seems to agree with that.
> 
> I want to avoid bumping it to lower or bigger value later
> in future, see my proposition below.
> 
> > > 
> > > This patch is as simple as it can get. I don't buy the benefit of the extra complexity of adding yet another command line option. It is way more effective to send one of these every 6 years or so.
> > > 
> > > Am I correct that this is a nack from you?
> > > 
> > > This looks like we need a good discussion with CI team. @Knop, Ryszard @Grabski, Mateusz can you comment on this approach?
> > 
> > - If the current value is good and the ping timeout causes machines to
> > take 20 seconds longer on each broken run, it's just 20 seconds.
> > Compared to everything else we do suboptimally, it costs maybe a few
> > minutes on a bad day, so it's not a problem if it's overshot.
> > - If the current value is too low, runs take longer, but on the other
> > hand we get less broken runs and less noise - I would prefer this
> > instead of spurious aborts.
> > 
> > IMO it's fine to bump this to 40s.
> > 
> > Thanks, Ryszard
> > 
> 
> Can I take it as acked-by by you? Thanks!

Yes, this is:

Acked-by: Ryszard Knop <ryszard.knop at intel.com>

> I also have one more question, do we need this now or could
> we work longer and improve it? imho it is better to add one
> more way for CI to increase this value by CI team instead of
> here in code.

The patch changes a single digit in the source code, on the line that
was added, and last changed, in late 2019. It's really fine just to
change this here and be done with it. Less items in the already years-
long backlog please, not more :)

Ryszard

> I mean we could add to runner one more env var IGT_PING_DEADLINE
> and to igtrc DUT section PingHostDeadline, what you think?
> 
> Regards,
> Kamil
> 
> > > Thanks,
> > > Kasia
> > > 
> > > > 
> > > > > within our internal CI system retry window while improving chances of 
> > > > > preventing a premature abort. For upstream testing on Jenkins, the 
> > > > > deadlines vary from 16 and 25 minutes, and this change has no impact.
> > > > > 
> > > > > CC: juha-pekka.heikkila at intel.com
> > > > > CC: katarzyna.piecielska at intel.com
> > > > > CC: ryszard.knop at intel.com
> > > > > CC: ewelina.musial at intel.com
> > > > > CC: adrinael at adrinael.net
> > > > > CC: mateusz.grabski at intel.com
> > > > > CC: konrad.b.brodzik at intel.com
> > > > 
> > > > imho better here 'Cc:'
> > > 
> > > Thank you!
> >