[PATCH i-g-t] Bump aborting on network failure deadline to 40 seconds

Thu Feb 13 13:37:44 UTC 2025

Hi Peter,
On 2025-02-12 at 16:38:34 +0100, Peter Senna Tschudin wrote:
> 

One nit I forgot to add, in subject there should be 'runner: '
prefix, just like in commit you cite in beginning.

In conclusion, are you ok with me aditing 2nd paragraph, dropping
4th paragraph where you wrote about internal-upstream CI,
dropping 'internal' word plus adding 'runner: ' prefix in subject?
If yes I could merge it.

Regards,
Kamil

> 
> On 12.02.2025 15:59, Kamil Konieczny wrote:
> > Hi Peter,
> > On 2025-02-11 at 10:55:37 +0100, Peter Senna Tschudin wrote:
> >> Hi Kamil,
> >>
> >> Thank you for your review. Please see below.
> >>
> >> On 11.02.2025 10:21, Kamil Konieczny wrote:
> >>> Hi Peter,
> >>> On 2025-02-06 at 16:21:47 +0100, Peter Senna Tschudin wrote:
> >>>> Commit ddfde25f16ba ("runner: Add support for aborting on network
> >>>> failure") introduced a 20 second deadline for the DUT’s network
> >>>> to recover after a suspend/resume cycle. If the network isn’t
> >>>> back up within that time, igt_runner aborts the test run to save logs
> >>>> and prevent potential log loss from an imminent power cycle.
> >>>>
> >>>> This deadline was set to accommodate our internal CI system, which
> >>>> checks for DUT network connectivity every 5 seconds and retries up
> >>>> to 3 times at 20 second intervals. If it fails 3 consecutive checks,
> >>>
> >>> This is a little confusing, you wrote in first paragraph about
> >>> 20 second deadline and here it looks like 60 seconds (3*20).
> >>
> >> The first paragraph is explicitly about the deadline introduced by a
> >> previous commit. The second paragraph is explicitly about the internal
> >> CI mechanism that inspired the deadline. Can you explain what is
> >> confusing?
> >>
> > 
> > Using same value "20 seconds" confused me, sorry. imho
> > it is better to write something like:
> > 
> > External monitoring machine checks DUTs health and if it
> > cannot reach it over network for 60 seconds it will reboot DUT.
> > 
> > So no need to exactly describe how it is done, only what is
> > happening and why runner needs to adapt to that.
> > 
> >>>
> >>>> it triggers a power cycle on the DUT.
> >>>>
> >>>> Although our internal CI system can be configured with a longer
> >>> -------------- ^^^^^^^^
> >>> Remove this.
> >>
> >> No, why?
> >>
> > 
> > Looks like it happened when Ci tested upstream commit:
> > https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/10030
> > 
> > and it is one year old, do we have anything new?
> > 
> >>>
> >>>> wait time, extending it further would unnecessarily prolong tests
> >>>> in cases of DUT hangs.
> >>>>
> >>>> Bumping the deadline to 40 seconds keeps the abort mechanism safely
> >>>
> >>> imho this should be option for igt-runner, I would prefer to not
> >>> adjust it later, let CI team tune it. Option could be either time
> >>> or retry counter or both.
> >>
> >> I disagree. We can add value now by potentially preventing premature
> > 
> > I would prefer actual case solved by this change, not 'potential' one.
> 
> Unfortunately we are dealing with an issue that is not deterministic. There
> is no way to know how long it will take for the network to go up after a
> suspend and resume cycle.
> 
> Our main assumption is that because 20 seconds is enough for the network to
> come up after most suspend/resume cycles, doubling it should considerably
> reduce premature aborts, solving the problem.
> 
> Unfortunately for cases where the network takes more than 40 seconds to go
> up, this patch makes no difference. But then it is ok to abort.
> 
> > 
> >> aborts with very little risk of creating any issue. The people in CC
> >> seems to agree with that.
> >>
> >> This patch is as simple as it can get. I don't buy the benefit of the
> >> extra complexity of adding yet another command line option. It is way
> >> more effective to send one of these every 6 years or so.
> > 
> > Sorry, I didn't look into code, after reading part of it setting this
> > is done not by runner options but by igtrc file or enviroment, see
> > my reply to Ryszard. imho it is better to put these into configs/options
> > then hard-coding into runner, we could keep old 20s as default.
> > 
> >>
> >> Am I correct that this is a nack from you?
> >>
> > 
> > No, I am just not convinced it will solve anything, do you have
> > numbers? What are times needed for hibernate and resume? Because
> > it will be time-in-hibernate + time-for-resume + time-for-ping-deadline
> > and if this exceeds 60sec DUT will be rebooted.
> 
> Unfortunately CI does not track how long the network takes to go up after
> a suspend and resume cycle.
> 
> Statistically speaking, we expect that bumping the deadline to 40 seconds
> will reduce premature aborts meaningfully, solving the problem. We have an
> asymmetric situation, which is good: having this patch has potential to
> make a positive impact, while the patch risk is very low.
> 
> Sorry but nothing is exact here, still this patch makes sense.
> 
> > 
> > Regards,
> > Kamil
> > 
> >>>
> >>>> within our internal CI system retry window while improving chances
> >>>> of preventing a premature abort. For upstream testing on Jenkins,
> >>>> the deadlines vary from 16 and 25 minutes, and this change has
> >>>> no impact.
> >>>>
> >>>> CC: juha-pekka.heikkila at intel.com
> >>>> CC: katarzyna.piecielska at intel.com
> >>>> CC: ryszard.knop at intel.com
> >>>> CC: ewelina.musial at intel.com
> >>>> CC: adrinael at adrinael.net
> >>>> CC: mateusz.grabski at intel.com
> >>>> CC: konrad.b.brodzik at intel.com
> >>>
> >>> imho better here 'Cc:'
> >>
> >> Thank you!
>