git and Marge troubles this week

Fri Jan 7 14:41:29 UTC 2022

On Fri, Jan 7, 2022 at 3:18 PM Connor Abbott <cwabbott0 at gmail.com> wrote:
>
> Unfortunately batch mode has only made it *worse* - I'm sure it's not
> intentional, but it seems that it's still running the CI pipelines
> individually after the batch pipeline passes and not merging them
> right away, which completely defeats the point. See, for example,
> !14213 which has gone through 8 cycles being batched with earlier MRs,
> 5 of those passing only to have an earlier job in the batch spuriously
> fail when actually merging and Marge seemingly giving up on merging it
> (???). As I type it was "lucky" enough to be the first job in a batch
> which passed and is currently running its pipeline and is blocked on
> iris-whl-traces-performance (I have !14453 to disable that broken job,
> but who knows with the Marge chaos when it's going to get merged...).

Ah, I guess I spoke too soon and my analysis of what's happening was
wrong, because in the meantime she merged !14121 without waiting for
that broken job to timeout. It's somehow just completely ignoring
!14213 and merging other things on top of it, I guess? Either way
something is seriously wrong.

>
> Stepping back, I think it was a bad idea to push a "I think this might
> help" type change like this without first carefully monitoring things
> afterwards. An hour or so of babysitting Marge would've caught that
> this wasn't working, and would've prevented many hours of backlog and
> perception of general CI instability.
>
> Connor
>
> On Fri, Jan 7, 2022 at 6:36 AM Emma Anholt <emma at anholt.net> wrote:
> >
> > As you've probably noticed, there have been issues with git access
> > this week.  The fd.o sysadmins are desperately trying to stay on
> > vacation because they do deserve a break, but have still been working
> > on the problem and a couple of solutions haven't worked out yet.
> > Hopefully we'll have some news soon.
> >
> > Due to these ongoing git timeouts, our CI runners have been getting
> > bogged down with stalled jobs and causing a lot of spurious failures
> > where the pipeline doesn't get all its jobs assigned to runners before
> > Marge gives up.  Today, I asked daniels to bump Marge's pipeline
> > timeout to 4 hours (up from 1).  To get MRs flowing at a similar rate
> > despite the longer total pipeline times, we also enabled batch mode as
> > described at https://github.com/smarkets/marge-bot/blob/master/README.md#batching-merge-requests.
> >
> > It means there are now theoretical cases as described in the README
> > where Marge might merge a set of code that leaves main broken.
> > However, those cases are pretty obscure, and I expect that failure
> > rate to be much lower than the existing "you can merge flaky code"
> > failure rate and worth the risk.
> >
> > Hopefully this gets us all productive again.