git and Marge troubles this week
cwabbott0 at gmail.com
Fri Jan 7 14:18:29 UTC 2022
Unfortunately batch mode has only made it *worse*. I'm sure it's not
intentional, but it seems that Marge is still running the CI pipelines
individually after the batch pipeline passes instead of merging the
MRs right away, which completely defeats the point. See, for example,
!14213, which has gone through 8 cycles of being batched with earlier
MRs. 5 of those passed, only to have an earlier job in the batch
spuriously fail when actually merging, with Marge seemingly giving up
on merging it (???). As I type, it was "lucky" enough to be the first
MR in a batch which passed; it is currently running its own pipeline
and is blocked on iris-whl-traces-performance (I have !14453 to
disable that broken job, but with the Marge chaos, who knows when it's
going to get merged...).
Stepping back, I think it was a bad idea to push an "I think this
might help" type change like this without carefully monitoring things
afterwards. An hour or so of babysitting Marge would've caught that
this wasn't working, and would've prevented many hours of backlog and
the perception of general CI instability.
On Fri, Jan 7, 2022 at 6:36 AM Emma Anholt <emma at anholt.net> wrote:
> As you've probably noticed, there have been issues with git access
> this week. The fd.o sysadmins are desperately trying to stay on
> vacation because they do deserve a break, but have still been working
> on the problem and a couple of solutions haven't worked out yet.
> Hopefully we'll have some news soon.
> Due to these ongoing git timeouts, our CI runners have been getting
> bogged down with stalled jobs and causing a lot of spurious failures
> where the pipeline doesn't get all its jobs assigned to runners before
> Marge gives up. Today, I asked daniels to bump Marge's pipeline
> timeout to 4 hours (up from 1). To get MRs flowing at a similar rate
> despite the longer total pipeline times, we also enabled batch mode as
> described at https://github.com/smarkets/marge-bot/blob/master/README.md#batching-merge-requests.
> It means there are now theoretical cases as described in the README
> where Marge might merge a set of code that leaves main broken.
> However, those cases are pretty obscure, and I expect that failure
> rate to be much lower than the existing "you can merge flaky code"
> failure rate and worth the risk.
> Hopefully this gets us all productive again.