git and Marge troubles this week

Emma Anholt emma at
Fri Jan 7 17:32:26 UTC 2022

On Fri, Jan 7, 2022 at 6:18 AM Connor Abbott <cwabbott0 at> wrote:
> Unfortunately batch mode has only made it *worse* - I'm sure it's not
> intentional, but it seems that it's still running the CI pipelines
> individually after the batch pipeline passes and not merging them
> right away, which completely defeats the point. See, for example,
> !14213 which has gone through 8 cycles being batched with earlier MRs,
> 5 of those passing only to have an earlier job in the batch spuriously
> fail when actually merging and Marge seemingly giving up on merging it
> (???). As I type it was "lucky" enough to be the first job in a batch
> which passed and is currently running its pipeline and is blocked on
> iris-whl-traces-performance (I have !14453 to disable that broken job,
> but who knows with the Marge chaos when it's going to get merged...).
> Stepping back, I think it was a bad idea to push a "I think this might
> help" type change like this without first carefully monitoring things
> afterwards. An hour or so of babysitting Marge would've caught that
> this wasn't working, and would've prevented many hours of backlog and
> perception of general CI instability.

I spent the day watching marge, like I do every day.  Looking at the
logs, we got 0 MRs in during my work hours PST, out of about 14 or so
marge assignments that day.  Leaving marge broken for the night would
have been indistinguishable from the status quo, was my assessment.

There was definitely some extra spam about trying batches, more than
there were actual batches attempted.  My guess would be gitlab
connection reliability stuff, but I'm not sure.

Of the 5 batches marge attempted before the change was reverted, three
fell to, one to
the git fetch fails, and one to a new timeout I don't think I've seen
before.  Of all the sub-MRs involved in those batches, I think two of those
might have gotten through by dodging the LAVA lab fail.  Marge's batch
backoff did work, and !14436 and maybe !14433 landed during that time.

