git and Marge troubles this week

Fri Jan 7 18:06:55 UTC 2022

On Fri, Jan 7, 2022 at 6:32 PM Emma Anholt <emma at anholt.net> wrote:
>
> On Fri, Jan 7, 2022 at 6:18 AM Connor Abbott <cwabbott0 at gmail.com> wrote:
> >
> > Unfortunately batch mode has only made it *worse* - I'm sure it's not
> > intentional, but it seems that it's still running the CI pipelines
> > individually after the batch pipeline passes and not merging them
> > right away, which completely defeats the point. See, for example,
> > !14213 which has gone through 8 cycles being batched with earlier MRs,
> > 5 of those passing only to have an earlier job in the batch spuriously
> > fail when actually merging and Marge seemingly giving up on merging it
> > (???). As I type it was "lucky" enough to be the first job in a batch
> > which passed and is currently running its pipeline and is blocked on
> > iris-whl-traces-performance (I have !14453 to disable that broken job,
> > but who knows with the Marge chaos when it's going to get merged...).
> >
> > Stepping back, I think it was a bad idea to push a "I think this might
> > help" type change like this without first carefully monitoring things
> > afterwards. An hour or so of babysitting Marge would've caught that
> > this wasn't working, and would've prevented many hours of backlog and
> > perception of general CI instability.
>
> I spent the day watching marge, like I do every day.  Looking at the
> logs, we got 0 MRs in during my work hours PST, out of about 14 or so
> marge assignments that day.  Leaving marge broken for the night would
> have been indistinguishable from the status quo, was my assessment.
>
> There was definitely some extra spam about trying batches, more than
> there were actual batches attempted.  My guess would be gitlab
> connection reliability stuff, but I'm not sure.
>
> Of the 5 batches marge attempted before the change was reverted, three
> fell to https://gitlab.freedesktop.org/mesa/mesa/-/issues/5837, one to
> the git fetch fails, and one to a new timeout I don't think I've seen
> before: https://gitlab.freedesktop.org/mesa/mesa/-/jobs/17357425#L1731.
> Of all the sub-MRs involved in those batches, I think two of those
> might have gotten through by dodging the LAVA lab fail.  Marge's batch
> backoff did work, and !14436 and maybe !14433 landed during that time.

Looks like I was a bit off with the numbers, but I double-checked and
these batch MRs containing !14213 all passed and yet it didn't get
merged: !14456, !14452, !14449, !14445, !14440, !14438... so it was
actually 6 passing batches, not 5.

!14436, for whatever reason, was never put into a batch - it was
handled the same way as before the change, probably because there
weren't other MRs to combine it with at the time. I've been looking
through Marge's history and can't find a single example of a
successful batched merge.
Typically, when a batch MR passes, Marge rebases the first MR in the
batch but doesn't merge it. Instead that MR's own pipeline gets run
and (seemingly) Marge moves on and picks some other MR, not even
waiting for the pipeline to finish. Since (iirc) Marge picks MRs by
least-recently-active and the rebase generates activity, the MR gets
shoved to the back of the queue and ends up locked in a cycle (!14213
is the worst, but there are others). I think this happens because Mesa
gates acceptance on the pipeline passing, so when Marge goes to merge
the MRs in a batch she can't (the rebased MRs don't have a passing
pipeline of their own yet), and she just moves on to the next one.
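
To make the suspected loop concrete, here is a toy simulation. It is
only a sketch of the behaviour I'm guessing at, not Marge's actual
code: the function name, the logical clock and the "MR-b"/"MR-c"
placeholders are all made up for illustration, and it assumes that
picking is purely least-recently-active and that a rebase counts as
activity.

```python
def simulate(mr_ids, rounds, merge_needs_own_pipeline):
    """Toy model of a merge bot's queue; NOT Marge's real logic."""
    # Logical timestamps stand in for "last activity" on each MR.
    clock = 0
    last_activity = {}
    for mr in mr_ids:
        last_activity[mr] = clock
        clock += 1

    merged = []
    picks = {mr: 0 for mr in mr_ids}
    for _ in range(rounds):
        if not last_activity:
            break
        # Assumed policy: pick the least-recently-active MR.
        mr = min(last_activity, key=last_activity.get)
        picks[mr] += 1
        if merge_needs_own_pipeline:
            # The rebase bumps the MR's activity, the merge is refused
            # because its own pipeline hasn't passed yet, and the bot
            # moves on without waiting.
            last_activity[mr] = clock
            clock += 1
        else:
            # Merge is accepted right away after the batch passed.
            merged.append(mr)
            del last_activity[mr]
    return merged, picks


if __name__ == "__main__":
    queue = ["!14213", "MR-b", "MR-c"]  # the last two are placeholders
    print(simulate(queue, 9, merge_needs_own_pipeline=True))
    print(simulate(queue, 9, merge_needs_own_pipeline=False))
```

With the gate on, nothing ever merges in this model and every MR just
keeps getting picked again in turn; with the gate off, each MR is
picked once and merged in order.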

Connor