[Mesa-dev] postmortem: arm64_test job timeouts today

Eric Anholt eric at anholt.net
Sat Jul 18 00:11:16 UTC 2020

With the landing of
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5839 we
entered a state that caused future pipelines to fail.

This is due to an unfortunate interaction between GitLab MRs and
ci-templates' model of container image distribution: MRs are tested in
the submitter's repository, but ci-templates only replicates container
images from mesa/mesa to user repositories, never the other way.  So,
if someone has ever uploaded a container image to their repo under a
tag that can pass the tests, their pipeline reuses that image instead
of rebuilding it, and they can land code that makes all future
pipelines fail for everyone who has to build the image.  In this case,
arm64_test was near the pipeline timeout and was failing for most
people, including marge, and marge's queue ended up quite backed up.
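
To make the failure mode concrete, here is a rough toy model of the
lookup order described above.  It is only a sketch and not
ci-templates' actual code: the function name, the return strings, and
the tag are all made up for illustration.

```python
# Toy model of the image lookup described above. This is NOT ci-templates'
# real code; the function, strings, and tag below are hypothetical.

def resolve_image(tag, project_registry, upstream_registry):
    """Say how a pipeline in one project would obtain the image for `tag`."""
    if tag in project_registry:
        return "reuse cached image"      # no container build runs here
    if tag in upstream_registry:
        return "copy from mesa/mesa"     # replication only goes downward
    return "rebuild from scratch"        # the slow path that can time out

TAG = "arm64_test-example-tag"           # hypothetical tag name

upstream = set()                         # mesa/mesa never received this tag
submitter = {TAG}                        # uploaded to the fork at some point
other_user = set()                       # anyone without a cached copy

print(resolve_image(TAG, submitter, upstream))
print(resolve_image(TAG, other_user, upstream))
```

The submitter's pipeline takes the first branch and never exercises
the container build, while everyone else falls through to the rebuild.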

There are a few things we can do to mitigate this particular job's timeouts:

- https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5669 would
double the -j parallelism of our builds on fd.o's x86 runners (such
as for the arm64_test job)
- https://gitlab.freedesktop.org/mesa/mesa/-/issues/3123 would let us
move back to Debian testing or unstable for the test images, and use
more Debian packages (like apitrace) instead of hand-building them in
our CI system
- https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=962718 would let
us cut a big portion of the test container build times

However, I have no solution for the general problem of "users can
merge code that causes failing container builds for others."  Could we
make ci-templates not use registry-cached containers in marge-bot
pipelines, and then replicate the image up to mesa/mesa somehow?
