[Mesa-dev] Mesa CI is too slow

Mon Feb 18 18:46:22 UTC 2019

On Mon, Feb 18, 2019 at 9:32 AM Daniel Stone <daniel at fooishbar.org> wrote:
>
> Hi all,
> A few people have noted that Mesa's GitLab CI is just too slow, and
> not usable in day-to-day development, which is a massive shame.
>
> I looked into it a bit this morning, and also discussed it with Emil,
> though nothing in this is speaking for him.
>
> Taking one of the last runs as representative (nothing in it looks
> like an outlier to me, and 7min to build RadeonSI seems entirely
> reasonable):
> https://gitlab.freedesktop.org/mesa/mesa/pipelines/19692/builds
>
> This run executed 24 jobs, which is beyond the limit of our CI
> parallelism. As documented on
> https://www.freedesktop.org/wiki/Infrastructure/ we have 14 concurrent
> job slots (each with roughly 4 vCPUs). Those 24 jobs cumulatively took
> 177 minutes of execution time, and the end-to-end pipeline took 120
> minutes.
>
> 177 minutes of runtime is too long for the runners we have now: if it
> perfectly occupies all our runners it will take over 12 minutes, which
> means that even if no-one else was using the runners, they could
> execute fewer than 5 Mesa builds per hour at full occupancy. Unfortunately,
> VirGL, Wayland/Weston, libinput, X.Org, IGT, GStreamer,
> NetworkManager/ModemManager, Bolt, Poppler, etc, would all probably
> have something to say about that.
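
The arithmetic behind the "over 12 minutes" and "5 builds per hour" figures can be sketched as below. This is only an illustrative calculation using the numbers quoted above (177 job-minutes, 14 slots); the function names are made up for the example, and it assumes the work could be spread perfectly evenly across slots, which real pipelines will not achieve.

```python
def ideal_pipeline_minutes(total_job_minutes: float, slots: int) -> float:
    """Lower bound on wall-clock time if the work were split evenly over all slots."""
    return total_job_minutes / slots


def pipelines_per_hour(pipeline_minutes: float) -> float:
    """How many back-to-back pipelines of this length fit into one hour."""
    return 60.0 / pipeline_minutes


# Figures taken from the email: 177 minutes of job time, 14 concurrent slots.
best_case = ideal_pipeline_minutes(177, 14)
print(f"{best_case:.1f} min per pipeline")                     # 12.6 min per pipeline
print(f"{pipelines_per_hour(best_case):.1f} pipelines/hour")   # 4.7 pipelines/hour
```

So even with every slot dedicated to Mesa, the ceiling is a little under five pipelines per hour.
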
>
> When the runners aren't occupied and there's less contention for jobs,
> it looks quite good:
> https://gitlab.freedesktop.org/anholt/mesa/pipelines/19621/builds
>
> This run 'only' took 20.5 minutes to execute, but then again, 3
> pipelines per hour isn't that great either.
>
> Two hours of end-to-end pipeline time is also obviously far too long.
> Amongst other things, it practically precludes pre-merge CI: by the
> time your build has finished, someone will have pushed to the tree, so
> you need to start again. Even if we serialised it through a bot, that
> would limit us to pushing 12 changesets per day, which seems too low.
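
The "12 changesets per day" bound follows from running one pipeline at a time around the clock. A small sketch of that calculation, assuming a strictly serialised merge bot with no idle time between pipelines (the helper name is hypothetical):

```python
def serialized_merges_per_day(pipeline_minutes: float) -> int:
    """Whole pipelines that fit in 24 hours when run strictly one after another."""
    minutes_per_day = 24 * 60
    return int(minutes_per_day // pipeline_minutes)


print(serialized_merges_per_day(120))   # 12, the two-hour pipeline from the email
print(serialized_merges_per_day(20.5))  # 70, the uncontended 20.5-minute run
```
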
>
> I'm currently talking to two different hosts to try to get more
> sponsored time for CI runners. Those are both on hold this week due to
> travel / personal circumstances, but I'll hopefully find out more next
> week. Eric E filed an issue
> (https://gitlab.freedesktop.org/freedesktop/freedesktop/issues/120) to
> enable a ccache cache, but I don't see myself having the time to do it
> before next month.
>
> In the meantime, it would be great to see how we could reduce the
> number of jobs Mesa runs for each pipeline. Given we're already
> exceeding the limits of parallelism, having so many independent jobs
> isn't reducing the end-to-end pipeline time, but instead just
> duplicating effort required to fetch and check out sources, cache (in
> the future), start the container, run meson or ./configure, and build
> any common files.
>
> I'm taking it as a given that at least three separate builds are
> required: autotools, Meson, and SCons. Fair enough.
>
> It's been suggested to me that SWR should remain separate, as it takes
> longer to build than the other drivers, and getting fast feedback is
> important, which is fair enough.
>
> Suggestion #1: merge scons-swr into scons-llvm. scons-nollvm will
> already provide fast feedback on whether we've broken the SCons build, and
> the rest is pretty uninteresting, so merging scons-swr into scons-llvm
> might help cut down on duplication.
>
> Suggestion #2: merge the misc Gallium jobs together. The
> gallium-radeonsi and gallium-st-other builds are both relatively quick. We
> could merge these into gallium-drivers-other for a very small increase
> in overall runtime for that job, and save ourselves probably about 10%
> of the overall build time here.
>
> Suggestion #3: don't build so much LLVM in autotools. The Meson
> clover-llvm builds take half the time the autotools builds do. Perhaps
> we should only build one LLVM variant within autotools (to test the
> autotools LLVM selection still works), and then build all the rest
> only in Meson. That would be good for another 15-20% reduction in
> overall pipeline run time.
>
> Suggestion #4 (if necessary): build SWR less frequently. Can we
> perhaps demote SWR to an 'only:' job which will only rebuild SWR if
> SWR itself or Gallium have changed? This would save a good chunk of
> runtime - again close to 10%.
>
> Doing the above would reduce the run time fairly substantially, with,
> as far as I can tell, no loss in functional coverage, and bring the
> parallelism to a mere 1.5x oversubscription of the whole
> organisation's available job slots, from the current 2x.
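
For reference, the oversubscription figure is just jobs per pipeline divided by available slots. The sketch below uses the 24 jobs and 14 slots quoted earlier; the function names are illustrative, and the 1.5x target is taken from the paragraph above rather than from any measured job count.

```python
import math


def oversubscription(jobs: int, slots: int) -> float:
    """Ratio of jobs in one pipeline to concurrent job slots available."""
    return jobs / slots


def max_jobs_for_ratio(ratio: float, slots: int) -> int:
    """Largest job count per pipeline that stays at or below the given ratio."""
    return math.floor(ratio * slots)


print(f"{oversubscription(24, 14):.2f}x")  # 1.71x, i.e. roughly the "2x" quoted
print(max_jobs_for_ratio(1.5, 14))         # 21 jobs per pipeline for 1.5x
```
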
>
> Any thoughts?

All of your suggestions seem reasonable.

Removing autotools [1] would obviously reduce the number of builds.

If I understood correctly, we are kicking off a CI run for every push
to a fork of the Mesa repo, and not just for merge requests. I think
that's absolutely the wrong thing to do. CI for personal branches
should be opt-in.

[1] https://bugs.freedesktop.org/show_bug.cgi?id=mesa-autotools-removal