[Mesa-dev] Mesa CI is too slow
Eric Engestrom
eric.engestrom at intel.com
Mon Feb 18 18:58:25 UTC 2019
On Monday, 2019-02-18 17:31:41 +0000, Daniel Stone wrote:
> Hi all,
> A few people have noted that Mesa's GitLab CI is just too slow, and
> not usable in day-to-day development, which is a massive shame.
Agreed :/
>
> I looked into it a bit this morning, and also discussed it with Emil,
> though nothing in this is speaking for him.
>
> Taking one of the last runs as representative (nothing in it looks
> like an outlier to me, and 7min to build RadeonSI seems entirely
> reasonable):
> https://gitlab.freedesktop.org/mesa/mesa/pipelines/19692/builds
>
> This run executed 24 jobs, which is beyond the limit of our CI
> parallelism. As documented on
> https://www.freedesktop.org/wiki/Infrastructure/ we have 14 concurrent
> job slots (each with roughly 4 vCPUs). Those 24 jobs cumulatively took
> 177 minutes of execution time, taking 120 minutes for the end-to-end
> pipeline.
>
> 177 minutes of runtime is too long for the runners we have now: if it
> perfectly occupies all our runners it will take over 12 minutes, which
> means that even if no-one else was using the runners, they could
> execute 5 Mesa builds per hour at full occupancy. Unfortunately,
> VirGL, Wayland/Weston, libinput, X.Org, IGT, GStreamer,
> NetworkManager/ModemManager, Bolt, Poppler, etc, would all probably
> have something to say about that.
>
> When the runners aren't occupied and there's less contention for jobs,
> it looks quite good:
> https://gitlab.freedesktop.org/anholt/mesa/pipelines/19621/builds
>
> This run 'only' took 20.5 minutes to execute, but then again, 3
> pipelines per hour isn't that great either.
>
> Two hours of end-to-end pipeline time is also obviously far too long.
> Amongst other things, it practically precludes pre-merge CI: by the
> time your build has finished, someone will have pushed to the tree, so
> you need to start again. Even if we serialised it through a bot, that
> would limit us to pushing 12 changesets per day, which seems too low.
>
> I'm currently talking to two different hosts to try to get more
> sponsored time for CI runners. Those are both on hold this week due to
> travel / personal circumstances, but I'll hopefully find out more next
> week. Eric E filed an issue
> (https://gitlab.freedesktop.org/freedesktop/freedesktop/issues/120) to
> enable ccache cache but I don't see myself having the time to do it
> before next month.
Just to chime in on this point: I also have an MR to enable ccache per
runner, which, with our static-runner setup, is not much worse than a
shared cache:
https://gitlab.freedesktop.org/mesa/mesa/merge_requests/240
From my cursory testing, this should already cut compilation times by
80-90% :)
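For the curious, here's a minimal sketch of what per-runner ccache
support can look like in .gitlab-ci.yml. The variable values and the
cache key below are illustrative assumptions, not necessarily what the
MR above actually does:

```yaml
# Sketch only: exact paths and keys are assumptions, not taken from MR 240.
variables:
  # Keep the ccache directory inside the project dir so GitLab's
  # cache mechanism can preserve it between jobs.
  CCACHE_DIR: "${CI_PROJECT_DIR}/.ccache"

cache:
  # One cache per job name, so e.g. the meson and autotools builds
  # don't evict each other's cached objects.
  key: "${CI_JOB_NAME}"
  paths:
    - .ccache/
```

Each build job then just needs to compile through ccache, e.g. by
exporting CC="ccache gcc" and CXX="ccache g++" in its script.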
>
> In the meantime, it would be great to see how we could reduce the
> number of jobs Mesa runs for each pipeline. Given we're already
> exceeding the limits of parallelism, having so many independent jobs
> isn't reducing the end-to-end pipeline time, but instead just
> duplicating effort required to fetch and check out sources, cache (in
> the future), start the container, run meson or ./configure, and build
> any common files.
>
> I'm taking it as a given that at least three separate builds are
> required: autotools, Meson, and SCons. Fair enough.
>
> It's been suggested to me that SWR should remain separate, as it takes
> longer to build than the other drivers, and getting fast feedback is
> important, which is fair enough.
>
> Suggestion #1: merge scons-swr into scons-llvm. scons-nollvm will
> already provide fast feedback on if we've broken the SCons build, and
> the rest is pretty uninteresting, so merging scons-swr into scons-llvm
> might help cut down on duplication.
>
> Suggestion #2: merge the misc Gallium jobs together. Building
> gallium-radeonsi and gallium-st-other are both relatively quick. We
> could merge these into gallium-drivers-other for a very small increase
> in overall runtime for that job, and save ourselves probably about 10%
> of the overall build time here.
>
> Suggestion #3: don't build so much LLVM in autotools. The Meson
> clover-llvm builds take half the time the autotools builds do. Perhaps
> we should only build one LLVM variant within autotools (to test the
> autotools LLVM selection still works), and then build all the rest
> only in Meson. That would be good for another 15-20% reduction in
> overall pipeline run time.
>
> Suggestion #4 (if necessary): build SWR less frequently. Can we
> perhaps demote SWR to an 'only:' job which will only rebuild SWR if
> SWR itself or Gallium have changed? This would save a good chunk of
> runtime - again close to 10%.
>
> Doing the above would reduce the run time fairly substantially, with
> no loss in functional coverage as far as I can tell, and bring the
> parallelism to a mere 1.5x oversubscription of the whole
> organisation's available job slots, from the current 2x.
>
> Any thoughts?
Your suggestions all sound good, although I can't speak for #1 and #2.
#3 sounds good; I guess we can keep the Meson builds with both the
"oldest supported LLVM" and the "current LLVM" versions, and only the
"oldest supported" one for autotools?
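As a rough sketch of how that split could look in .gitlab-ci.yml (the
job names, the YAML anchor, and the LLVM versions below are all
hypothetical, purely to illustrate the idea):

```yaml
# Hypothetical job layout: both LLVM versions built with Meson,
# only the oldest one with autotools.
.meson-llvm: &meson-llvm
  script:
    - meson build/
    - ninja -C build/

meson-llvm-oldest:
  <<: *meson-llvm
  variables:
    LLVM_VERSION: "3.9"   # illustrative, not the actual minimum

meson-llvm-current:
  <<: *meson-llvm
  variables:
    LLVM_VERSION: "7"     # illustrative

autotools-llvm-oldest:
  script:
    - ./autogen.sh --enable-llvm
    - make -j4
  variables:
    LLVM_VERSION: "3.9"   # illustrative
```

The hidden `.meson-llvm` job plus the `<<:` merge key keeps the shared
build recipe in one place, so adding or dropping an LLVM variant is a
three-line change.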
I think suggestion #4 (tracking which files actually affect the build)
would be good for all of the jobs, but could quickly become complicated
to keep up to date. I guess the jobs with trivial file sets to track
should get that treatment, and we can leave the complicated ones as
they are.
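For the jobs with a trivial file set, GitLab's only:changes should do
the trick; here's a sketch for the SWR case (the paths are assumptions
to be checked against the actual tree):

```yaml
# Hypothetical: only rebuild SWR when SWR itself or the shared
# Gallium code changes.
gallium-swr:
  script:
    - meson build/ -Dgallium-drivers=swr
    - ninja -C build/
  only:
    changes:
      - src/gallium/drivers/swr/**/*
      - src/gallium/auxiliary/**/*
```

One caveat, if I read the GitLab docs right: on a brand-new branch
there is nothing to diff against, so `changes` evaluates to true and
the job runs anyway, which is probably the safe default here.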
---
You've suggested reducing the amount that's built (ccache,
dropping/merging jobs) and reducing the oversubscription (fewer, merged
jobs), but there's another avenue to look at: running the CI less
often.
Ideally, the CI would run on every single commit, but since that's not
realistic, we need to decide what's essential.
From most to least important:
- master: everything that hits master needs to be build- and smoke-tested
- stable branches: we obviously don't want to break stable branches
- merge requests: the reason I wrote the CI was to automatically test MRs
- personal work on forks: it would be really useful to test things
before sending out an MR, especially with the less-used build systems
that we often forget to update, but this should be opt-in, not opt-out
as it is right now.
Ideally, this means we add this to the .gitlab-ci.yml:

  only:
    - master
    - merge_requests
    - ci/*
Until this morning, I thought `merge_requests` was an Enterprise
Edition-only feature, which is why I didn't put it in, but it appears
I was wrong; see:
https://docs.gitlab.com/ce/ci/merge_request_pipelines/
(Thanks Caio for reading through the docs more carefully than I did! :)
I'll send an MR in a bit with the above. This will mean that master and
MRs get automatic CI, and pushes on forks don't (except the fork's
master), but one can push a `ci/*` branch to their own fork to run the
CI on it.
I think this should massively reduce how often the CI runs, while
mostly cutting only the unwanted runs :)
Cheers,
Eric