systemd-sysupdate support for slow rollout (aka A/B testing)

Tue Jan 2 10:16:15 UTC 2024

On Mi, 20.12.23 19:04, Nils Kattenbeck (nilskemail at gmail.com) wrote:

> Hey everyone,
>
> does sysupdate currently support any way to slowly roll out updates
> where the server providing the files can be in control? This would be
> used to slowly make a new version available and have it at e.g. 1%
> adoption for a day to monitor regressions before increasing the
> coverage. I was unable to find any information about it in the
> documentation.

This is currently not available, no.

The idea so far was always that the server is dumb, and the client
picks the release it wants.

I thought about this use case a while back, and my thinking was
that such staged update logic should be driven by the machine
ID, i.e. we should teach sysupdate a simple logic that allows pattern
matching of new versions based on some arithmetic on the machine
ID. More specifically, include some value in the URL pattern that
indicates the percentage of hosts that shall update to this
release. Then, each client takes its machine ID, treats it as an
integer and calculates modulo 100 of it or so, and then checks if the
resulting value is below the intended percentage, and if so it
updates, otherwise it doesn't.

(or something like that, the above is probably not ideal, since it
would mean it's always the same hosts that try a new release first,
and it probably should be evened out across the set of clients).
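
To make the arithmetic above concrete, here is a minimal sketch in
Python. Nothing like this exists in sysupdate today, and all function
names are made up for illustration. The first two functions follow
the "machine ID modulo 100" description literally. The third is one
possible way to even things out across releases by hashing the
machine ID together with the version string; that particular choice
is an assumption of this sketch, not something decided in this thread.

```python
import hashlib


def rollout_bucket(machine_id: str) -> int:
    """Treat the machine ID (32 hex characters) as an integer, modulo 100."""
    return int(machine_id, 16) % 100


def should_update(machine_id: str, percentage: int) -> bool:
    """Update only if this host's bucket is below the advertised percentage."""
    return rollout_bucket(machine_id) < percentage


def salted_bucket(machine_id: str, version: str) -> int:
    """Hypothetical variant: mix the version into the bucket so that a
    different subset of hosts goes first for each release."""
    digest = hashlib.sha256(f"{machine_id}:{version}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100
```

With a percentage of 100 every host passes the check, and with 0 none
does, so the server-side naming alone decides how far a release
spreads by default.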

This would then mean for the server that it would first serve
foobar_47.11_3.raw which would be version 47.11 of the OS, and 3% of
the hosts would update to it. And then, once you collected enough
feedback you'd rename the file to foobar_47.11_25.raw and 25% of the
hosts would switch over. Finally you'd set the value to 100 (or maybe
just drop it, which should be considered equivalent to 100), and then
all remaining hosts would update.
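
As a rough sketch of how a client might read that naming scheme, the
following Python snippet extracts the version and the rollout
percentage from a file name, treating a missing percentage as 100.
The regular expression and the function name are purely illustrative
and hardcode the "foobar_" example from above; real sysupdate match
patterns would need their own syntax for this, which is not designed
yet.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for the example names: foobar_<version>[_<percent>].raw
_IMAGE_RE = re.compile(r"^foobar_(?P<version>[0-9.]+)(?:_(?P<percent>[0-9]+))?\.raw$")


def parse_image_name(name: str) -> Optional[Tuple[str, int]]:
    """Return (version, percentage), or None if the name does not match."""
    m = _IMAGE_RE.match(name)
    if m is None:
        return None
    percent = m.group("percent")
    # A missing percentage is considered equivalent to 100.
    return m.group("version"), 100 if percent is None else min(int(percent), 100)
```

For example, parse_image_name("foobar_47.11_3.raw") yields the
version "47.11" with a percentage of 3, while "foobar_47.11.raw"
yields the same version with a percentage of 100.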

The effect of this is that clients could still explicitly upgrade if
they want, and the updates would be entirely driven by the clients,
but simply via naming the download images the server can control that
"by default" only the chosen number of clients update.

> Currently it seems like I would have to implement a different service
> which calls the sysupdate binary (or uses dbus once #28134 has landed)
> and then decides based on some other information.
>
> One idea I had would be that systemd-pull could send the machine-id
> based on which the server could then decide to provide the newer file
> (e.g. last two chars == "00" would roll it out to ~1/255). Though I am
> not sure if sd-pull is supposed to be "anonymous", i.e. do not provide
> this identifying information. Another drawback of this would be that
> stateless systems which reboot often get a new machine-id each boot,
> thus having an increased chance to get the newer version.

So this idea is not entirely different from my idea, I was just
thinking about pushing this into sysupdate rather than pull.

> Does anything like this already exist or is planned? Or should that be
> done by different applications on the client side?

I think it makes a ton of sense to add this to sysupdate. Would love
to review/merge a patch for that.

> I also remember there being a discussion about plugging in different
> sd-pull like implementations/backends[1] to support delta updates,
> other transports, or TLS client authentication. This could at least be
> adapted to support my idea to send the machine-id as an HTTP header
> (e.g. X-MACHINE-ID).

If we can avoid it, I'd always adopt a logic where identifying info
doesn't have to be sent to the server. After all, the logic should be
generic and applicable in scenarios where the client should get
as much anonymity as it wants.

The machine-id we usually consider a "half-secret", i.e. all local
programs get access to it (unless sandboxed), but they are not
supposed to send it across the wire. If they really need to send
some identifier across the wire they should derive an app-specific ID
instead, which we make easy to acquire via
sd_id128_get_machine_app_specific().
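
To illustrate the idea behind such a derived ID: as far as the man
page describes it, the function computes an HMAC-SHA256 of the
application ID, keyed by the machine ID. The Python sketch below shows
that general construction only. It is not the libsystemd code, the
function name is made up, and it makes no claim to reproduce the
exact bytes the real call returns (any post-processing of the result
is left out).

```python
import hashlib
import hmac


def derive_app_specific_id(machine_id_hex: str, app_id_hex: str) -> str:
    """Illustrative only: keyed hash of the app ID, using the machine ID as key."""
    key = bytes.fromhex(machine_id_hex)
    msg = bytes.fromhex(app_id_hex)
    if len(key) != 16 or len(msg) != 16:
        raise ValueError("both IDs must be 128 bits (32 hex characters)")
    # Truncate the 256-bit HMAC to 128 bits so the result is ID-sized.
    return hmac.new(key, msg, hashlib.sha256).digest()[:16].hex()
```

The point of the construction is that the derived value is stable for
a given machine and application, but the machine ID itself cannot be
recovered from it.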

But better than app-specific machine IDs are no machine IDs at all in
the protocol, if we can get away with it. Hence, my idea of doing the
rollout percentage logic client-side.

Lennart

--
Lennart Poettering, Berlin