[systemd-devel] systemd 240 released
Zbigniew Jędrzejewski-Szmek
zbyszek at in.waw.pl
Fri Dec 21 19:02:57 UTC 2018
systemd System and Service Manager
CHANGES WITH 240:
* NoNewPrivileges=yes has been set for all long-running services
implemented by systemd. Previously, this was problematic due to
SELinux (as this would also prohibit the transition from PID1's label
to the service's label). This restriction has since been lifted, but
an SELinux policy update is required.
(See e.g. https://github.com/fedora-selinux/selinux-policy/pull/234.)
* DynamicUser=yes is dropped from systemd-networkd.service,
systemd-resolved.service and systemd-timesyncd.service, which was
enabled in v239 for systemd-networkd.service and systemd-resolved.service,
and since v236 for systemd-timesyncd.service. The users and groups
systemd-network, systemd-resolve and systemd-timesync are created
by systemd-sysusers again. Distributors or system administrators
may need to create these users and groups if they do not exist (or need
to re-enable DynamicUser= for those units) while upgrading systemd.
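Whether the accounts named above already exist can be checked before upgrading. The following is a minimal sketch in Python using the standard pwd and grp modules; the helper names missing_users() and missing_groups() are made up for this illustration and are not part of systemd.

```python
import grp
import pwd

# Account names taken from the entry above.
SYSTEMD_ACCOUNTS = ["systemd-network", "systemd-resolve", "systemd-timesync"]


def missing_users(names):
    """Return those names that have no user database entry."""
    missing = []
    for name in names:
        try:
            pwd.getpwnam(name)
        except KeyError:
            missing.append(name)
    return missing


def missing_groups(names):
    """Return those names that have no group database entry."""
    missing = []
    for name in names:
        try:
            grp.getgrnam(name)
        except KeyError:
            missing.append(name)
    return missing
```

Calling missing_users(SYSTEMD_ACCOUNTS) and missing_groups(SYSTEMD_ACCOUNTS) lists whatever still has to be created, for example by running systemd-sysusers.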
* When unit files are loaded from disk, previously systemd would
sometimes (depending on the unit loading order) load units from the
target path of symlinks in .wants/ or .requires/ directories of other
units. This meant that a unit could be loaded from different paths
depending on whether the unit was requested explicitly or as a
dependency of another unit, not honouring the priority of directories
in the search path. It also meant that it was possible to successfully
load and start units which are not found in the unit search path, as
long as they were requested as a dependency and linked to from
.wants/ or .requires/. The target paths of those symlinks are not
used for loading units anymore and the unit file must be found in
the search path.
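The new rule amounts to "first match in the search path wins, and nothing outside the search path counts". Here is a rough sketch of that lookup in Python. The three directories are only an assumed, shortened example of a search path, and find_unit() is a hypothetical helper, not systemd code.

```python
import os

# Assumed example search path, highest priority first. The real list is
# longer and depends on how systemd was built and invoked.
EXAMPLE_SEARCH_PATH = [
    "/etc/systemd/system",
    "/run/systemd/system",
    "/usr/lib/systemd/system",
]


def find_unit(name, search_path=EXAMPLE_SEARCH_PATH):
    """Return the path of the first directory entry called `name`, or None.

    Only the top level of each search path directory is consulted.
    Symlink targets found in .wants/ or .requires/ are never used.
    """
    for directory in search_path:
        candidate = os.path.join(directory, name)
        if os.path.lexists(candidate):
            return candidate
    return None
```

A unit file that lives outside the search path and is only linked from some foo.target.wants/ directory is not found by this lookup, which matches the behaviour described above.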
* A new service type has been added: Type=exec. It's very similar to
Type=simple but ensures the service manager will wait for both fork()
and execve() of the main service binary to complete before proceeding
with follow-up units. This is primarily useful so that the manager
propagates any errors in the preparation phase of service execution
back to the job that requested the unit to be started. For example,
consider a service that has ExecStart= set to a file system binary
that doesn't exist. With Type=simple starting the unit would be
considered instantly successful, as only fork() has to complete
successfully and the manager does not wait for execve(), and hence
its failure is seen "too late". With the new Type=exec service type
starting the unit will fail, as the manager will wait for the
execve() and notice its failure, which is then propagated back to the
start job.
NOTE: with the next release 241 of systemd we intend to change the
systemd-run tool to default to Type=exec for transient services
started by it. This should be mostly safe, but in specific corner
cases might result in problems, as the systemd-run tool will then
block on NSS calls (such as user name look-ups due to User=) done
between the fork() and execve(), which under specific circumstances
might cause problems. It is recommended to specify "-p Type=simple"
explicitly in the few cases where this applies. For regular,
non-transient services (i.e. those defined with unit files on disk)
we will continue to default to Type=simple.
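The difference between waiting only for fork() and waiting for execve() as well can be sketched with the classic close-on-exec pipe technique. This is an illustration in Python, not a description of how systemd implements Type=exec; the function name spawn_and_wait_for_exec() is made up for this example.

```python
import errno
import os


def spawn_and_wait_for_exec(path, argv):
    """Fork a child and block until its execv() has succeeded or failed.

    Returns (pid, 0) if execv() succeeded, or (pid, errno) if it failed.
    On failure the child has already been reaped. On success the caller
    is responsible for waiting on the returned pid.
    """
    # os.pipe() returns close-on-exec descriptors, so a successful
    # execv() closes the write end and the parent reads end-of-file.
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        try:
            os.execv(path, argv)
        except OSError as e:
            # execv() failed: report the errno to the parent.
            os.write(write_fd, str(e.errno).encode())
        os._exit(127)

    os.close(write_fd)
    data = os.read(read_fd, 32)
    os.close(read_fd)
    if data:
        os.waitpid(pid, 0)
        return pid, int(data)
    return pid, 0
```

With a Type=simple-like approach the parent would return right after os.fork() and report success even for a missing binary. Here a missing binary is reported as errno.ENOENT before the caller proceeds.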
* The Linux kernel's current default RLIMIT_NOFILE resource limit for
userspace processes is set to 1024 (soft) and 4096
(hard). Previously, systemd passed this on unmodified to all
processes it forked off. With this systemd release the hard limit
systemd passes on is increased to 512K, overriding the kernel's
defaults and substantially increasing the number of simultaneous file
descriptors unprivileged userspace processes can allocate. Note that
the soft limit remains at 1024 for compatibility reasons: the
traditional UNIX select() call cannot deal with file descriptors >=
1024 and increasing the soft limit globally might thus result in
programs unexpectedly allocating a high file descriptor and thus
failing abnormally when attempting to use it with select() (of
course, programs shouldn't use select() anymore, and prefer
poll()/epoll, but the call unfortunately remains undeservedly popular
at this time). This change reflects the fact that file descriptor
handling in the Linux kernel has been optimized in more recent
kernels and allocating large numbers of them should be much cheaper
both in memory and in performance than it used to be. Programs that
want to benefit from the increased limit have to "opt in" to
high file descriptors explicitly by raising their soft limit. Of
course, when they do that they must acknowledge that they cannot use
select() anymore (and neither can any shared library they use — or
any shared library used by any shared library they use and so on).
Which default hard limit is most appropriate is of course hard to
decide. However, given reports that ~300K file descriptors are used
in real-life applications we believe 512K is sufficiently high as new
default for now. Note that there are also reports that using very
high hard limits (e.g. 1G) is problematic: some software allocates
large arrays with one element for each potential file descriptor
(Java, …) — a high hard limit thus triggers excessively large memory
allocations in these applications. Hopefully, the new default of 512K
is a good middle ground: higher than what real-life applications
currently need, and low enough to avoid triggering excessively large
allocations in problematic software. (And yes, somebody should fix
Java.)
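Opting in, as described above, means raising the soft limit up to the hard limit at program start. A minimal sketch in Python using the standard resource module follows. The helper name raise_nofile_soft_limit() is an assumption of this example, and a program doing this must not use select() afterwards.

```python
import resource


def raise_nofile_soft_limit():
    """Raise the soft RLIMIT_NOFILE to the hard limit.

    Returns the (soft, hard) pair in effect afterwards. Unlimited
    values are left alone, and a failure to raise the limit is ignored.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    if soft != unlimited and hard != unlimited and soft < hard:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            pass  # keep the previous limits
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

The hard limit is never changed here. Only the soft limit moves, which is the part each program controls for itself.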
* The fs.nr_open and fs.file-max sysctls are now automatically bumped
to the highest possible values, as separate accounting of file
descriptors is no longer necessary, as memcg tracks them correctly as
part of the memory accounting anyway. Thus, from the four limits on
file descriptors currently enforced (fs.file-max, fs.nr_open,
RLIMIT_NOFILE hard, RLIMIT_NOFILE soft) we turn off the first two,
and keep only the latter two. A set of build-time options
(-Dbump-proc-sys-fs-file-max=no and -Dbump-proc-sys-fs-nr-open=no)
has been added to revert this change in behaviour, which might be
an option for systems that turn off memcg in the kernel.
* When no /etc/locale.conf file exists (and hence no locale settings
are in place), systemd will now use the "C.UTF-8" locale by default,
and set LANG= to it. This locale is supported by various
distributions including Fedora, with clear indications that upstream
glibc is going to make it available too. This locale enables UTF-8
mode by default, which appears appropriate for 2018.
* The "net.ipv4.conf.all.rp_filter" sysctl will now be set to 2 by
default. This effectively switches the RFC3704 Reverse Path filtering
from Strict mode to Loose mode. This is more appropriate for hosts
that have multiple links with routes to the same networks (e.g.
a client with a Wi-Fi and Ethernet both connected to the internet).
Consult the kernel documentation for details on this sysctl:
https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
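The value in effect can be read back from /proc. A small sketch in Python, using the value meanings from the kernel documentation linked above (0: no source validation, 1: strict, 2: loose); the function name rp_filter_mode() is made up for this example.

```python
RP_FILTER_MODES = {
    0: "no source validation",
    1: "strict",
    2: "loose",
}


def rp_filter_mode(path="/proc/sys/net/ipv4/conf/all/rp_filter"):
    """Return a human-readable name for the rp_filter value stored in `path`."""
    with open(path) as f:
        value = int(f.read().strip())
    return RP_FILTER_MODES.get(value, "unknown")
```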
* CPUAccounting=yes no longer enables the CPU controller when using
kernel 4.15+ and the unified cgroup hierarchy, as required accounting
statistics are now provided independently from the CPU controller.
* Support for disabling a particular cgroup controller within a sub-tree
has been added through the DisableControllers= directive.
* cgroup_no_v1=all on the kernel command line now also implies
using the unified cgroup hierarchy, unless one explicitly passes
systemd.unified_cgroup_hierarchy=0 on the kernel command line.
* The new "MemoryMin=" unit file property may now be used to set the
memory usage protection limit of processes invoked by the unit. This
controls the cgroupsv2 memory.min attribute. Similarly, the new
"IODeviceLatencyTargetSec=" property has been added, wrapping the new
cgroupsv2 io.latency cgroup property for configuring per-service I/O
latency.
* systemd now supports the cgroupsv2 devices BPF logic, as counterpart
to the cgroupsv1 "devices" cgroup controller.
* systemd-escape now is able to combine --unescape with --template. It
also learnt a new option --instance for extracting and unescaping the
instance part of a unit name.
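What "extracting and unescaping the instance part" means can be sketched in a few lines of Python. This assumes the basic escaping rules from systemd.unit(5), where "-" stands for "/" and "\xNN" stands for the byte with hex value NN. The helper names are hypothetical and the sketch skips the validation the real tool performs.

```python
import re

_HEX_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{2})")


def unit_instance(unit_name):
    """Return the still-escaped instance of 'prefix@instance.suffix'."""
    stem, _, _suffix = unit_name.rpartition(".")
    _prefix, at, instance = stem.partition("@")
    if not at or not instance:
        raise ValueError("not an instantiated unit name: %r" % unit_name)
    return instance


def unescape(escaped):
    """Undo unit name escaping: '-' becomes '/', '\\xNN' becomes a byte."""
    out = bytearray()
    i = 0
    while i < len(escaped):
        m = _HEX_ESCAPE.match(escaped, i)
        if m:
            out.append(int(m.group(1), 16))
            i = m.end()
        elif escaped[i] == "-":
            out += b"/"
            i += 1
        else:
            out += escaped[i].encode("utf-8")
            i += 1
    return out.decode("utf-8", errors="replace")
```

For a unit name such as "getty@tty1.service", unescape(unit_instance(name)) yields the instance string a template would see through its unescaped-instance specifier.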
sd-bus now provides sd_bus_message_readv(), which is similar to
sd_bus_message_read() but takes a va_list object. The pair
sd_bus_set_method_call_timeout() and sd_bus_get_method_call_timeout()
has been added for configuring the default method call timeout to
use. sd_bus_error_move() may be used to efficiently move the contents
from one sd_bus_error structure to another, invalidating the
source. sd_bus_set_close_on_exit() and sd_bus_get_close_on_exit() may
be used to control whether a bus connection object is automatically
flushed when an sd-event loop is exited.
* When processing classic BSD syslog log messages, journald will now
save the original time-stamp string supplied in the new
SYSLOG_TIMESTAMP= journal field. This permits consumers to
reconstruct the original BSD syslog message more correctly.
* StandardOutput=/StandardError= in service files gained support for
new "append:…" parameters, for connecting STDOUT/STDERR of a service
to a file, and appending to it.
* The signal to use as last step of killing of unit processes is now
configurable. Previously it was hard-coded to SIGKILL, which may now
be overridden with the new FinalKillSignal= setting. Note that this is the
signal used when regular termination (i.e. SIGTERM) does not suffice.
Similarly, the signal used when aborting a program in case of a
watchdog timeout may now be configured too (WatchdogSignal=).
* The XDG_SESSION_DESKTOP environment variable may now be configured in
the pam_systemd argument line, using the new desktop= switch. This is
useful to initialize it properly from a display manager without
having to touch C code.
* Most configuration options that previously accepted percentage values
now also accept permille values with the '‰' suffix (instead of '%').
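The two suffixes differ only in the divisor. A minimal illustration in Python follows; it is not systemd's parser, accepts only non-negative integers, and the name parse_ratio() is an assumption of this example.

```python
from fractions import Fraction


def parse_ratio(text):
    """Parse '<integer>%' or '<integer>‰' into an exact fraction."""
    text = text.strip()
    if text.endswith("‰"):
        divisor = 1000
    elif text.endswith("%"):
        divisor = 100
    else:
        raise ValueError("missing '%%' or '‰' suffix: %r" % text)
    number = text[:-1].strip()
    if not number.isdigit():
        raise ValueError("not a non-negative integer: %r" % number)
    return Fraction(int(number), divisor)
```

So "50%" and "500‰" describe the same ratio, while the permille form allows finer steps such as "5‰" for half a percent.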
* systemd-resolved may now optionally use OpenSSL instead of GnuTLS for
DNS-over-TLS.
* systemd-resolved's configuration file resolved.conf gained a new
option ReadEtcHosts= which may be used to turn off processing and
honoring /etc/hosts entries.
* The "--wait" switch may now be passed to "systemctl
is-system-running", in which case the tool will synchronously wait
until the system has finished start-up.
* hostnamed gained a new bus call to determine the DMI product UUID.
* On x86-64 systemd will now prefer using the RDRAND processor
instruction over /dev/urandom whenever it requires randomness that
neither has to be crypto-grade nor should be reproducible. This
should substantially reduce the amount of entropy systemd requests
from the kernel during initialization on such systems, though not
reduce it to zero. (Why not zero? systemd still needs to allocate
UUIDs and such uniquely, which require high-quality randomness.)
* networkd gained support for Foo-Over-UDP, ERSPAN and ISATAP
tunnels. It also gained a new option ForceDHCPv6PDOtherInformation=
for forcing the "Other Information" bit in IPv6 RA messages. The
bonding logic gained four new options AdActorSystemPriority=,
AdUserPortKey=, AdActorSystem= for configuring various 802.3ad
aspects, and DynamicTransmitLoadBalancing= for enabling dynamic
shuffling of flows. The tunnel logic gained a new
IPv6RapidDeploymentPrefix= option for configuring IPv6 Rapid
Deployment. The policy rule logic gained four new options IPProtocol=,
SourcePort=, DestinationPort= and InvertRule=. The bridge logic gained
support for the MulticastToUnicast= option. networkd also gained
support for configuring static IPv4 ARP or IPv6 neighbor entries.
* .preset files (as read by 'systemctl preset') may now be used to
instantiate services.
* /etc/crypttab now understands the sector-size= option to configure
the sector size for an encrypted partition.
* Key material for encrypted disks may now be placed on a formatted
medium, and referenced from /etc/crypttab by the UUID of the file
system, followed by "=" suffixed by the path to the key file.
* The "collect" udev component has been removed without replacement, as
it is neither used nor maintained.
* When the RuntimeDirectory=, StateDirectory=, CacheDirectory=,
LogsDirectory=, ConfigurationDirectory= settings are used in a
service the executed processes will now receive a set of environment
variables containing the full paths of these directories.
Specifically, RUNTIME_DIRECTORY, STATE_DIRECTORY, CACHE_DIRECTORY,
LOGS_DIRECTORY, CONFIGURATION_DIRECTORY are now set if these options
are used. Note that these options may be used multiple times per
service in which case the resulting paths will be concatenated and
separated by colons.
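A service can pick these variables up without hard-coding any paths. A short sketch in Python; the helper name service_directories() is made up for this example, and the colon splitting follows the description above.

```python
import os


def service_directories(variable, environ=os.environ):
    """Return the list of paths passed in e.g. STATE_DIRECTORY.

    Returns an empty list if the variable is unset or empty, which is
    the case when the service is started outside of systemd.
    """
    value = environ.get(variable, "")
    return [path for path in value.split(":") if path]
```

For a unit with "StateDirectory=foo bar" the service would typically call service_directories("STATE_DIRECTORY") and receive two entries.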
* Predictable interface naming has been extended to cover InfiniBand
NICs. They will be exposed with an "ib" prefix.
* tmpfiles.d/ line types may now be suffixed with a '-' character, in
which case the respective line failing is ignored.
* .link files may now be used to configure the equivalent to the
"ethtool advertise" commands.
* The sd-device.h and sd-hwdb.h APIs are now exported, as an
alternative to libudev.h. Previously, the latter was just an internal
wrapper around the former, but now these two APIs are exposed
directly.
* sd-id128.h gained a new function sd_id128_get_boot_app_specific()
which calculates an app-specific boot ID similar to how
sd_id128_get_machine_app_specific() generates an app-specific machine
ID.
* A new tool systemd-id128 has been added that can be used to determine
and generate various 128bit IDs.
* /etc/os-release gained two new standardized fields DOCUMENTATION_URL=
and LOGO=.
* systemd-hibernate-resume-generator will now honor the "noresume"
kernel command line option, in which case it will bypass resuming
from any hibernated image.
* The systemd-sleep.conf configuration file gained new options
AllowSuspend=, AllowHibernation=, AllowSuspendThenHibernate=,
AllowHybridSleep= for prohibiting specific sleep modes even if the
kernel exports them.
* portablectl is now officially supported and has thus moved to
/usr/bin/.
* bootctl learnt the two new commands "set-default" and "set-oneshot"
for setting the default boot loader item to boot to (either
persistently or only for the next boot). This is currently only
compatible with sd-boot, but may also be implemented by other boot
loaders that follow the boot loader interface. The updated interface is
now documented here:
https://systemd.io/BOOT_LOADER_INTERFACE
* A new kernel command line option systemd.early_core_pattern= is now
understood which may be used to influence the core_pattern PID 1
installs during early boot.
* busctl learnt two new options -j and --json= for outputting method
call replies, properties and monitoring output in JSON.
* journalctl's JSON output now supports simple ANSI coloring as well as
a new "json-seq" mode for generating RFC7464 output.
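In RFC7464 output every JSON text is preceded by an ASCII Record Separator (0x1E) and followed by a line feed. A minimal consumer in Python might look like the sketch below. The function name parse_json_seq() is made up for this example, and unlike a robust RFC7464 parser it does not tolerate truncated records.

```python
import json

RS = b"\x1e"  # ASCII Record Separator, starts each record in RFC7464


def parse_json_seq(data):
    """Split an RFC7464 byte stream into a list of decoded JSON values."""
    records = []
    for chunk in data.split(RS):
        chunk = chunk.strip()
        if not chunk:
            continue  # nothing before the first separator
        records.append(json.loads(chunk))
    return records
```

The input would typically be the captured output of journalctl run with the "json-seq" output mode.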
* Unit files now support the %g/%G specifiers that resolve to the UNIX
group/GID the service manager runs as, similar to the existing
%u/%U specifiers that resolve to the UNIX user/UID.
* systemd-logind learnt a new global configuration option
UserStopDelaySec= that may be set in logind.conf. It specifies how
long the systemd --user instance shall remain started after a user
logs out. This is useful to speed up repetitive re-connections of the
same user, as it means the user's service manager doesn't have to be
stopped/restarted on each iteration, but can be reused between
subsequent logins. This setting defaults to 10s. systemd-logind also
exports two new properties on its Manager D-Bus objects indicating
whether the system's lid is currently closed, and whether the system
is on AC power.
* systemd gained support for a generic boot counting logic, which
generically permits automatic reverting to older boot loader entries
if newer updated ones don't work. The boot loader side is implemented
in sd-boot, but is kept open for other boot loaders too. For details
see:
https://systemd.io/AUTOMATIC_BOOT_ASSESSMENT
* The SuccessAction=/FailureAction= unit file settings now learnt two
new parameters: "exit" and "exit-force", which result in immediate
exiting of the service manager, and are only useful in systemd --user
and container environments.
* Unit files gained support for a pair of options
FailureActionExitStatus=/SuccessActionExitStatus= for configuring the
exit status to use as service manager exit status when
SuccessAction=/FailureAction= is set to exit or exit-force.
* A pair of LogRateLimitIntervalSec=/LogRateLimitBurst= per-service
options may now be used to configure the log rate limiting applied by
journald per-service.
* systemd-analyze gained a new verb "timespan" for parsing and
normalizing time span values (i.e. strings like "5min 7s 8us").
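To illustrate what normalizing such a string involves, here is a small Python sketch that converts a time span to microseconds. It understands only a handful of unit suffixes (an assumption of this example; the real parser accepts many more spellings), and parse_timespan_usec() is a hypothetical name.

```python
import re

# Microseconds per unit. Only a small, assumed subset of suffixes.
UNIT_USEC = {
    "us": 1,
    "ms": 1_000,
    "s": 1_000_000,
    "min": 60_000_000,
    "h": 3_600_000_000,
    "d": 86_400_000_000,
}

_TOKEN = re.compile(r"\s*(\d+)\s*([a-z]+)")


def parse_timespan_usec(text):
    """Sum all '<number><unit>' components of `text` in microseconds."""
    text = text.strip()
    if not text:
        raise ValueError("empty time span")
    total = 0
    pos = 0
    while pos < len(text):
        m = _TOKEN.match(text, pos)
        if not m or m.group(2) not in UNIT_USEC:
            raise ValueError("cannot parse time span: %r" % text)
        total += int(m.group(1)) * UNIT_USEC[m.group(2)]
        pos = m.end()
    return total
```

For the string quoted above, "5min 7s 8us", the components add up to 300000000 + 7000000 + 8 microseconds.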
* systemd-analyze also gained a new verb "security" for analyzing the
security and sand-boxing settings of services in order to determine an
"exposure level" for them, indicating whether a service would benefit
from more sand-boxing options being turned on for it.
* "systemd-analyze syscall-filter" will now also show system calls
supported by the local kernel but not included in any of the defined
groups.
* .nspawn files now understand the Ephemeral= setting, matching the
--ephemeral command line switch.
* sd-event gained the new APIs sd_event_source_get_floating() and
sd_event_source_set_floating() for controlling whether a specific
event source is "floating", i.e. destroyed along with the event loop
object itself.
* Unit objects on D-Bus gained a new "Refs" property that lists all
clients that currently have a reference on the unit (to ensure it is
not unloaded).
* The JoinControllers= option in system.conf is no longer supported, as
it didn't work correctly, is hard to support properly, is legacy (as
the concept only exists on cgroupsv1) and apparently wasn't used.
* Journal messages that are generated whenever a unit enters the failed
state are now tagged with a unique MESSAGE_ID. Similarly, messages
generated whenever a service process exits are now made recognizable,
too. A tagged message is also emitted whenever a unit enters the
"dead" state on success.
* systemd-run gained a new switch --working-directory= for configuring
the working directory of the service to start. The --same-dir switch
(or just -d) is a shortcut that sets the working directory of the
service to the current working directory of the invoking program.
The new --shell
(or just -S) option has been added for invoking the $SHELL of the
caller as a service, and implies --pty --same-dir --wait --collect
--service-type=exec. Or in other words, "systemd-run -S" is now the
quickest way to get an interactive shell in a fully clean and
well-defined system service context.
* machinectl gained a new verb "import-fs" for importing an OS tree
from a directory. Moreover, when a directory or tarball is imported
and a single top-level directory is found with the OS itself below it,
the OS tree is automatically mangled and moved one level up.
* systemd-importd will no longer set up an implicit btrfs loop-back
file system on /var/lib/machines. If one is already set up, it will
continue to be used.
* A new generator "systemd-run-generator" has been added. It will
synthesize a unit from one or more program command lines included in
the kernel command line. This is very useful in container managers
for example:
# systemd-nspawn -i someimage.raw -b systemd.run='"some command line"'
This will run "systemd-nspawn" on an image, invoke the specified
command line and immediately shut down the container again, returning
the command line's exit code.
* The block device locking logic is now documented:
https://systemd.io/BLOCK_DEVICE_LOCKING
* loginctl and machinectl now optionally output the various tables in
JSON using the --output= switch. It is our intention to add similar
support to systemctl and all other commands.
udevadm's query and trigger verbs now optionally take a .device unit
name as argument.
* systemd-udevd's network naming logic now understands a new
net.naming-scheme= kernel command line switch, which may be used to
pick a specific version of the naming scheme. This helps stabilize
interface names even as systemd/udev are updated and the naming logic
is improved.
* sd-id128.h learnt two new auxiliary helpers: sd_id128_is_allf() and
SD_ID128_ALLF to test if a 128bit ID is set to all 0xFF bytes, and to
initialize one to all 0xFF.
* After loading the SELinux policy systemd will now recursively relabel
all files and directories listed in
/run/systemd/relabel-extra.d/*.relabel (which should be simple
newline separated lists of paths) in addition to the ones it already
implicitly relabels in /run, /dev and /sys. After the relabelling is
completed the *.relabel files (and /run/systemd/relabel-extra.d/) are
removed. This is useful to permit initrds (i.e. code running before
the SELinux policy is in effect) to generate files in the host
filesystem safely and ensure that the correct label is applied during
the transition to the host OS.
* KERNEL API BREAKAGE: Linux kernel 4.18 changed behaviour regarding
mknod() handling in user namespaces. Previously mknod() would always
fail with EPERM in user namespaces. Since 4.18 mknod() will succeed
but device nodes generated that way cannot be opened, and attempts to
open them result in EPERM. This breaks the "graceful fallback" logic
in systemd's PrivateDevices= sand-boxing option. This option is
implemented defensively, so that when systemd detects it runs in a
restricted environment (such as a user namespace, or an environment
where mknod() is blocked through seccomp or absence of CAP_MKNOD)
where device nodes cannot be created the effect of PrivateDevices= is
bypassed (following the logic that 2nd-level sand-boxing is not
essential if the system systemd runs in is itself already sand-boxed
as a whole). This logic breaks with 4.18 in container managers where
user namespacing is used: suddenly PrivateDevices= succeeds setting
up a private /dev/ file system containing device nodes, but when
these are opened they don't work.
At this point it is recommended that container managers utilizing
user namespaces that intend to run systemd in the payload explicitly
block mknod() with seccomp or similar, so that the graceful fallback
logic works again.
We are very sorry for the breakage and the requirement to change
container configurations for newer kernels. It's purely caused by an
incompatible kernel change. The relevant kernel developers have been
notified about this userspace breakage quickly, but they chose to
ignore it.
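The changed kernel behaviour can be observed with a small probe that creates a character device node and then tries to open it. The Python sketch below is only an illustration and not the check systemd performs; probe_mknod() and its three result strings are made up for this example. Note that a temporary directory on a file system mounted with "nodev" also yields the "created-but-unusable" result, independently of user namespaces.

```python
import os
import stat
import tempfile


def probe_mknod():
    """Try to create and open a copy of /dev/null (char device 1:3).

    Returns one of:
      "mknod-failed"          mknod() itself was refused
      "created-but-unusable"  the node was created but cannot be opened
      "usable"                the node was created and opened
    """
    with tempfile.TemporaryDirectory() as tmp:
        node = os.path.join(tmp, "null")
        try:
            os.mknod(node, stat.S_IFCHR | 0o600, os.makedev(1, 3))
        except OSError:
            return "mknod-failed"
        try:
            fd = os.open(node, os.O_RDONLY | os.O_CLOEXEC)
        except OSError:
            return "created-but-unusable"
        os.close(fd)
        return "usable"
```

The graceful fallback described above relies on the first outcome. The second outcome corresponds to the situation reported for kernel 4.18 in user namespaces.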
Contributions from: afg, Alan Jenkins, Aleksei Timofeyev, Alexander
Filippov, Alexander Kurtz, Alexey Bogdanenko, Andreas Henriksson,
Andrew Jorgensen, Anita Zhang, apnix-uk, Arkan49, Arseny Maslennikov,
asavah, Asbjørn Apeland, aszlig, Bastien Nocera, Ben Boeckel, Benedikt
Morbach, Benjamin Berg, Bruce Zhang, Carlo Caione, Cedric Viou, Chen
Qi, Chris Chiu, Chris Down, Chris Morin, Christian Rebischke, Claudius
Ellsel, Colin Guthrie, dana, Daniel, Daniele Medri, Daniel Kahn
Gillmor, Daniel Rusek, Daniel van Vugt, Dariusz Gadomski, Dave Reisner,
David Anderson, Davide Cavalca, David Leeds, David Malcolm, David
Strauss, David Tardon, Dimitri John Ledkov, Dmitry Torokhov, dj-kaktus,
Dongsu Park, Elias Probst, Emil Soleyman, Erik Kooistra, Ervin Peters,
Evgeni Golov, Evgeny Vereshchagin, Fabrice Fontaine, Faheel Ahmad,
Faizal Luthfi, Felix Yan, Filipe Brandenburger, Franck Bui, Frank
Schaefer, Frantisek Sumsal, Gautier Husson, Gianluca Boiano, Giuseppe
Scrivano, glitsj16, Hans de Goede, Harald Hoyer, Harry Mallon, Harshit
Jain, Helmut Grohne, Henry Tung, Hui Yiqun, imayoda, Insun Pyo, Iwan
Timmer, Jan Janssen, Jan Pokorný, Jan Synacek, Jason A. Donenfeld,
javitoom, Jérémy Nouhaud, Jeremy Su, Jiuyang Liu, João Paulo Rechi
Vita, Joe Hershberger, Joe Rayhawk, Joerg Behrmann, Joerg Steffens,
Jonas Dorel, Jon Ringle, Josh Soref, Julian Andres Klode, Jun Bo Bi,
Jürg Billeter, Keith Busch, Khem Raj, Kirill Marinushkin, Larry
Bernstone, Lennart Poettering, Lion Yang, Li Song, Lorenz
Hübschle-Schneider, Lubomir Rintel, Lucas Werkmeister, Ludwin Janvier,
Lukáš Nykrýn, Luke Shumaker, mal, Marc-Antoine Perennou, Marcin
Skarbek, Marco Trevisan (Treviño), Marian Cepok, Mario Hros, Marko
Myllynen, Markus Grimm, Martin Pitt, Martin Sobotka, Martin Wilck,
Mathieu Trudel-Lapierre, Matthew Leeds, Michael Biebl, Michael Olbrich,
Michael 'pbone' Pobega, Michael Scherer, Michal Koutný, Michal
Sekletar, Michal Soltys, Mike Gilbert, Mike Palmer, Muhammet Kara, Neal
Gompa, Neil Brown, Network Silence, Niklas Tibbling, Nikolas Nyby,
Nogisaka Sadata, Oliver Smith, Patrik Flykt, Pavel Hrdina, Paweł
Szewczyk, Peter Hutterer, Piotr Drąg, Ray Strode, Reinhold Mueller,
Renaud Métrich, Roman Gushchin, Ronny Chevalier, Rubén Suárez Alvarez,
Ruixin Bao, RussianNeuroMancer, Ryutaroh Matsumoto, Saleem Rashid, Sam
Morris, Samuel Morris, Sandy Carter, scootergrisen, Sébastien Bacher,
Sergey Ptashnick, Shawn Landden, Shengyao Xue, Shih-Yuan Lee
(FourDollars), Silvio Knizek, Sjoerd Simons, Stasiek Michalski, Stephen
Gallagher, Steven Allen, Steve Ramage, Susant Sahani, Sven Joachim,
Sylvain Plantefève, Tanu Kaskinen, Tejun Heo, Thiago Macieira, Thomas
Blume, Thomas Haller, Thomas H. P. Andersen, Tim Ruffing, TJ, Tobias
Jungel, Todd Walton, Tommi Rantala, Tomsod M, Tony Novak, Tore
Anderson, Trevonn, Victor Laskurain, Victor Tapia, Violet Halo, Vojtech
Trefny, welaq, William A. Kennington III, William Douglas, Wyatt Ward,
Xiang Fan, Xi Ruoyao, Xuanwo, Yann E. Morin, YmrDtnJu, Yu Watanabe,
Zbigniew Jędrzejewski-Szmek, Zhang Xianwei, Zsolt Dollenstein
— Warsaw, 2018-12-21
Enjoy!
Zbyszek