[systemd-devel] Confining ALL processes to a CPUs/RAM via cpuset controller

Daniel P. Berrange berrange at redhat.com
Wed Jul 20 11:53:51 UTC 2016


For virtualized hosts it is quite common to want to confine all host OS
processes to a subset of CPUs/RAM nodes, leaving the rest available for
exclusive use by QEMU/KVM.  Historically people have used the "isolcpus"
kernel arg to do this, but last year its semantics changed: any CPUs
listed there are now also excluded from load balancing by the scheduler,
making it quite useless in general non-real-time use cases where you
still want QEMU threads load-balanced across CPUs.
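For reference, the isolcpus approach being ruled out here is just a kernel
command line argument (the CPU list shown is illustrative, matching the
example host further down):

```shell
# Illustrative only: appended to the kernel command line, e.g. via
# GRUB_CMDLINE_LINUX in /etc/default/grub. With the changed semantics,
# this excludes CPUs 3-11 from scheduler load balancing as well,
# which is exactly what makes it unsuitable here.
GRUB_CMDLINE_LINUX="quiet isolcpus=3-11"
```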

So the only option is to use the cpuset cgroup controller to confine
processes. AFAIK, systemd does not have explicit support for the cpuset
controller at this time, so I'm trying to work out the "optimal" way to
achieve this behind systemd's back while minimising the risk that future
systemd releases will break things.

As an example, I have a host with 3 NUMA nodes and 12 CPUs, and want to
have all non-QEMU processes running on CPUs 0-2, leaving 3-11 available
for QEMU machines.

So far my best solution looks like this:

$ cat /etc/systemd/system/cpuset.service
[Unit]
Description=Restrict CPU placement
DefaultDependencies=no
Before=sysinit.target slices.target basic.target lvm2-lvmetad.service systemd-journald.service systemd-udevd.service

[Service]
Type=oneshot
KillMode=none
RemainAfterExit=yes
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuset/system.slice
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuset/machine.slice
ExecStartPre=/bin/bash -c '/usr/bin/echo "0-2" > /sys/fs/cgroup/cpuset/system.slice/cpuset.cpus'
ExecStartPre=/bin/bash -c '/usr/bin/echo "0" > /sys/fs/cgroup/cpuset/system.slice/cpuset.mems'
ExecStartPre=/bin/bash -c '/usr/bin/echo "3-11" > /sys/fs/cgroup/cpuset/machine.slice/cpuset.cpus'
ExecStartPre=/bin/bash -c '/usr/bin/echo "0-2" > /sys/fs/cgroup/cpuset/machine.slice/cpuset.mems'
ExecStartPost=/bin/bash -c '/usr/bin/echo 1 > /sys/fs/cgroup/cpuset/system.slice/tasks'
ExecStopPost=/usr/bin/rmdir /sys/fs/cgroup/cpuset/system.slice
ExecStart=/bin/true

[Install]
WantedBy=multi-user.target

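As an aside, the cpuset.cpus and cpuset.mems files take comma-separated
range lists ("0-2", "3-11", "0-2,5", ...). A tiny bash helper (the name
expand_cpuset is mine, purely illustrative) to expand such a string can
be handy when sanity-checking what the unit actually wrote:

```shell
# Illustrative helper: expand a cpuset range string such as "0-2,5"
# into individual CPU numbers, e.g. for cross-checking against lscpu.
expand_cpuset() {
    local part out=""
    local IFS=','
    for part in $1; do
        if [[ "$part" == *-* ]]; then
            # "a-b" range: expand with seq
            out="$out $(seq -s ' ' "${part%-*}" "${part#*-}")"
        else
            # single CPU number
            out="$out $part"
        fi
    done
    echo "${out# }"
}

expand_cpuset "0-2,5"   # prints: 0 1 2 5
```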

The key factor here is use of "Before" to ensure this gets run immediately
after systemd switches root out of the initrd, and before /any/ long lived
services are run. This lets us set cpuset placement on systemd (pid 1)
itself and have that inherited by everything it spawns. I felt this is
better than trying to move processes after they have already started,
because it ensures that any memory allocations get taken from the right
NUMA node immediately.
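Once booted, it's worth verifying that pid 1 really did land in the
restricted cpuset, since everything else inherits from it. A rough sketch
of such a check (the function name and the overridable CPUSET_ROOT
variable are mine, only there to make the check easy to exercise outside
a real host):

```shell
# Illustrative post-boot sanity check: confirm pid 1 appears in the
# system.slice cpuset tasks file, so all services inherit the placement.
# CPUSET_ROOT defaults to the real mount point; it is overridable purely
# for testing this snippet.
CPUSET_ROOT="${CPUSET_ROOT:-/sys/fs/cgroup/cpuset}"

check_pid1_confined() {
    local slice="$CPUSET_ROOT/system.slice"
    if grep -qx 1 "$slice/tasks" 2>/dev/null; then
        echo "OK: pid 1 confined to CPUs $(cat "$slice/cpuset.cpus")"
    else
        echo "WARNING: pid 1 is not in $slice" >&2
        return 1
    fi
}

# On a live host, simply run: check_pid1_confined
```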

Empirically this approach seems to work on Fedora 23 (systemd 222) and
RHEL 7 (systemd 219), but I'm wondering if there are any pitfalls that
I've not anticipated here.

Conceptually I'm aiming for "Before=*" to say it should run before
everything, but explicitly listing this set of units appears to be the
best I can do.

Any thoughts / feedback / suggestions welcome on how to improve this.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|
