[systemd-devel] Confining ALL processes to a subset of CPUs/RAM via the cpuset controller

Daniel P. Berrange berrange at redhat.com
Wed Jul 20 13:49:14 UTC 2016


On Wed, Jul 20, 2016 at 03:29:30PM +0200, Lennart Poettering wrote:
> On Wed, 20.07.16 12:53, Daniel P. Berrange (berrange at redhat.com) wrote:
> 
> > For virtualized hosts it is quite common to want to confine all host OS
> > processes to a subset of CPUs/RAM nodes, leaving the rest available for
> > exclusive use by QEMU/KVM.  Historically people have used the "isolcpus"
> > kernel arg to do this, but last year its semantics changed, so that any
> > CPUs listed there are also excluded from load balancing by the
> > scheduler, making it quite useless in general non-real-time use cases
> > where you still want QEMU threads load-balanced across CPUs.
> > 
> > So the only option is to use the cpuset cgroup controller to confine
> > processes. AFAIK, systemd does not have explicit support for the cpuset
> > controller at this time, so I'm trying to work out the "optimal" way to
> > achieve this behind systemd's back while minimising the risk that future
> > systemd releases will break things.
> 
> Yes, we don't support this as of now, but we'd like to. The thing
> though is that the kernel interface for it is pretty borked as it is
> right now, and until that's fixed we are unlikely to support this in
> systemd. (And as I understood Tejun, the mem vs. cpu thing in cpuset
> is probably not going to stay the way it is either.)
> 
> But note that the non-cgroup CPUAffinity= setting should be good
> enough for many use cases. Are you sure that isn't sufficient for you?
> 
> Also note that systemd supports setting a system-wide CPUAffinity= for
> itself during early boot, thus leaving all unlisted CPUs free for
> specific services where you use CPUAffinity= to change this default.

Ah, interesting, I didn't notice you could set that globally.
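
For the archives, a minimal sketch of how I understand that would look
(the CPU numbers and the drop-in path are purely illustrative):

  # /etc/systemd/system.conf - applies to PID 1 during early boot,
  # and is inherited by everything it spawns
  [Manager]
  CPUAffinity=0 1

  # Per-service drop-in to give one service the remaining CPUs,
  # e.g. /etc/systemd/system/my-qemu.service.d/affinity.conf
  [Service]
  CPUAffinity=2 3 4 5 6 7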


> > The key factor here is the use of "Before" to ensure this gets run
> > immediately after systemd switches root out of the initrd, and before
> > /any/ long-lived services are run. This lets us set cpuset placement on
> > systemd (pid 1) itself and have that inherited by everything it spawns.
> > I felt this was better than trying to move processes after they have
> > already started, because it ensures that any memory allocations are
> > taken from the right NUMA node immediately.
> >
> > Empirically this approach seems to work on Fedora 23 (systemd 222) and
> > RHEL 7 (systemd 219), but I'm wondering if there are any pitfalls that
> > I've not anticipated here.
> 
> Yes, PID 1 was moved to the special scope unit init.scope as mentioned
> above (in preparation for cgroupsv2, where inner cgroups can never
> contain PIDs). Your approach is hence likely to break.

cgroupsv2 is likely to break many things once distros switch over, so
I assume that switch wouldn't be made in a minor update, only in a
major new distro release - so that's not too concerning.
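
FWIW, the approach I described is roughly along the lines of the
sketch below (the unit and script names are mine, and the CPU/memory
node numbers are only examples):

  # /etc/systemd/system/cpuset-confine.service
  [Unit]
  Description=Confine PID 1 (and thus all children) to a host cpuset
  DefaultDependencies=no
  Before=sysinit.target
  ConditionPathIsDirectory=/sys/fs/cgroup/cpuset

  [Service]
  Type=oneshot
  RemainAfterExit=yes
  ExecStart=/usr/local/sbin/cpuset-confine

  [Install]
  WantedBy=sysinit.target

  #!/bin/sh
  # /usr/local/sbin/cpuset-confine: create a cpuset for host OS
  # processes and move PID 1 into it, so every service systemd spawns
  # afterwards inherits the CPU and memory node restriction. Both
  # cpuset.cpus and cpuset.mems must be set before tasks can be added.
  set -e
  cs=/sys/fs/cgroup/cpuset/host
  mkdir -p $cs
  echo 0-1 > $cs/cpuset.cpus    # CPUs reserved for the host OS
  echo 0   > $cs/cpuset.mems    # NUMA node reserved for the host OS
  echo 1   > $cs/tasks          # move systemd (PID 1) into the cpuset

And yes, writing PID 1 into a sub-cgroup behind systemd's back is
exactly the part that will stop working once it lives in init.scope.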

> But again, I have the suspicion that CPUAffinity= might already
> suffice for you?

Yep, it looks like it should suffice for most people, unless they also
wish to have memory node restrictions enforced from boot.
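
For anyone testing either variant, it's easy to verify what actually
took effect: Cpus_allowed_list in /proc/<pid>/status reflects the
affinity mask that CPUAffinity= sets, while Mems_allowed_list only
gets narrowed by a cpuset:

  grep -E 'Cpus_allowed_list|Mems_allowed_list' /proc/1/status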

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

