[systemd-devel] machined: after CPU offline then online, vcpupin KVM guest failed to start

Daniel P. Berrange berrange at redhat.com
Fri Aug 5 10:43:56 UTC 2016


On Fri, Aug 05, 2016 at 12:33:21PM +0200, Dr. Werner Fink wrote:
> On Fri, Aug 05, 2016 at 11:07:50AM +0200, Lennart Poettering wrote:
> > On Thu, 04.08.16 16:19, Cedric Bosdonnat (cbosdonnat at suse.com) wrote:
> > 
> > > Hi Lennart and Werner,
> > > 
> > > On Wed, 2016-08-03 at 16:56 +0200, Lennart Poettering wrote:
> > > > On Wed, 03.08.16 14:46, Dr. Werner Fink (werner at suse.de) wrote:
> > > > > problem with v228 (and I guess this is also later AFAICS from logs of
> > > > > current git) that repeating CPU hotplug events (offline/online). The
> > > > > root cause is that cpuset.cpus is not restored by machined.
> > > > > Please note that libvirt cannot do this as it is not allowed to do
> > > > > so.
> > > > 
> > > > This is a limitation of the kernel cpuset interface, and it's one of
> > > > the reasons we do not expose cpusets at all in systemd right
> > > > now. Thankfully, there's an alternative to cpusets, which is the CPU
> > > > affinity controls exposed via CPUAffinity= in systemd, which do much
> > > > of the same, but have less borked semantics.
> > > > 
> > > > We'd like to support cpusets directly in systemd, but we don't do this
> > > > as long as the kernel interfaces are as borked as they are. For
> > > > example, cpusets are flushed out entirely currently when the system
> > > > goes through a suspend/resume cycle.
> > > > 
> > > > If libvirt has hook-ups with cpuset, then it bypasses systemd for
> > > > that.
> > > 
> > > I guess by CPU affinity you mean sched_setaffinity and friends. If that is
> > > the case, then this is constrained by cpuset too as mentioned here:
> > > 
> > > http://www.mjmwired.net/kernel/Documentation/cpusets.txt#53
> > > 
> > > As long as the machine.slice cpuset isn't restored after onlining a CPU again,
> > > then libvirt won't be able to set either the affinity or the cpuset if it
> > > contains that CPU.
> > > 
> > > Maybe the kernel's behaviour is weird and can be discussed, but libvirt can't
> > > do anything about that bug.
> > 
> > Yeah, to make this clear: I do not blame libvirt for this borkedness
> > at all. I blame the kernel.
> 
> Hmmm ... IMHO it is useless to pass the buck from kernel to user space
> and then from user space back to the kernel.  I have an open bug
> from a customer and this bug requires a solution.  AFAICS libvirt cannot
> do this but machined could.

It is not simply a problem with respect to virtual machines; it affects any
application which is using the cpuset controller. VMs are just one such user.
So it would be inappropriate to do it in machined.

Fixing it in userspace is complicated by the fact that different levels or
branches in the cgroup hierarchy are managed by different applications, with
no single application having a complete view. Even if systemd itself did
have support for the cpuset controller, it would still not have a global
view of all cgroups, as applications can create further child cgroups
below the groups managed by systemd, which systemd doesn't track.

Trying to restore correct CPU affinity after hotplug would thus require that
multiple userspace applications all be aware of the problem and contain
logic to fix their part of the hierarchy. This is further complicated by
the ordering constraints that would require top levels to be fixed before
child levels.
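
To illustrate the ordering constraint, here is a rough sketch in Python. It
is only an in-memory model, not code that touches the real cgroup filesystem,
and the function names (restore_order, restore) and the cgroup paths in the
usage note are hypothetical. It assumes the rule that a child cpuset may only
contain CPUs its parent also has, so parents must be restored first.

```python
def restore_order(desired):
    """Return cgroup paths sorted so every parent precedes its children."""
    # Sort by depth first (number of path separators), then by name.
    return sorted(desired, key=lambda p: (p.strip("/").count("/"), p))


def restore(current, desired):
    """Simulate restoring cpuset.cpus top-down in an in-memory model.

    current and desired both map a cgroup path to a set of CPU numbers.
    A write is rejected if the child asks for CPUs its parent lacks.
    """
    for path in restore_order(desired):
        parent = path.rsplit("/", 1)[0]
        allowed = current.get(parent)  # None for top-level groups in this model
        if allowed is not None and not desired[path] <= allowed:
            raise ValueError("%s wants CPUs its parent does not have" % path)
        current[path] = set(desired[path])
    return current
```

For example, with desired = {"/machine.slice": {0, 1, 2, 3},
"/machine.slice/vm1.scope": {2, 3}}, restoring only the vm1.scope entry while
machine.slice still holds {0, 1, 2} fails, whereas restoring both in the
order returned by restore_order() succeeds. In reality each level may belong
to a different application, which is exactly where the coordination problem
comes from.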

Bearing all this in mind, fixing it in userspace is an incredibly hard
problem which will always be liable to race conditions between applications.

The only choices that are practical are a) not use the cpuset controller
at all, or b) fix the kernel so that it maintains 2 distinct bitmaps,
one for the set of online CPUs, and one for the configured affinity in the
cpuset, and thus avoid throwing away data on CPU unplug/plug.
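
As a toy model of option b), again in Python and purely illustrative (the
class and method names are made up, and this is not how any kernel code is
actually written): the configured mask is never modified by hotplug, and the
mask actually used is computed as the intersection with the online set.

```python
class CpusetModel:
    """Toy model: keep the configured and online CPU sets separate."""

    def __init__(self, configured, online):
        self.configured = set(configured)  # what was requested; hotplug never edits it
        self.online = set(online)          # CPUs currently online

    def effective(self):
        # The CPUs tasks may actually run on right now.
        return self.configured & self.online

    def offline(self, cpu):
        self.online.discard(cpu)

    def bring_online(self, cpu):
        self.online.add(cpu)
```

With configured = {2, 3} and CPUs 0-3 online, taking CPU 3 offline leaves an
effective set of {2}, and bringing it back online yields {2, 3} again with
no userspace involvement, because the configured set was never thrown away.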

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

