[systemd-devel] Possible race condition for setting cgroup sticky bit

Mon Apr 8 07:57:38 PDT 2013

> > > > I'm seeing a problem with a service sometimes failing to start due to a
> > > > missing cgroup.
> > > > After some debugging I've made the following observations:
> > > >
> > > > After exec_spawn() forks, the child will set the sticky bit for the
> > > > cgroup (in cg_set_task_access) but sometimes, the cgroup is missing
> > > > (lstat returns "No such file or directory").
> > > >
> > > > The cgroup is always created, but the main process will call cg_trim
> > > > (from cgroup_bonding_trim <- cgroup_bonding_trim_list <-
> > > > cgroup_notify_empty <- private_bus_message_filter ...) which will
> > > > remove the cgroup if the sticky bit isn't set.
> > >
> > > Hmm, cg_trim() will ignore groups with the sticky bit set, and the
> > > kernel won't allow us to remove groups that currently contain a
> > > process.
> >
> > I've dumped data from cg_trim and the sticky bit is not set when this
> > occurs. In fact, the state of the sticky bit as seen by cg_trim seems
> > to be the major difference between a proper boot or a broken one.
> 
> Well, but as long as there is a process in the group the kernel should
> already refuse deletion of the group. The sticky bit is hence useful
> only for *empty* cgroups, which is what I don't grok here... In your
> case the child should have created the group and made itself a member of
> it immediately (with a tiny window in between where the group could be
> removed, but this should result in immediate total failure of the
> forking, not just a missing cgroup).

I've never seen the fork fail. The error message displayed is always: "Failed at step CGROUP spawning /etc/init.d/rc: No such file or directory", which comes from the failure in cg_set_task_access.
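
To make the suspected interleaving concrete, here is a minimal Python simulation on a scratch directory. It is only a sketch and not systemd code: trim(), set_sticky() and spawn() are hypothetical stand-ins for cg_trim, the sticky-bit step in cg_set_task_access, and the child side of exec_spawn(). It assumes that a group is "empty" when its directory has no entries, which models only a group that has no member process at the moment the trim runs.

```python
import os
import stat
import tempfile


def trim(root):
    """Stand-in for cg_trim: remove empty child dirs that lack the sticky bit."""
    removed = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        st = os.lstat(path)
        is_dir = stat.S_ISDIR(st.st_mode)
        sticky = bool(st.st_mode & stat.S_ISVTX)
        if is_dir and not sticky and not os.listdir(path):
            os.rmdir(path)
            removed.append(name)
    return removed


def set_sticky(path):
    """Stand-in for the sticky-bit step: lstat the group, then chmod it."""
    st = os.lstat(path)  # raises FileNotFoundError if the group is gone
    os.chmod(path, stat.S_IMODE(st.st_mode) | stat.S_ISVTX)


def spawn(root, name, trim_in_between):
    """Create a group, optionally let the 'main process' trim, then mark it."""
    group = os.path.join(root, name)
    os.mkdir(group)
    if trim_in_between:
        trim(root)  # the main process wins the race
    try:
        set_sticky(group)
        return "ok"
    except FileNotFoundError as e:
        return "failed: " + e.strerror


if __name__ == "__main__":
    scratch = tempfile.mkdtemp()
    print("trim runs first:", spawn(scratch, "legacy_rc3.service", True))
    print("no trim:        ", spawn(scratch, "legacy_rcS.service", False))
    print("later trim removes:", trim(scratch))
```

In this model the failure only shows up when the trim lands between creating the group and setting the sticky bit, and a group that already has the bit set survives a later trim.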

> > > The code dealing with forked off service processes in execute.c looks
> > > like this: after forking, we first create a group, then add us to it,
> > > and then set the sticky bit for it. Now, there's a tiny window of
> > > opportunity there (and we should fix it...) between creating a group
> > > and adding us into it, where cg_trim from PID 1 could run. But
> > > normally, if that fails then the execution of the service
> > > should be aborted right away. But that's not what you are seeing?
> > >
> > > I will now add some code which avoids the race I pointed out, but I am
> > > not sure that's the same one that you are actually encountering...
> >
> > The cgroup that fails is named after the service. But the service is
> > configured to use the same cgroup as several other services
> > (ControlGroup= is set in the service file). In this setup, is the
> > child created in the default cgroup and then moved to the configured
> > one? If not, why does the default-named cgroup exist at all, and why
> > is it being handled?
> 
> No, if you configured a cgroup name then no "default" cgroup naming
> is ever attempted.
> 
> Hmm, which hierarchy are you talking of BTW? Note that cgroup
> memberships in all hierarchies are pretty much orthogonal on the
> kernel-side of things. And systemd will allow you that too.
> 
> > I've noticed that there always exist cgroups for all services,
> > regardless of whether they are overridden to use another.
> 
> Really? Maybe in different hierarchies?
> 
> It would certainly be a bug if systemd ever creates a cgroup in the "cpu"
> hierarchy that is not the one you configured for the "cpu"
> hierarchy.
> 
> Any chance you can explain in a bit more detail how your cgroups are set
> up and what unit configuration switches you use for that?

Ok, let's see if I can explain what we've done here.

To introduce systemd in our system, we've started with just wrapping rc and all the old initscripts so we can get systemd running first and then afterwards start converting to native services.
The boot is basically two services: legacy_rcS.service (which runs "/etc/init.d/rc S") and legacy_rc3.service (which runs "/etc/init.d/rc 3"). There is also a legacy_rc4.service (wanted by upgrade.target) used for firmware upgrades and similar special system actions.
Journal, udev and syslog run as separate services outside these wrappers, and the idea is to migrate boot scripts to services a few at a time until the legacy wrappers are empty and can be dropped.

The following is the service file for the runlevel 3 wrapper:
[Unit]
Description=Legacy runlevel 3
Wants=legacy_rcS.service
After=legacy_rcS.service
Conflicts=legacy_rc4.service
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/etc/init.d/rc 3
StandardOutput=tty
Environment=RUNLEVEL=3
Environment=PREVLEVEL=X
ControlGroup=systemd:/system/legacy_rc.service
ControlGroupPersistent=true
KillMode=none

The same cgroup is configured for all the legacy services (rcS, rc3 and rc4).

When looking in sysfs, I see cgroups for all the legacy services, even though the rcS and rc3 services use the configured generic cgroup.
The following is from a working system. When a failure happens, rc and rcS are present, but not rc3:
# ls -d /sys/fs/cgroup/systemd/system/legacy*
/sys/fs/cgroup/systemd/system/legacy_rc.service
/sys/fs/cgroup/systemd/system/legacy_rc3.service
/sys/fs/cgroup/systemd/system/legacy_rcS.service

This is what I meant by "cgroups for all services" existing even though they have been overridden.
Without the ControlGroup= setting, legacy_rc3 and legacy_rcS would have used the cgroups with the same names. But since we've specified that we want a different name, I'm wondering why I still see the default names that we don't want to use.
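
For comparing a good boot with a broken one, a small helper along these lines could record which legacy groups exist, whether the sticky bit is set on each, and which PIDs are in them. This is a sketch only: describe_groups() is a hypothetical name, and it assumes the cgroup v1 layout shown above, where each group directory has a "tasks" file listing its member PIDs.

```python
import os
import stat


def describe_groups(parent, prefix="legacy"):
    """Return (name, sticky_bit_set, member_pids) for each matching group dir."""
    rows = []
    for name in sorted(os.listdir(parent)):
        path = os.path.join(parent, name)
        if not name.startswith(prefix) or not os.path.isdir(path):
            continue
        sticky = bool(os.lstat(path).st_mode & stat.S_ISVTX)
        try:
            # cgroup v1: one PID per line in the group's "tasks" file
            with open(os.path.join(path, "tasks")) as f:
                pids = [int(line) for line in f if line.strip()]
        except FileNotFoundError:
            pids = []  # group vanished or has no tasks file
        rows.append((name, sticky, pids))
    return rows


if __name__ == "__main__":
    parent = "/sys/fs/cgroup/systemd/system"
    if os.path.isdir(parent):
        for name, sticky, pids in describe_groups(parent):
            print(name, "sticky" if sticky else "not-sticky", pids)
```

Running it right after boot on both a working and a failing system would show whether the default-named groups are empty and unmarked when the failure occurs.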

/Anders