[systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

Felip Moll felip at schedmd.com
Thu Mar 3 17:35:36 UTC 2022


Hi folks, I wanted to keep the case as generic as possible, but I think it
is important at this point to say what we're actually talking about, so let
me clarify a bit the case I am dealing with at the moment.

At SchedMD, we want Slurm to support cgroup v2. As you may know, Slurm is
an HPC resource manager, and for the moment we're limited to cgroup v1. We
currently use the freezer, memory, cpuset, cpuacct and devices controllers
in v1. We think it is a good time to add a plugin to our software to make
it capable of running on unified systems, and since systemd is widely
used we want to do this integration as well as we can, to coexist with
systemd and not get our pids moved or make systemd mad.

We have a 'slurmd' daemon running on every compute node, waiting for
communications from the controller. The controller submits different kinds
of RPCs to slurmd, and at one point an RPC can instruct slurmd to start a
new job step for a specific uid. Slurmd then forks twice; the original
slurmd just returns and goes back to other work. The first fork (child)
sets up a bunch of pipes and prepares initialization data, then forks
again, producing a grandchild. The grandchild finally exec's the
slurmstepd daemon, which receives the initialization data, prepares the
cgroups, and finally fork+exec's the user software. This can happen many
times per second, because with a single RPC call a user can submit a "job
array" containing thousands of steps, while thousands of other steps may
be finishing at the same time, so the work systemd would need to do
starting up new scopes/services and/or stopping them, plus monitoring all
of this, could be considerable.
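The spawning sequence above can be sketched roughly like this (a minimal
sketch: the pipe stands in for the real initialization-data channel, and
the real grandchild would exec slurmstepd instead of writing a marker):

```python
import os

def spawn_step():
    """Double-fork so the step daemon is not a child of slurmd.

    The intermediate child exits immediately (so slurmd reaps it and
    goes back to other work), and the grandchild is reparented to init.
    In the real code the grandchild would exec slurmstepd and hand it
    the initialization data over the pipes.
    """
    r, w = os.pipe()              # stand-in for the initialization pipes
    pid = os.fork()
    if pid == 0:                  # first fork: intermediate child
        if os.fork() == 0:        # second fork: grandchild
            os.close(r)
            # real code: os.execv("/usr/sbin/slurmstepd", ...)
            os.write(w, b"stepd-ready")
            os._exit(0)
        os._exit(0)               # intermediate child ends immediately
    os.close(w)
    os.waitpid(pid, 0)            # reap only the intermediate child
    data = os.read(r, 32)         # blocks until the grandchild reports in
    os.close(r)
    return data

if __name__ == "__main__":
    print(spawn_step().decode())
```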

After this introduction I have to say that we successfully managed to
follow systemd's rules by just starting a unit for slurmd with
Delegate=yes and creating our own hierarchy inside. Every slurmstepd
would be forked and started in the delegated cgroup, and would create its
own directory and move itself where it belongs (always inside the
delegated cgroup), according to our needs. Everything ran smoothly until
I restarted slurmd while slurmstepds were still running in the cgroup:
systemd was unable to start slurmd again because the cgroup could not be
removed, being busy with directories and slurmstepds. That is the main
reason for this bug report.

Note that one feature of Slurm is that one can upgrade/restart slurmd
without affecting running jobs (slurmstepds) on the compute node.
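For reference, the unit we tested with looks roughly like this (a
simplified sketch, not our shipped unit; paths and options such as
KillMode are assumptions for illustration):

```ini
# /etc/systemd/system/slurmd.service (simplified sketch)
[Unit]
Description=Slurm node daemon

[Service]
Type=simple
ExecStart=/usr/sbin/slurmd -D
# Hand the cgroup subtree below the unit over to slurmd, so it can
# create and manage its own hierarchy for the job steps.
Delegate=yes
# Kill only the main process on stop/restart, leaving slurmstepds alive.
KillMode=process

[Install]
WantedBy=multi-user.target
```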

I have read and studied all your suggestions and I understand them.
I also did some performance tests in which I fork+exec'ed a systemd-run
to launch a service for every step, and overall performance was bad:
one of our QA tests (test 9.8 of our testsuite) shows a 3x performance
decrease.

But, on the positive side, we did a test in which, when starting up
slurmd, we manually fork+exec'ed one separate delegated service and moved
newly forked slurmstepd pids *manually* into the cgroup associated with
that service. The service contains a 'sleep infinity' as the main pid so
that the cgroup does not disappear even when no slurmstepds are running.
As I say, this is a dirty test, but it works.
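Mechanically, the "manual move" is just a write of the pid into the
target cgroup's cgroup.procs file; a sketch (the cgroup path shown in the
comment is an assumption, in the dirty test it came from the extra unit
systemd created):

```python
import os

def move_pid_to_cgroup(cgroup_dir, pid):
    """Attach `pid` to the cgroup rooted at `cgroup_dir`.

    On cgroup v2 this is done by writing the pid into the cgroup's
    cgroup.procs file; the kernel then migrates the process there.
    """
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(str(pid))

# e.g. move ourselves under the long-lived service's cgroup
# (path assumed for illustration):
# move_pid_to_cgroup(
#     "/sys/fs/cgroup/system.slice/slurmstepd-keeper.service",
#     os.getpid())
```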

After reading your last two emails, I think the most efficient way for us
to go is this one:

> Firing an async D-Bus packet to systemd should be hardly measurable.
>
> But note that you can also run your main service as a service, and
> then allocate a *single* scope unit for *all* your payloads. That way
> you can restart your main service unit independently of the scope
> unit, but you only have to issue a single request once for allocating
> the scope, and not for each of your payloads.
>
>
My questions are: where would the scope reside? Does it have an
associated cgroup?
If I am a new slurmstepd, can I attach myself to this scope, or must I be
attached by slurmd before being executed?
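For context, the way I understand a transient scope would be allocated
from our side is via the manager's StartTransientUnit D-Bus call. A rough
sketch of the busctl equivalent (the unit name, pid, and property choices
are assumptions on my part):

```python
# Build the busctl invocation that would ask systemd to create a
# transient scope containing `pid`, with its cgroup delegated to us.
# Signature per org.freedesktop.systemd1.Manager:
#   StartTransientUnit(s name, s mode, a(sv) properties, a(sa(sv)) aux)
def scope_call(unit, pid):
    return [
        "busctl", "call",
        "org.freedesktop.systemd1", "/org/freedesktop/systemd1",
        "org.freedesktop.systemd1.Manager", "StartTransientUnit",
        "ssa(sv)a(sa(sv))",
        unit, "fail",
        "2",                          # two properties follow
        "PIDs", "au", "1", str(pid),  # initial member of the scope
        "Delegate", "b", "true",      # delegate the cgroup subtree
        "0",                          # no auxiliary units
    ]

if __name__ == "__main__":
    print(" ".join(scope_call("slurmstepd.scope", 1234)))
```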


> But that too means you have to issue a bus call. If you really don't
> like talking to systemd this is not going to work of course, but quite
> frankly, that's a problem you are making yourself, and I am not
> particularly sympathetic to it.
>

I can study this option. It is not that I like or dislike talking to
systemd; the idea is that Slurm must work on other OSes, possibly without
systemd but still with cgroup v2, and it must remain compatible with
cgroup v1 and with no cgroups at all. It's about thinking of the future:
the less complexity and fewer particularities the software has, the more
maintainable and flexible it is. I think this is understandable, but if
it is not possible at all we will have to adapt.


> > DelegateCgroupLeaf=<yes|no>. If set to yes an extra directory will be
> > created into the unit cgroup to place the newly spawned service process.
> > This is useful for services which need to be restarted while its forked
> > pids remain in the cgroup and the service cgroup is not a leaf
> > anymore.
>
> No. Let's not add that.
>

I could foresee the benefits of such an option, but I can also see the
issues, from a systemd perspective, of having to deal with the remaining
processes of a unit which is being restarted.
I am not sure why you think it is not ok, though; an option like this
would fix all my problems, which are only about the restart of slurmd :)

I am also curious about what this sentence means exactly:

"You might break systemd as a whole though (for example, add a process
directly to a slice's cgroup and systemd will be very sad).".


Thank you for all your comments.
I hope we arrive at a good solution that is compliant with systemd and at
the same time gives us the flexibility we want.

--
Felip Moll
SchedMD - http://schedmd.com