[systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

Felip Moll felip at schedmd.com
Wed Mar 16 15:15:23 UTC 2022


On Tue, Mar 15, 2022 at 5:24 PM Michal Koutný <mkoutny at suse.com> wrote:

> On Tue, Mar 15, 2022 at 04:35:12PM +0100, Felip Moll <felip at schedmd.com>
> wrote:
> > Meaning that it would be great to have a delegated cgroup subtree without
> > the need of a service or scope.
> > Just an empty subtree.
>
> It looks appealing to add Delegate= directive to slice units.
> Firstly, that'd prevent the use of the slice by anything systemd.
> Then some notion of owner of that subtree would have to be defined (if
> only for cleanup).
> That owner would be a process -- bang, you created a service with
> delegation or a scope with "keepalive" process.
>
>
Correct, this is how the current systemd design works.
But... what if the concept of an owner were irrelevant? What if we could just
tell systemd: hey, give me /sys/fs/cgroup/mysubdir and never ever touch it,
or anything in it, or any pids residing in it.
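In code, the idea would be as cheap as this minimal Python sketch (the delegated path and the per-job layout are hypothetical; the file names are just the standard cgroup v2 interface):

```python
import os

def create_job_cgroup(delegated_root, job_id, pid):
    """Create a per-job cgroup under a delegated subtree and move a pid into it.

    delegated_root is the path systemd would hand over and never touch
    (hypothetical, e.g. /sys/fs/cgroup/mysubdir); job_id and pid identify
    the job step. No dbus round-trip, just a mkdir and a write.
    """
    job_dir = os.path.join(delegated_root, "job_%s" % job_id)
    os.makedirs(job_dir, exist_ok=True)
    # Writing the pid to cgroup.procs migrates the process into the new cgroup.
    with open(os.path.join(job_dir, "cgroup.procs"), "w") as f:
        f.write(str(pid))
    return job_dir
```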



> (The above is slightly misleading) there could be an alternative of
> something like RemainAfterExit=yes for scopes, i.e. such scopes would
> not be stopped after last process exiting (but systemd would still be in
> charge of cleaning the cgroup after explicit stop request and that'd
> also mark the scope as truly stopped).
> Such a recycled scope would only be useful via
> org.freedesktop.systemd1.Manager.AttachProcessesToUnit().
>
>
This is also a good idea.
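For reference, a sketch of what that call would look like from the client side via busctl; the scope name is made up, and "ssau" is the signature the Manager interface declares for AttachProcessesToUnit (unit name, sub-cgroup, array of pids):

```python
def attach_pid_cmd(unit, subcgroup, pid):
    """Build a busctl invocation of
    org.freedesktop.systemd1.Manager.AttachProcessesToUnit().

    The unit name passed in is hypothetical; the bus name, object path,
    interface and method are the real systemd Manager ones.
    """
    return [
        "busctl", "call",
        "org.freedesktop.systemd1", "/org/freedesktop/systemd1",
        "org.freedesktop.systemd1.Manager", "AttachProcessesToUnit",
        "ssau",            # signature: two strings + array of uint32
        unit, subcgroup,
        "1", str(pid),     # array length, then the pid itself
    ]
```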



> BTW I'm also wondering how you detect a job finishing in the case the
> original parent is gone (due to a main service restart) and the job's main
> process is reparented?
>
>
slurmstepd connects to slurmd through a socket and sends an RPC.
If slurmd is gone, slurmstepd (the child) will retry the RPC and remain alive
until slurmd appears again and responds.

The main process doesn't wait for its children; instead we do a double fork
so that the child is reparented to the init process (PID 1).


> BTW 2 You didn't like having a scope for each job. Is it because of the
> setup time (IOW jobs are short-lived) or persistent scopes overhead (too
> many units, PID1 scalability)?
>

It is not that I didn't like it. It is that I observed a delay in step
creation (forking slurmstepd), because sending an async dbus message required
the stepd to wait for the systemd job to be executed, and that can take time;
computationally a lot more than just a mkdir in the cgroup subtree. To give
an example, an 'srun hostname' command starts a job which runs hostname. The
response is instantaneous with mkdirs, but it takes almost one second with a
call to systemd through dbus.

Slurm is used for HPC, but also for HTC (High Throughput Computing), which
means hundreds of jobs can be started in a short period of time, so yes, this
delay is critical; not only because jobs can be short-lived, but because a
massive job finish plus job start can happen at the same time. I just ran one
pass of our regression tests and the responsiveness of
'systemctl list-unit-files' was compromised. Also, from the point of view of
a sysadmin this was not ideal, so, as you say, scalability of PID 1 is also
a concern.

This is the reason I will not be using one scope per job; I prefer the other
solution of a single scope with Delegate=yes.
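Concretely, something along these lines (just a sketch; the drop-in path and unit name are hypothetical, but Delegate= is the real directive from systemd.resource-control):

```
# Hypothetical drop-in giving Slurm one fully delegated subtree,
# e.g. /etc/systemd/system/slurmd.service.d/delegate.conf
[Service]
Delegate=yes
```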

Does it make sense?
