[systemd-devel] Should services be able to run without /proc?
Lennart Poettering
lennart at poettering.net
Wed Feb 10 08:56:30 UTC 2021
On Tue, 09.02.21 15:57, Antonius Frie (antonius.frie at ruhr-uni-bochum.de) wrote:
> Hi!
>
> So this is kind of a follow-up to the thread in [1], and the corresponding
> PR in [2].
>
> In short, the PR made some changes to allow for cases where /proc was not
> available in the mount namespace of the service, and added a test [3] to
> make sure that this would work. This test was later removed and rewritten to
> block /sys instead [4], because it turned out that having /proc unavailable
> sometimes caused problems with close_all_fds(), which is called in
> exec_child() after namespaces have been set up.
>
> On current master, services that don't have /proc mounted don't work at all
> anymore, since find_executable_full() ends up opening the given path and
> calling access_fd() on the resulting fd, and access_fd uses /proc/self/fd/*
> to turn the fd back into a path it can call access() on. As far as I can
> tell, the reason for not using access on the path directly is that access_fd
> is more elegant since it avoids a potential race condition.
Yes, we try to move to a mode where for most such things that involve
context switches/credential switches/domain transitions we operate via
O_PATH file handles: i.e. resolve in our original context, until we
only have fds pointing to the final thing, and then do the final
operation only on those fds. This should fix a bunch of races and
potential races for us.
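As a rough illustration of the pattern (a minimal Python sketch, not
systemd's actual access_fd(); the helper names proc_fd_path() and
access_via_fd() are made up here): resolve the path once into an O_PATH
fd, then run the final check against /proc/self/fd/<fd>, so that it
applies to the already-pinned inode rather than to whatever the path
points to later. Linux-only, and it assumes /proc is mounted.

```python
import os
import tempfile


def proc_fd_path(fd: int) -> str:
    """Return the /proc/self/fd path that refers back to an open descriptor."""
    return f"/proc/self/fd/{fd}"


def access_via_fd(path: str, mode: int) -> bool:
    """Resolve `path` once with O_PATH, then check access on the pinned fd.

    Illustrative only. Without a mounted /proc the /proc/self/fd path does
    not exist and os.access() simply returns False, which is the failure
    mode discussed in this thread.
    """
    fd = os.open(path, os.O_PATH | os.O_CLOEXEC)
    try:
        return os.access(proc_fd_path(fd), mode)
    finally:
        os.close(fd)
```

Calling access_via_fd("/usr/bin/env", os.X_OK) would be the rough
equivalent of the executable check described above.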
> In addition to this, setup_private_users() also needs access to
> /proc/$pid/{uid_map, gid_map, setgroups} to do its job.
Yes, a multitude of Linux APIs are exposed via /proc/. I think outside
of trivial programs it's very hard to avoid having /proc/. glibc
internally encodes access to it all over the place too.
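To make the uid_map case concrete, here is a small Python sketch of how
such a file is consumed (the IdMapping type and both function names are
invented for illustration). Each line of uid_map/gid_map carries three
numbers: the first ID inside the namespace, the first ID outside it,
and the length of the range.

```python
from typing import List, NamedTuple


class IdMapping(NamedTuple):
    inside: int   # first ID as seen inside the user namespace
    outside: int  # first ID as seen from outside (typically the parent namespace)
    length: int   # number of consecutive IDs covered by this entry


def parse_id_map(text: str) -> List[IdMapping]:
    """Parse the contents of a uid_map or gid_map file."""
    mappings = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        inside, outside, length = (int(f) for f in fields)
        mappings.append(IdMapping(inside, outside, length))
    return mappings


def read_own_uid_map() -> List[IdMapping]:
    """Read this process's UID map; raises OSError if /proc is not mounted."""
    with open("/proc/self/uid_map") as f:
        return parse_id_map(f.read())
```

There is no system call that returns the same information, so code like
read_own_uid_map() has no fallback when /proc is missing.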
> Given all this, I guess my question is whether it is still desirable to
> allow units to run without /proc, especially given that ProtectProc and
> ProcSubset exist now.* If not, it might be nice to just always mount /proc
> if it wouldn't otherwise be there (i.e. if RootImage/RootDirectory is used);
> currently, MountAPIVFS=yes is basically a required option because of this.
> (I guess you could mount proc manually, but then you can't use
> ProtectProc/ProcSubset.) I'm a bit unhappy about this, because MountAPIVFS
> also mounts /sys and /dev, and then you need separate options just to
> protect those again. Either way, maybe it would be good to explicitly state
> this requirement in the documentation?
We could add MountAPIVFS=proc or so as alternative to yes/no, which
would only mount /proc.
Note that on current git MountAPIVFS= also mounts /run/, and it now
defaults to true if RootImage=/RootDirectory= are used, see commit
6119878480aab4c10ad6af33deab221778683807.
You can still force MountAPIVFS=no btw, to get back the status quo
ante: i.e. a RootImage=/RootDirectory= env without /proc.
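For reference, the two configurations discussed here could look roughly
like this. The sketch below renders them from Python only so that the
result can be checked; the render_service_section() helper and the
/srv/example-root path are hypothetical, and only options that exist
today are used (no MountAPIVFS=proc).

```python
from typing import Dict


def render_service_section(options: Dict[str, str]) -> str:
    """Render a [Service] section from option/value pairs, in insertion order."""
    lines = ["[Service]"]
    lines += [f"{key}={value}" for key, value in options.items()]
    return "\n".join(lines) + "\n"


# Hypothetical sandboxed service: /proc is mounted but restricted.
SANDBOXED = {
    "RootDirectory": "/srv/example-root",
    "MountAPIVFS": "yes",         # mounts /proc, /sys, /dev (and /run on current git)
    "ProtectProc": "ptraceable",  # hide processes the service could not ptrace
    "ProcSubset": "pid",          # expose only the per-process parts of /proc
}

# Status quo ante: same root, explicitly without the API file systems.
WITHOUT_PROC = {"RootDirectory": "/srv/example-root", "MountAPIVFS": "no"}

if __name__ == "__main__":
    print(render_service_section(SANDBOXED))
    print(render_service_section(WITHOUT_PROC))
```

The second variant is the one that runs into the access_fd() and
setup_private_users() problems described earlier in the thread.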
> Anyway, I hope that this was okay to post here, I don't really know a lot
> about this and maybe there are good reasons for why things are the way they
> are. I'd be happy about feedback though.
Yes, this is the right place.
If the MountAPIVFS=proc thing would be useful to you, consider
filing an RFE issue asking for it on GitHub. Or even better,
submit a PR.
> * Using both ProtectProc=ptraceable and ProcSubset=pid really doesn't
> let a lot of things through, and I don't think those interfere with any of
> the functions described above. The only thing I'm unsure about is
> setup_private_users(), since that spawns off a child process which then
> goes and writes to /proc/$parent_pid/, but I guess children can ptrace
> their parents? At least it seemed to work when I just tested it.
On traditional Linux "ptraceable" means "uid matches". With the Yama
LSM parents can ptrace their children but not vice versa.
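Which of the two regimes applies on a given machine can be read from
Yama's ptrace_scope sysctl. A minimal Python sketch (function names are
made up for illustration; the value meanings follow the kernel's Yama
documentation):

```python
from typing import Optional

# Meanings of /proc/sys/kernel/yama/ptrace_scope per the kernel's Yama docs.
YAMA_SCOPES = {
    0: "classic: any process with a matching uid may attach",
    1: "restricted: only ancestors (or declared tracers) may attach",
    2: "admin-only: CAP_SYS_PTRACE required to attach",
    3: "no attach: ptrace attach disabled entirely",
}


def describe_ptrace_scope(value: Optional[int]) -> str:
    """Turn a ptrace_scope value (or None if Yama is absent) into a description."""
    if value is None:
        return "Yama not available: classic uid-based ptrace checks apply"
    return YAMA_SCOPES.get(value, f"unknown ptrace_scope value {value}")


def read_ptrace_scope(path: str = "/proc/sys/kernel/yama/ptrace_scope") -> Optional[int]:
    """Return the current ptrace_scope, or None if the file cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None
```

describe_ptrace_scope(read_ptrace_scope()) then reports the mode in
effect. Note that the lookup itself goes through /proc again.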
Lennart
--
Lennart Poettering, Berlin