[systemd-devel] [RFC] the chopping block

Christian Seiler christian at iwakd.de
Fri Feb 12 23:10:20 UTC 2016


On 02/12/2016 10:34 PM, Lennart Poettering wrote:
> On Fri, 12.02.16 17:49, Simon McVittie (simon.mcvittie at collabora.co.uk) wrote:
> 
>> On 11/02/16 17:06, Lennart Poettering wrote:
>>> 5) Here's the controversial one I think: support for booting up
>>>    without /var. We have kludges at quite a few places because we
>>>    cannot access /var early during boot.
>>
>> I don't think /var is really the same thing as /usr: for a start, it has
>> to be read/write, whereas /usr and / can be read-only for at least the
>> early stages of boot.
>>
>> On stateless systems with a read-only / and /etc, requiring /var to be
>> mounted from the initramfs would mean that the mechanism for setting up
>> /var (NFS or tmpfs or whatever) would have to move into the
>> initramfs.
> 
> Since initrds tend to cover root-on-nfs, root-on-iscsi and so on
> anyway, that sounds like no change in behaviour really..

Well, kind-of. The root-on-nfs and root-on-iscsi are dumbed-down
versions of what's possible once a system is booted.



iSCSI: currently the rootfs works fine, because for the rootfs one
can easily tell the initramfs implementation explicitly that it's
on iSCSI. If your rootfs is on network storage, you have to do so
anyway, so that's not an issue.

But there's no way to determine *just* from looking at /etc/fstab
that a given file system is on iSCSI (or nbd for that matter),
because those just look like regular SCSI block devices (which
don't exist yet if the initramfs hasn't logged in to the iSCSI
session).
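
To illustrate the point, here is a minimal sketch of what an initramfs could conclude from fstab alone. The sample entries and device names are made up, and the set of "network" filesystem types is only illustrative. The only hints it has are the filesystem type and an explicit admin annotation such as the _netdev mount option; a LUN that happens to live on iSCSI (here, hypothetically, /dev/sdb1) is indistinguishable from a local disk.

```python
# Hypothetical fstab content; /dev/sdb1 is imagined to be an iSCSI LUN,
# but nothing in the entry says so.
SAMPLE_FSTAB = """\
# <device>            <mountpoint> <type> <options>  <dump> <pass>
/dev/sda1             /            ext4   defaults   0 1
/dev/sdb1             /var/log     ext4   defaults   0 2
server:/export/home   /home        nfs4   sec=krb5p  0 0
/dev/sdc1             /srv         ext4   _netdev    0 2
"""

NETWORK_FSTYPES = {"nfs", "nfs4", "cifs"}  # illustrative, not exhaustive


def needs_network(fstype, options):
    """True only when the fstab entry itself is marked as network-backed."""
    return fstype in NETWORK_FSTYPES or "_netdev" in options.split(",")


def classify(fstab_text):
    """Map each mount point to what fstab alone reveals about it."""
    result = {}
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        device, mountpoint, fstype, options = line.split()[:4]
        if needs_network(fstype, options):
            result[mountpoint] = "network"
        else:
            result[mountpoint] = "looks local"
    return result


if __name__ == "__main__":
    for mountpoint, verdict in classify(SAMPLE_FSTAB).items():
        print(f"{mountpoint:10} {verdict}")
```

In this sketch /var/log ends up as "looks local" even though its block device would only appear after the iSCSI login.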

This is already somewhat problematic for /usr, but since I've never
seen a setup where people put /usr on iSCSI but not /, this was
never a huge issue in that regard.

On the other hand, what I have seen in practice are systems with
/var/log on iSCSI.

Also, if you look at how iSCSI login in the initramfs works currently,
it's basically just running a binary called 'iscsistart' that tells
the kernel to log in to the specific session that the rootfs is on;
the real daemon isn't started yet. So only a specific session
that is configured separately (!) from all the other configured
sessions is logged into in the initramfs - and the daemon that
reads the proper configuration is only started after the system has
booted.

So in order to support generic filesystems on iSCSI in initramfs,
one would need to start the full daemon already in the initramfs,
plus the full configuration database must be available in the
initramfs (which can change with just some admin commands, after
which the admin would need to remember to regenerate the
initramfs image), and the daemon would need to be modified to
support that.





NFS: nfsroot is supported only for NFSv2/3 and (depending on the
initramfs implementation) for NFSv4 with sec=sys without idmapping.
If you need NFSv4 with idmapping or want to actually have a secure
NFS mount (e.g. encrypted + authenticated via Kerberos), that
currently doesn't work at all from the initramfs. idmapping
requires that request-key works within the initramfs and properly
calls the nfsidmap binary, which will in turn usually require
the full NSS stack of the system to be available. For Kerberos
you need rpc.gssd (the client-side GSS daemon) to be running, as
well as a program like k5start running to get a ticket for the
root user, otherwise the file system is inaccessible on a kernel
level. (And Kerberos also requires idmapping btw.) Also, in
contrast to e.g. iSCSI, where you could probably get away with
killing the daemon before switching to the rootfs and then
restarting it, both the idmapping binaries and rpc.gssd have to
remain available (the former as an upcall from the kernel, the
latter as a running daemon), otherwise the kernel won't be able
to properly handle the filesystem.
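
The dependency chain above can be summarised in a small sketch. This is only a restatement of the paragraph in code, not how any initramfs generator actually works: the function name, the `idmapping` flag and the requirement labels are made up for illustration, and only the sec= mount option values (sys, krb5, krb5i, krb5p) are real.

```python
def parse_options(options):
    """Split a comma-separated mount option string into a dict."""
    parsed = {}
    for item in options.split(","):
        if not item:
            continue
        key, _, value = item.partition("=")
        parsed[key] = value
    return parsed


def initramfs_requirements(options, idmapping=False):
    """List the userspace pieces an NFS mount would need in the initramfs.

    Hypothetical helper: it just encodes the dependencies described in
    the text above.
    """
    opts = parse_options(options)
    kerberos = opts.get("sec", "sys").startswith("krb5")
    needs = []
    # Kerberos also requires idmapping, so both cases pull in the same set.
    if idmapping or kerberos:
        needs += ["request-key upcall", "nfsidmap binary", "NSS stack"]
    if kerberos:
        needs += [
            "GSS daemon (kept running)",
            "ticket for root (e.g. via k5start)",
        ]
    return needs


if __name__ == "__main__":
    for opts in ("vers=4,sec=sys", "vers=4,sec=krb5p"):
        print(opts, "->", initramfs_requirements(opts) or "nothing extra")
```

Only the plain sec=sys case without idmapping comes out with an empty list, which matches what nfsroot can handle today.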






And NFS and iSCSI are just two things I have quite a bit of
experience with. You could also imagine that people put /var/log
on sshfs, or any other FUSE filesystem for that matter, which as
of now works, but will break if you introduce the change, because
hardly any FUSE filesystems (if any at all) support running from
initramfs. Or you could have /var/log as a bind
mount of a directory within an OCFS2 filesystem on a multi-master
DRBD. It's not that difficult to set up on a normal system, but
good luck getting that to work from an initramfs.





Of course, it's not impossible to make all these setups work.
But it would require changes to a lot of other software that's
currently used; those changes are probably going to be relatively
painful, and they are going to be a lot of work for a lot of
other people.

The maintenance burden in systemd for buffering things in /run
before /var, /var/log, etc. are available is minuscule compared
to the amount of pain this change would cause other people.
What would more likely happen is that this would not be
implemented in many cases, and once the version of systemd with
this requirement hits distributions, it would break users'
systems without them being able to run their setup as designed.
I think that would be really bad.
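
For a sense of scale, the kind of buffering meant here can be sketched in a few lines. This is not systemd's or journald's actual code, just a toy version of the pattern under the assumption that "available" simply means the persistent directory exists; the class and file names are invented.

```python
import os
import shutil
import tempfile


class BufferedLog:
    """Write to a volatile directory until the persistent one exists."""

    def __init__(self, volatile_dir, persistent_dir):
        self.volatile_dir = volatile_dir
        self.persistent_dir = persistent_dir
        os.makedirs(volatile_dir, exist_ok=True)

    def _target(self):
        # Prefer persistent storage as soon as it is there.
        if os.path.isdir(self.persistent_dir):
            return self.persistent_dir
        return self.volatile_dir

    def write(self, line):
        with open(os.path.join(self._target(), "messages.log"), "a") as f:
            f.write(line + "\n")

    def flush_to_persistent(self):
        """Append buffered lines to persistent storage, then drop the buffer."""
        src = os.path.join(self.volatile_dir, "messages.log")
        if not os.path.isdir(self.persistent_dir) or not os.path.exists(src):
            return False
        dst = os.path.join(self.persistent_dir, "messages.log")
        with open(src) as buffered, open(dst, "a") as out:
            shutil.copyfileobj(buffered, out)
        os.remove(src)
        return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        run_dir = os.path.join(root, "run", "log")
        var_dir = os.path.join(root, "var", "log")
        log = BufferedLog(run_dir, var_dir)
        log.write("early boot message")   # persistent dir not there yet
        os.makedirs(var_dir)              # simulate /var becoming available
        log.flush_to_persistent()
        log.write("late boot message")
        with open(os.path.join(var_dir, "messages.log")) as f:
            print(f.read(), end="")
```

The real thing has more corner cases, of course, but the shape of the problem stays this small.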

Note that this is different from /usr: booting with /usr not yet
mounted was already broken beforehand, whereas for /var it
currently isn't. A lot of
the scenarios I've described above haven't worked for /usr
beforehand anyway (e.g. I haven't seen a single distribution
that didn't have Kerberos stuff in /usr even before any UsrMerge
so that /usr via Kerberized NFSv4 wasn't possible anyway) and so
there were already many, many more constraints for /usr, so that
the breakage in that case was quite limited. Also, in contrast
to /usr, where a merged /usr actually has very real advantages,
such as enabling stateless systems, I don't see any advantages
for /var here, other than making systemd simpler in a very
minor way.

I don't think that trade-off is warranted.

Regards,
Christian

PS: Btw. if you do run journald already in initramfs (which I
think is a good thing to have), then it still needs to have
code to flush /run/log/journal to /var/log/journal. So in that
case you don't actually gain anything.
