[RFC] initoverlayfs - a scalable initial filesystem
Eric Curtin
ecurtin at redhat.com
Mon Dec 11 11:42:15 UTC 2023
I am also wondering what the difference is between the "make the
bootloader load the erofs into contiguous memory" part and doing
something like storage-init.
They are similar approaches: both introduce something in the middle to
handle the erofs.
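
To make the memmap=X!Y variant concrete, here is a minimal sketch (Python, purely illustrative) of the kernel command-line fragment the bootloader side would have to produce once it has placed the erofs image in contiguous memory. The function name, the 2 MiB alignment and the added rootfstype=erofs and ro options are assumptions for the example; only memmap=X!Y and root=/dev/pmem0 come from the thread.

```python
# Sketch only: build the cmdline fragment for the memmap=X!Y approach.
# Assumptions (not from the thread): the bootloader has already copied the
# erofs image to physical address `load_addr`, and a 2 MiB alignment is
# used for both the start address and the region size.

MIB = 1024 * 1024
ALIGN = 2 * MIB  # hypothetical alignment chosen for this example


def round_up(value: int, align: int) -> int:
    """Round value up to the next multiple of align."""
    return (value + align - 1) // align * align


def pmem_cmdline(image_size: int, load_addr: int) -> str:
    """Return kernel arguments that expose the loaded image as /dev/pmem0."""
    if image_size <= 0:
        raise ValueError("image_size must be positive")
    if load_addr % ALIGN != 0:
        raise ValueError("load_addr must be aligned to ALIGN")
    region = round_up(image_size, ALIGN)
    # memmap=nn!ss reserves nn bytes starting at physical address ss
    # as a persistent-memory style region.
    return (
        f"memmap={region // MIB}M!{load_addr // MIB}M "
        "root=/dev/pmem0 rootfstype=erofs ro"
    )


if __name__ == "__main__":
    # A 50 MiB image loaded at the 4 GiB mark.
    print(pmem_cmdline(50 * MIB, 4096 * MIB))
```

With storage-init the "something in the middle" is a userspace process; here it is the bootloader plus this bit of address bookkeeping.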
Is mise le meas/Regards,
Eric Curtin
On Mon, 11 Dec 2023 at 11:28, Eric Curtin <ecurtin at redhat.com> wrote:
>
> On Mon, 11 Dec 2023 at 11:20, Eric Curtin <ecurtin at redhat.com> wrote:
> >
> > On Mon, 11 Dec 2023 at 10:06, Lennart Poettering <mzerqung at 0pointer.de> wrote:
> > >
> > > On Fr, 08.12.23 17:59, Eric Curtin (ecurtin at redhat.com) wrote:
> > >
> > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > storage devices initialized. storage-init is a process that is not
> > > > designed to replace init, it does just enough to initialize storage
> > > > (performs a targeted udev trigger on storage), switches to
> > > > initoverlayfs as root and then executes init.
> > > >
> > > > ```
> > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > >
> > > > fw -> bootloader -> kernel -> storage-init -> init ----------------->
> > > > ```
> > >
> > > I am not sure I follow what these chains are supposed to mean? Why are
> > > there two lines?
> >
> > The top line is the filesystem transition, the bottom is more like a
> > process perspective. Will make this clearer in future.
> >
> > >
> > > So, I generally would agree that the current initrd scheme is not
> > > ideal, and we have been discussing better approaches. But I am not
> > > sure your approach really is useful on generic systems for two
> > > reasons:
> > >
> > > 1. no security model? you need to authenticate your initrd in
> > > 2023. There's no excuse for not doing that anymore these days. Not
> > > in automotive, and not anywhere else really.
> >
> > Yes you are right, there is no excuse, the plan was to mount using
> > dm-verity most likely with the details from the initramfs, but
> > admittedly we had not looked into that in great detail.
> >
> > >
> > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > > unlock their root disks with TPM2 and similar things. People use
> > > RAID, LVM, and all that mess.
> >
> > We had 3 thoughts on this:
> >
> > 1. Just worry about the common use-cases and let everyone else
> > fall back to the approaches we use today.
> > 2. Try and split up systemd to make it even smaller. We do use
> > systemd-udev in the small initramfs storage-init process so far.
> > 3. Reimplement some things? But as little as possible, on a case by
> > case basis, we certainly don't want to fall into the trap of rewriting
> > systemd that's for sure, systemd does these things very well.
> >
> > Tbh, if we try and implement this in kernelspace a lot of these
> > questions go away. You just teach the kernel to deal with the
> > filesystem image early (say erofs or whatever other filesystem) and
> > have that data where initramfs data currently is. You still pay for
> > the initial read, but you still save a bunch of kernel time.
> >
> > >
> > > Actually the above are kinda the same problem in a way: you need
> > > complex storage, but if you need that you kinda need udev, and
> > > services, and then also systemd and all that other stuff, and that's
> > > why the system works like the system works right now.
> >
> > True, but there is also a bunch of stuff in current initrds today
> > that aren't required to mount basic storage, but are designed around
> > the whole idea of having an early throwaway filesystem.
> >
> > >
> > > Whenever you devise a system like yours by cutting corners, and
> > > declaring that you don't want TPM, you don't want signed initrds, you
> > > don't want to support weird storage, you just solve your problem in a
> > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > actually really work without all that and are willing to maintain the
> > > solution for your specific problem only.
> > >
> > > As I understand you are trying to solve multiple problems at once
> > > here, and I think one should start with figuring out clearly what
> > > those are before trying to address them, maybe without compromising on
> > > security. So my guess is you want to address the following:
> > >
> > > 1. You don't want the whole big initrd to be read off disk on every
> > > boot, but only the parts of it that are actually needed.
> > >
> > > 2. You don't want the whole big initrd to be fully decompressed on every
> > > boot, but only the parts of it that are actually needed.
> > >
> > > 3. You want to share data between root fs and initrd
> > >
> > > 4. You want to save some boot time by not bringing up an init system
> > > in the initrd once, then tearing it down again, and starting it
> > > again from the root fs.
> >
> > It's mainly the top 3 that were the goals. And that people have the
> > freedom to consider using heavier weight generic libraries, tools,
> > etc. if they want. You want to use Rust (or languages X, Y, Z) to
> > write something early boot, go ahead! You'll only pay the cost for the
> > larger binary if you actually use it. The week I started tinkering at
> > this, there was a mini-debate on whether we should include glib or not
> > in the initrd. And we are regularly under pressure to reduce boot time
> > at the moment.
> >
> > Number 4 was a convenient way to do an early version of this, stick a
> > process in between systemd and the kernel. But it turns out, it works
> > very well, the only problem is the reimplementation problem really.
> >
> > Theoretically this could be systemd-storage-init -> systemd also. Or
> > systemd and dlopen more libraries as they become available later down
> > the line.
> >
> > >
> > > For the items listed above I think you can find different solutions
> > > which do not necessarily compromise security as much.
> > >
> > > So, in the list above you could address the latter three like this:
> > >
> > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > > loader load the erofs into contiguous memory, then use memmap=X!Y on
> > > the kernel cmdline to synthesize a block device from that, which
> > > you then mount directly (without any initrd) via
> > > root=/dev/pmem0. This means your boot loader will still load the
> > > whole image into memory, but only decompress the bits actually
> > > needed. (It also has some other nice benefits I like, such as an
> > > immutable rootfs, which tmpfs-based initrds don't have.)
>
> What I am unsure about here is the "make the bootloader load the
> erofs into contiguous memory" part. I wonder whether we could try to use
> the existing initramfs data as is. I don't know if bootloaders make
> many assumptions about the format of that data; worst case, we could
> encapsulate the erofs in cpio-looking initramfs data, then teach the
> kernel not to decompress and process the whole thing but to mount it
> as an erofs instead. Does this sound crazy or reasonable?
> Sometimes you cannot change the code in a bootloader, and it would be
> nice if we could avoid introducing another layer of bootloader.
>
>
> >
> > Yes, let's explore this approach with the kernel community to gather
> > their thoughts. I'm still happy I did the userspace version first,
> > even if we end up doing it in kernelspace because it allowed me to
> > test on various pieces of hardware to see if the benefits are genuine
> > and they are....
> >
> > >
> > > 3. Simply never transition to the root fs, don't mark the initrd in
> > > systemd's eyes as an initrd (specifically: don't add an
> > > /etc/initrd-release file to it). Instead, just merge resources of
> > > the root fs into your initrd fs via overlayfs. systemd has
> > > infrastructure for this: "systemd-sysext". It takes immutable,
> > > authenticated erofs images (with verity, we call them "DDIs",
> > > i.e. "discoverable disk images") that it overlays into /usr/. [You
> > > could also very nicely combine this approach with systemd's
> > > portable services, and nspawn containers, which operate on the same
> > > authenticated images]. At MSFT we have a major product that works
> > > exactly like this: the OS runs off a rootfs that is loaded as an
> > > initrd, and everything that runs on top of this are just these
> > > verity disk images, using overlayfs and portable services.
> > >
> > > 4. The proposal in 3 also addresses goal 4.
> > >
> >
> > I'm hoping we can benefit both use cases, the case where you want to
> > transition to a rootfs and the case where you never want to transition
> > to a rootfs.
> >
> > > Which leaves item 1, which is a bit harder to address. We have been
> > > discussing this off and on internally too. A generic solution to this
> > > is hard. My current thinking for this could be something like this,
> > > covering the UEFI world: support sticking a DDI for the main initrd in
> > > the ESP. The ESP is by definition unencrypted and unauthenticated,
> > > but otherwise relatively well defined, i.e. known to be vfat and
> > > discoverable via UUID on a GPT disk. So: build a minimal
> > > single-process initrd into the kernel (i.e. UKI) that has exactly the
> > > storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> > > drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> > > jump into the rootfs stored in the ESP. That latter then has proper
> > > file system drivers, storage drivers, crypto stack, and can unlock the
> > > real root. This would still be a pretty specific solution to one set
> > > of devices though, as it could not cover network boots (i.e. where
> > > there is just no ESP to boot from), but I think this could be kept
> > > relatively close, as the logic in that case could just fall back into
> > > loading the DDI that would normally sit in the ESP fully into
> > > memory.
> > >
> >
> > I'm certainly a little biased here because I work with ARM, I would
> > like it to be UEFI world, but it's not and convincing every SoC vendor
> > that they must use UEFI is hard. I know a UEFI-only solution would
> > not have much value for my team at least.
> >
> > > (If you are focussing on systems lacking UEFI, then replace the word
> > > "ESP" in the above with a similar concept, i.e. a well discoverable,
> > > unauthenticated relatively simple file system, such as vfat).
> >
> > Yeah, agree, this baseline, I think, is common enough to assume.
> > Android Boot Images, for example, are basically a UKI binary stuffed
> > into a boot partition.
> >
> > >
> > > Anyway, I can't tell you how to solve your specific problems, but if
> > > there's one thing I'd suggest you to keep in mind then it's the
> > > security angle, i.e. keep in mind from the beginning how
> > > authentication of every component of your process shall work, how
> > > unattended disk encryption shall operate and how measurement shall
> > > work. Security must be built into things from the beginning, not be
> > > added as an afterthought.
> >
> > Yes and we certainly want something that fits with the UKI models and
> > the other commonplace models around.
> >
> > >
> > > Lennart
> > >
> > > --
> > > Lennart Poettering, Berlin
> > >