[systemd-devel] more verbose debug info than systemd.log_level=debug?

Lennart Poettering lennart at poettering.net
Mon Apr 10 10:29:35 UTC 2017


On Mon, 10.04.17 19:38, Michael Chapman (mike at very.puzzling.org) wrote:

> On Mon, 10 Apr 2017, Lennart Poettering wrote:
> > On Mon, 10.04.17 18:45, Michael Chapman (mike at very.puzzling.org) wrote:
> > 
> > > On Mon, 10 Apr 2017, Lennart Poettering wrote:
> > > > On Sun, 09.04.17 10:11, Michael Chapman (mike at very.puzzling.org) wrote:
> > > > 
> > > > > Don't forget, they've provided an interface for software to use if it needs
> > > > > more than the guarantees provided by sync. Informally speaking, the FIFREEZE
> > > > > ioctl is intended to place a filesystem into a "fully consistent" state, not
> > > > > just a "fully recoverable" state. (Formally it's all a bit hazy: POSIX
> > > > > really doesn't guarantee anything with sync.)
> > > > 
> > > > FIFREEZE does considerably more than what you suggest: it also pauses
> > > > all further changes until FITHAW is called. And that's semantics we
> > > > really cannot have.
> > > 
> > > If systemd is just about to call reboot(2), why does it matter?
> > 
> > Well, in the general case we don't actually call reboot(), because we
> > instead transition back into the initrd, which then eventually calls
> > that. At least that's what happens on the major general purpose
> > distros that have an initrd that does that (for example: Fedora/RHEL
> > with Dracut).
> 
> If it's not systemd _inside_ the initrd calling reboot(2), then there's
> nothing systemd can do about it.

The initrd usually doesn't run a systemd environment anymore; PID 1 is
usually a shell script of some kind. It might use our "reboot" binary
and call it with "-ff" (which means it's really just a pure reboot()
wrapper), but we don't do a umount/kill/detach spree in that case. Or
in other words: if they do use our reboot utility then they use it in
pure sysvinit compat mode, where it won't do more than sync() +
reboot(), exactly the same way as sysvinit did.

> > I am sorry, but just making all accesses hang is just broken. That
> > can't work.
> > 
> > > I do think we should attempt to remount readonly before doing the FIFREEZE.
> > > I thought systemd did that, but it appears that it does not. A readonly
> > > remount will do what we want so long as no remaining processes have any
> > > files opened for writing on the filesystem. The FIFREEZE would only be
> > > necessary when the remount fails.
> > 
> > We remount everything read-only we can if we cannot unmount
> > something.
> 
> Ah, I see the code for that now. I was looking for something after the
> umount call (specifically, if umount failed), not before.

Well, the scheme works like this: we kill, umount, remount, detach in
a loop until nothing changes anymore. It's a primitive but robust way
to deal with stacked storage, where running processes might pin file
systems, which might in turn pin devices, which might in turn pin
backend userspace services, and so on... Hence, yes, we do the umount
first, and the remount second, but then we'll try another umount again
and another remount, until this stops being fruitful.
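
To illustrate the "loop until nothing changes" scheme described above,
here is a minimal sketch. It is not systemd's actual code: the function
name, the step callables and the max_rounds safety cap are all
hypothetical. Each step is assumed to return True if it managed to
change anything (killed a process, unmounted a file system, and so on).

```python
def run_until_stable(steps, max_rounds=100):
    """Run every step in order, round after round, until one full
    round passes in which no step reports a change.

    steps      -- callables such as (kill, umount, remount, detach);
                  hypothetical placeholders, each returns True if it
                  changed something.
    max_rounds -- safety cap (an assumption of this sketch).

    Returns the number of rounds executed.
    """
    rounds = 0
    while rounds < max_rounds:
        rounds += 1
        changed = False
        for step in steps:
            # Call every step each round; do not short-circuit, since
            # a later step may succeed even if an earlier one did not.
            if step():
                changed = True
        if not changed:
            break
    return rounds
```

The point of the loop is that a successful step in one round (say,
killing a process) may unblock a different step in the next round (say,
unmounting the file system that process had pinned).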

> > But do note that we can't do that in all cases. Most
> > prominently: consider a process that is running from an executable
> > that has been updated on disk (specifically: whose binary got deleted
> > because it was replaced by a newer version). This process will keep
> > the file pinned, and will block all read-only remounts, as the kernel
> > wants to mark the file properly deleted first, but it can't since the
> > process is keeping it pinned.
> > 
> > This is specifically the case that happened for Plymouth: the binary
> > probably got updated, hence the process in memory references a deleted
> > file, which blocks the read-only remounting, in which case we can't do
> > anything, and sync and remount.
> 
> OK, so how about this. _After_ the unmount-everything loop we do a freeze +
> thaw for each remaining filesystem, one filesystem at a time. That won't
> permanently block processes that are still writing to the filesystems (and
> why would they be?!), it will ensure that all filesystems' journals are
> fully flushed (which will make GRUB and other OSs happy), and it won't block
> the kernel from doing any kind of reboot()-time cleanups you were talking
> about earlier.

Well, I figure that might work, but it's also fricking ugly: it feels
like booking a plane ticket that includes free airplane food, just
because you are hungry: you get considerably more than just the food,
and you have to sit uncomfortably for too long, end up where you
didn't want to go, and the food is quite awful too. ("Chicken or
pasta?")
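
For reference, the freeze + thaw pass proposed in the quoted text could
look roughly like the sketch below. It rests on some assumptions: the
ioctl numbers are the values of FIFREEZE and FITHAW from <linux/fs.h>
as encoded on x86 and ARM (other architectures may encode them
differently), the function name is made up, and the "ioctl" parameter
exists only so the call sequence can be exercised without root.

```python
import fcntl
import os

# _IOWR('X', 119, int) and _IOWR('X', 120, int) as encoded on x86/ARM.
# Assumption: other architectures may use different values.
FIFREEZE = 0xC0045877
FITHAW = 0xC0045878

def freeze_then_thaw(mount_point, ioctl=fcntl.ioctl):
    """Freeze the file system mounted at mount_point, then thaw it
    again immediately.

    The real ioctls need root and a file system that supports
    freezing. The "ioctl" parameter is a hypothetical hook for
    substituting a stand-in during testing.
    """
    fd = os.open(mount_point, os.O_RDONLY | os.O_DIRECTORY)
    try:
        # If the freeze fails, an exception propagates and there is
        # nothing to thaw.
        ioctl(fd, FIFREEZE, 0)
        # Thaw right away so that writers are not blocked for long.
        ioctl(fd, FITHAW, 0)
    finally:
        os.close(fd)
```

This would be called once per remaining file system, one at a time,
after the unmount loop has finished.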

I'd prefer if the file system folks would simply provide sane
semantics here, and provide an fsync()-style syscall or ioctl that
does what is needed here.

Lennart

-- 
Lennart Poettering, Red Hat

