[systemd-devel] more verbose debug info than systemd.log_level=debug?
Michael Chapman
mike at very.puzzling.org
Mon Apr 10 10:49:49 UTC 2017
On Mon, 10 Apr 2017, Lennart Poettering wrote:
> On Mon, 10.04.17 19:38, Michael Chapman (mike at very.puzzling.org) wrote:
>
>> On Mon, 10 Apr 2017, Lennart Poettering wrote:
>>> On Mon, 10.04.17 18:45, Michael Chapman (mike at very.puzzling.org) wrote:
>>>
>>>> On Mon, 10 Apr 2017, Lennart Poettering wrote:
>>>>> On Sun, 09.04.17 10:11, Michael Chapman (mike at very.puzzling.org) wrote:
>>>>>
>>>>>> Don't forget, they've provided an interface for software to use if it needs
>>>>>> more than the guarantees provided by sync. Informally speaking, the FIFREEZE
>>>>>> ioctl is intended to place a filesystem into a "fully consistent" state, not
>>>>>> just a "fully recoverable" state. (Formally it's all a bit hazy: POSIX
>>>>>> really doesn't guarantee anything with sync.)
>>>>>
>>>>> FIFREEZE does considerably more than what you suggest: it also pauses
>>>>> all further changes until FITHAW is called. And that's semantics we
>>>>> really cannot have.
>>>>
>>>> If systemd is just about to call reboot(2), why does it matter?
>>>
>>> Well, in the general case we don't actually call reboot(), because we
>>> instead transition back into the initrd, which then eventually calls
>>> that. At least that's what happens on the major general purpose
>>> distros that have an initrd that does that (for example: Fedora/RHEL
>>> with Dracut).
>>
>> If it's not systemd _inside_ the initrd calling reboot(2), then there's
>> nothing systemd can do about it.
>
> The initrd usually doesn't run a systemd environment anymore, PID 1 is
> usually a shell script of some kind. It might use our "reboot" binary
> and call it with "-ff" (which means it's really just a pure reboot()
> wrapper), but we don't do a umount/kill/detach spree in that case. Or
> in other words: if they do use our reboot utility then they use it in
> pure sysvinit compat mode, where it won't do more than sync() +
> reboot(), exactly the same way as sysvinit did.

OK, given that, there's really no point in pursuing this from the systemd
end.

>>> I am sorry, but just making all accesses hang is just broken. That
>>> can't work.
>>>
>>>> I do think we should attempt to remount readonly before doing the FIFREEZE.
>>>> I thought systemd did that, but it appears that it does not. A readonly
>>>> remount will do what we want so long as no remaining processes have any
>>>> files opened for writing on the filesystem. The FIFREEZE would only be
>>>> necessary when the remount fails.
>>>
>>> We remount everything read-only we can if we cannot unmount
>>> something.
>>
>> Ah, I see the code for that now. I was looking for something after the
>> umount call (specifically, if umount failed), not before.
>
> Well, the scheme works like this: we kill, umount, remount, detach in
> a loop until nothing changes anymore. It's a primitive but robust way
> to deal with stacked storage, where running processes might pin file
> systems, which might in turn pin devices, which might in turn pin
> backend userspace services, and so on... Hence, yes, we do the umount
> first, and the remount second, but then we'll try another umount again
> and another remount, until this stops being fruitful.
>
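
As a rough illustration of the loop described above (and explicitly not
systemd's actual code), the idea of repeating kill, umount, remount and
detach until a full pass changes nothing could be sketched as follows.
Every name here (shutdown_step, run_until_stable, struct toy and the toy_*
steps) is made up for the example, and the toy steps only model how one
layer pins the next. The max_passes guard is also an assumption, added so
the sketch cannot spin forever.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical step signature: returns true if the step changed anything. */
typedef bool (*shutdown_step)(void *state);

/* Run every step in order, over and over, until one complete pass
 * changes nothing (or max_passes is reached). Returns the number of
 * passes that were executed. */
int run_until_stable(const shutdown_step *steps, size_t n_steps,
                     void *state, int max_passes)
{
    int passes = 0;
    bool changed = true;

    while (changed && passes < max_passes) {
        changed = false;
        for (size_t i = 0; i < n_steps; i++)
            if (steps[i](state))
                changed = true;
        passes++;
    }
    return passes;
}

/* Toy model of stacked storage: processes pin mounts, mounts pin devices. */
struct toy {
    int procs;
    int mounts;
    int devices;
};

/* Kill all remaining processes at once. */
bool toy_kill(void *state)
{
    struct toy *t = state;
    if (t->procs == 0)
        return false;
    t->procs = 0;
    return true;
}

/* Unmount one filesystem per pass, but only once no process pins it. */
bool toy_umount(void *state)
{
    struct toy *t = state;
    if (t->mounts == 0 || t->procs > 0)
        return false;
    t->mounts--;
    return true;
}

/* Placeholder for the read-only remount step; the toy model has nothing
 * to remount, so it never reports a change. */
bool toy_remount(void *state)
{
    (void)state;
    return false;
}

/* Detach one device per pass, but only once no mount pins it. */
bool toy_detach(void *state)
{
    struct toy *t = state;
    if (t->devices == 0 || t->mounts > 0)
        return false;
    t->devices--;
    return true;
}
```

The point of the fixed-point shape is that no step needs to know about the
others' dependencies: a step that is still blocked simply reports "no
change" and gets another chance on the next pass.
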
>>> But do note that we can't do that in all cases. Most
>>> prominently: consider a process that is running from an executable
>>> that has been updated on disk (specifically: whose binary got deleted
>>> because it was replaced by a newer version). This process will keep
>>> the file pinned, and will block all read-only remounts, as the kernel
>>> wants to mark the file properly deleted first, but it can't since the
>>> process is keeping it pinned.
>>>
>>> This is specifically the case that happened for Plymouth: the binary
>>> probably got updated, hence the process in memory references a deleted
>>> file, which blocks the read-only remounting, in which case we can't do
>>> anything, and sync and remount.
>>
>> OK, so how about this. _After_ the unmount-everything loop we do a freeze +
>> thaw for each remaining filesystem, one filesystem at a time. That won't
>> permanently block processes that are still writing to the filesystems (and
>> why would they be?!), it will ensure that all filesystems' journals are
>> fully flushed (which will make GRUB and other OSs happy), and it won't block
>> the kernel from doing any kind of reboot()-time cleanups you were talking
>> about earlier.
>
> Well, I figure that might work, but it's also fricking ugly: it feels
> like booking a plane ticket that includes free airplane food, just
> because you are hungry: you get considerably more than just the food,
> and you have to sit uncomfortably for too long, end up where you
> didn't want to go, and the food is quite awful too. ("Chicken or
> pasta?")
>
> I'd prefer if the file system folks would simply provide sane
> semantics here, and provide an fsync()-style syscall or ioctl that
> does what is needed here.

Perhaps we might be able to convince them to make reboot() a full "unmount
all remaining filesystems" operation. To be honest, I'm a little
surprised it isn't... but I suppose it's got all the same problems with
ordering between filesystems within the kernel itself.
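
For reference, the freeze + thaw idea quoted earlier in this message could
look roughly like the sketch below. It is only an illustration under some
assumptions: freeze_thaw is a hypothetical helper name, the caller is
assumed to have CAP_SYS_ADMIN, and the filesystem is assumed to support
freezing at all (otherwise the ioctl fails). FIFREEZE and FITHAW themselves
come from <linux/fs.h>.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>   /* FIFREEZE, FITHAW */
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: freeze the filesystem mounted at `mountpoint`,
 * then thaw it again straight away. The freeze forces pending changes
 * out to disk; the immediate thaw means writers are only paused briefly.
 * Returns 0 on success, -1 on failure with errno set. */
int freeze_thaw(const char *mountpoint)
{
    int fd = open(mountpoint, O_RDONLY);
    if (fd < 0)
        return -1;

    int r = ioctl(fd, FIFREEZE, 0);
    if (r == 0) {
        /* If this thaw fails the filesystem stays frozen, so a real
         * caller would want to log that loudly. */
        r = ioctl(fd, FITHAW, 0);
    }

    int saved_errno = errno;    /* close() may overwrite errno */
    close(fd);
    errno = saved_errno;

    return r < 0 ? -1 : 0;
}
```

A caller would run this once per remaining mount point, one filesystem at
a time, after the unmount loop has finished.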